The Geography of Place Names in the United States
ISBN 978-85-88783-11-9
Authors
Karimzadeh, M. (1)
(1) Pennsylvania State University. Email: mortezakz@gmail.com
Abstract
Automatic identification and geolocation of place names in text, also known as geoparsing, is a fundamental step in enabling spatial queries and spatiotemporal analysis over textual data sources such as web content, social media posts, news stories, and historical archives. Resolving place names to their corresponding geographic coordinates, however, is challenging because many different places share the same name. Various place name disambiguation (also known as toponym resolution) algorithms leverage spatial context, such as other place names used in the same document, to resolve ambiguous place names to their correct referents. Most such algorithms use ranking schemes based on population or geographic prominence to anchor the less ambiguous place names, such as Paris (the capital of France), and then use those anchors to disambiguate the more ambiguous ones. While such approaches usually achieve relatively high precision and recall, this performance can be attributed largely to the fact that most documents discuss the more prominent places. Many of the place names mistakenly assumed to be less ambiguous are in fact ambiguous themselves, for example Paris, France versus Paris, Texas, United States. Finding documents that refer to Paris, Texas or London, Ohio can therefore be a challenge in a system whose place name disambiguation method relies heavily on spatial context and other “prominent” place names. It is thus vital to understand the spatial pattern of ambiguous place names in order to achieve true spatial search and analysis over textual documents, without attributing documents to highly populated places instead of smaller places with similar names. In this paper, we examine the spatial pattern of ambiguous place names in the United States using the GeoNames database of geographic names, and we provide detailed statistics on the spatial distribution of such names in each state.
The insights gained from our study have important implications for the design of Geographic Information Retrieval (GIR) systems that better preserve textual references to relatively small places whose ambiguous names match those of better-known, highly populated places. Such GIR systems enable a more realistic spatial analysis of textual documents such as social media posts, news articles, historical archives, and scientific documents.
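To make the two ideas in the abstract concrete, the following minimal Python sketch shows (a) a prominence-only baseline that always resolves a name to its most populous candidate, and (b) a per-state tally of names that also denote at least one other place. It is an illustration under stated assumptions, not the method used in this study: the `Place` record, the function names `resolve_by_population` and `ambiguous_names_by_state`, and the toy gazetteer with rounded population figures are all hypothetical and are not drawn from GeoNames.

```python
from collections import defaultdict
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    country: str     # ISO country code, e.g. "US"
    admin1: str      # state or first-level region code (illustrative values)
    population: int


# Toy gazetteer. Populations are rounded, illustrative figures, not GeoNames data.
GAZETTEER = [
    Place("Paris", "FR", "IDF", 2_100_000),
    Place("Paris", "US", "TX", 25_000),
    Place("Paris", "US", "TN", 10_000),
    Place("London", "GB", "ENG", 8_900_000),
    Place("London", "US", "OH", 10_000),
    Place("Springfield", "US", "IL", 114_000),
    Place("Springfield", "US", "MO", 169_000),
    Place("Austin", "US", "TX", 960_000),
]


def resolve_by_population(name, gazetteer):
    """Prominence-only baseline: return the most populous candidate, or None."""
    candidates = [p for p in gazetteer if p.name == name]
    return max(candidates, key=lambda p: p.population, default=None)


def ambiguous_names_by_state(gazetteer, country="US"):
    """Map each state to the sorted names that also denote another place."""
    # Group every gazetteer entry by its name.
    referents = defaultdict(list)
    for place in gazetteer:
        referents[place.name].append(place)

    # A name counts as ambiguous here if it has two or more referents anywhere.
    by_state = defaultdict(set)
    for name, places in referents.items():
        if len(places) < 2:
            continue
        for place in places:
            if place.country == country:
                by_state[place.admin1].add(name)
    return {state: sorted(names) for state, names in sorted(by_state.items())}


if __name__ == "__main__":
    # The baseline never selects Paris, Texas, whatever the document is about.
    print(resolve_by_population("Paris", GAZETTEER))
    for state, names in ambiguous_names_by_state(GAZETTEER).items():
        print(state, len(names), names)
```

In this toy setting the baseline resolves every mention of "Paris" to the French capital, so a document about Paris, Texas would be misattributed, while the per-state tally makes visible which states contain places whose names are shared with other places. Exact string matching is a simplification; a real gazetteer would also require handling of alternate names and feature types.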