Large Scale Similarity Queries for Geocoding

Data often have an implicit spatial aspect (e.g., a restaurant has a location even if the location is not stored in the database). Enriching data with explicit spatial references, called geocoding, adds value to the data and allows users to apply new interaction patterns, for example, visualizing the data on a map. To introduce spatial references into a data set, non-spatial attributes must be used to link the non-geocoded data to preexisting geocoded data, i.e., a join must be computed. Since the data sets joined in geocoding often originate from different sources, there may be no common key value. Computing the join is then challenging: exact join conditions (which are efficient and well studied) will fail, since data items that represent the same real-world object may differ, e.g., due to spelling mistakes or different coding conventions.
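To make the difference between an exact join and a similarity join concrete, the following minimal Python sketch joins a non-geocoded table with a geocoded one on a name attribute. It is an illustration only and not the approach developed in this project: the record names, the coordinates, the choice of Jaccard similarity over 2-grams, and the threshold of 0.6 are all assumptions made for the example.

```python
def qgrams(text, q=2):
    """Return the set of q-grams of a lower-cased string padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of the q-gram sets of two strings (0.0 to 1.0)."""
    ga, gb = qgrams(a), qgrams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def exact_join(left, right):
    """Pair records whose names are identical."""
    return [(l, r) for l in left for r in right if l["name"] == r["name"]]


def similarity_join(left, right, threshold=0.6):
    """Pair records whose names are at least `threshold` similar."""
    return [
        (l, r)
        for l in left
        for r in right
        if jaccard(l["name"], r["name"]) >= threshold
    ]


# Hypothetical records; names and coordinates are made up for illustration.
pending = [{"name": "Cafe Centrale"}, {"name": "Hotel Adler"}]
geocoded = [
    {"name": "Caffe Central", "lat": 46.50, "lon": 11.35},
    {"name": "Hotel Adler", "lat": 46.67, "lon": 11.16},
    {"name": "Pizzeria Roma", "lat": 46.49, "lon": 11.33},
]

exact_pairs = exact_join(pending, geocoded)
similar_pairs = similarity_join(pending, geocoded)

# The exact join misses the misspelled cafe; the similarity join finds it.
print(len(exact_pairs), len(similar_pairs))
```

The nested loop compares every pair of records and is therefore quadratic in the input size, which is exactly what becomes infeasible on large data volumes.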

The goal of this PhD project is to advance the state of the art in processing similarity queries on large data volumes and in integrating similarity joins into GIS-enabled relational database systems like PostGIS or Oracle Spatial. Building advanced support for similarity queries into relational database systems has been an active research topic in recent years. New query primitives and algorithms that are integrated into the kernel of the database system allow users to conveniently execute queries that involve a mixture of similarity predicates, spatial predicates, and (traditional) exact predicates, thus greatly enhancing the query capabilities of the system. In addition to the conceptual development of new query primitives, new algorithms must be designed and implemented. The integration of new operators into database systems requires cardinality and cost estimates, reordering rules with other operators, and memory-aware techniques. The outcomes of this PhD project are new relational operators that advance the geocoding capabilities of existing systems, as well as a prototype implementation in PostgreSQL.