A Deeper Look at Premise’s Place Harmonizer Algorithm


By Lou Paladino, Tim Schwuchow, Arni Sumarlidason | Director of Data and Analytics, Data Science Lead, Full-stack Senior Software Engineer

Premise Data has a global network of Contributors who are paid to provide ground-level insights through Premise’s mobile app. As of January 2020, over 1.4 million local Contributors are paid to answer surveys and map places across the globe. On average, Contributors submit hundreds of thousands of responses to “tasks” per day. Many of those submissions involve location-based discovery of places such as restaurants, health facilities, and schools. The data is used to better understand economic development and populations’ access to health services. Such projects are of particular importance to Premise’s clients, including The Bill & Melinda Gates Foundation, USAID, and many others.

A single facility is sometimes described by hundreds of submissions. To process them, Premise’s data scientists have developed an algorithm called ‘Place Harmonizer’ (pH), which conflates the submissions for one facility into a single high-confidence, authoritative location.

Methods

Premise’s Contributors submit three pieces of data that enable Premise’s data scientists to algorithmically conflate many data points for the same location down to one authoritative location:

  1. Geographic coordinate (latitude and longitude)
  2. The name of the place
  3. A photo of the place

Geographic Coordinate

Example of how one would record a location within the Premise app

First, spatial clustering of latitude and longitude is used to group submissions together. This is aligned with Waldo Tobler’s First Law of Geography, which states that “everything is related to everything else, but near things are more related than distant things.” 

A maximum search distance for submission points (latitude, longitude) is defined for each type of facility. For example, the radius within which submitted locations are grouped is smaller for pharmacies than for hospitals. Generally speaking, hospitals occupy a larger area of land and are spaced farther apart across a landscape. Pharmacies tend to occupy a smaller footprint and are spaced closer together.
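The grouping step can be illustrated with a minimal sketch. This is not Premise’s implementation: the radius values, the `MAX_RADIUS_M` table, and the function names are hypothetical, and simple single-linkage grouping with haversine distance stands in for whatever clustering method pH actually uses.

```python
import math

# Hypothetical per-facility-type search radii in metres (illustrative values only).
MAX_RADIUS_M = {"pharmacy": 50.0, "hospital": 300.0}


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def cluster_points(points, facility_type):
    """Group point indices so that points within the radius share a cluster."""
    radius = MAX_RADIUS_M[facility_type]
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of points that lies within the search radius.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_m(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Three submissions along the equator, roughly 22 m and 222 m from the first.
submissions = [(0.0, 0.0), (0.0, 0.0002), (0.0, 0.002)]
pharmacy_clusters = cluster_points(submissions, "pharmacy")
hospital_clusters = cluster_points(submissions, "hospital")
```

With these sample points, the smaller pharmacy radius keeps the third submission in its own cluster, while the larger hospital radius merges all three. Single-linkage grouping can chain distant points together through intermediate ones, so a production system would likely need a stricter rule.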

Name

Sample name submission for a place within the Premise app

Second, once a spatial cluster is defined, Premise’s place harmonization algorithm analyzes the facility names submitted by Premise Contributors within that cluster. This step lets the algorithm determine whether submissions have similar names and should stay clustered, or have different names and should be split into separate clusters, one per facility name.

Because Premise Contributors type the name of the facility manually on a smartphone keyboard, the algorithm must sort out a wide range of names that might describe the same facility. To handle this variation, Premise’s data scientists have modified the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.

This modification makes the algorithm language-agnostic: it works not only when submissions are in English but also in the 28 other languages in which our Contributors submit data (e.g. Arabic, Swahili, Tagalog).
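The article does not describe how TF-IDF was modified, so the following is only an illustrative sketch of one common language-agnostic variant: TF-IDF over character trigrams, compared with cosine similarity. The function names and the choice of trigrams are assumptions, not Premise’s method.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Split a name into overlapping character n-grams (no word tokenizer needed)."""
    text = f" {name.lower().strip()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(names, n=3):
    """Return one sparse TF-IDF vector (dict of n-gram -> weight) per name."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    # Document frequency: in how many names each n-gram appears.
    df = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF so n-grams shared by every name keep a non-zero weight.
        vectors.append({
            gram: tf * (math.log((1 + total) / (1 + df[gram])) + 1.0)
            for gram, tf in doc.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Hypothetical names submitted within one spatial cluster.
names = ["City Pharmacy", "city pharmcy", "Hope Clinic"]
vecs = tfidf_vectors(names)
typo_similarity = cosine(vecs[0], vecs[1])
other_similarity = cosine(vecs[0], vecs[2])
```

Here the misspelled "city pharmcy" scores much closer to "City Pharmacy" than "Hope Clinic" does, which is the signal a name-based split would rely on. Because the features are raw character sequences, the same code runs unchanged on names in any script.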

Photograph

Sample photo submission within the Premise app

Third, Contributors submit a photo of the outside of the facility. These photos are extremely valuable because they show the up-to-date condition of a place. They are also used to visually validate that the place matches what our Contributors submitted.

Premise’s data scientists have also applied machine learning optical character recognition (OCR) tools to the photos to automatically extract all text visible in them. That text is then used to help validate the name of the place that our Contributors submit. Most importantly, the photos serve as a final verification that multiple submissions should be joined into one authoritative place.
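As a rough illustration of the name validation step, the sketch below scores how well a submitted name is supported by text already extracted from a photo. The OCR step itself is not shown, and the function name, the fuzzy token matching, and the sample strings are all assumptions rather than a description of Premise’s pipeline.

```python
from difflib import SequenceMatcher


def name_support_score(submitted_name, ocr_text):
    """Average, over name tokens, of the best fuzzy match among OCR tokens (0 to 1)."""
    name_tokens = submitted_name.lower().split()
    ocr_tokens = ocr_text.lower().split()
    if not name_tokens or not ocr_tokens:
        return 0.0
    best_matches = [
        max(SequenceMatcher(None, token, candidate).ratio() for candidate in ocr_tokens)
        for token in name_tokens
    ]
    return sum(best_matches) / len(best_matches)


# Hypothetical text extracted from a storefront photo.
sign_text = "CITY PHARMACY Open 24 hours"
matching_score = name_support_score("City Pharmacy", sign_text)
mismatch_score = name_support_score("Hope Clinic", sign_text)
```

A submitted name that appears on the sign scores at or near 1, while an unrelated name scores lower. Any acceptance threshold would need to be tuned on real data.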

Conclusion

Crowdsourced data continues to prove essential to discovering and ground-truthing important places around the world. Premise considers the places conflated by our internally developed Place Harmonizer algorithm foundational to knowing where facilities are located and to mapping places comprehensively across a landscape.

If you want to learn more about how you can deploy Premise for your organization, please contact info@premise.com or visit our website, www.premise.com.

About Lou Paladino, Tim Schwuchow, Arni Sumarlidason

Lou Paladino is the Director of Data and Analytics at Premise Data. He is a geographer by trade and has spent the bulk of his career incubating programs centered around geospatial open-source data, fusing it with client data to pull out actionable insights. Lou spends time outside of work trying to keep up with his kids while playing backyard sports.

Tim Schwuchow is a staff data science lead at Premise, where he develops projects that apply machine learning and other statistical techniques to minimize fraud, assure data quality, and synthesize data generated by Premise’s platform into products for end-users. Prior to joining Premise, he held positions in data science education, government, and the financial sector. When he’s not coding, he tries to avoid middle altitudes by spending as much time as possible near the mountains or coasts.

Arni Sumarlidason is a full-stack senior software engineer. Arni excels at engineering auto-scalable solutions that process millions of geospatial data points. Additionally, he uses a game development background to develop and successfully deploy geospatial-focused UI/UX platforms in client spaces. He is an exercise freak and gets his best ideas while in the pool or at mile 20 on a bicycle.