The last article in this series discussed data filtering and why it is necessary both to ensure the quality of analytics and to protect the system from intentional or unintentional traffic floods. The discussion of the Mesh Scope implementation tried to balance optimistic processing and flagging of information with rejecting data only as a last resort. The prior post in the series is available at:
https://www.lemuridaelabs.com/post/meshtastic-mqtt-series---data-filtering
Within the Mesh Scope processing pipeline, data is enriched at receive time so that additional useful context is available to both users and the system when calculating aggregates and other analytics. The IoT data received from Meshtastic is generally simple in nature, so the focus here will be on enriching the spatial information associated with the data reports.
The goal of this article is to provide insight into the real-time augmentation of data flowing into a system. By considering this process, developers can envision additional context or information that may be beneficial in data processing activities. By merging context at the time of data ingestion, more complex pipelines can be built to route, process, and alert on a range of data considerations.
Data Enrichment Overview
The goal of enriching data, especially streaming data, is to augment and contextualize incoming data to better meet the needs of a system. This may be through incorporating other external information, cross referencing it with internal details, or calculating new information based on various algorithms.
Data enrichment in general can bring other benefits as well. In a consumer-facing system, enrichment may also aid processes such as customer insights and personalization, risk management, and other business goals.
Ultimately, the data received through interfaces, partners, and other processes carries a limited amount of detail and context, and this limits the ability of systems to incorporate, react, and respond to it. Enriching and augmenting data is the first step in turning raw data into actionable information.
Mesh Scope Data Enrichment
As stated above, data enrichment is simply adding or computing new information from the data received, in order to process it better, provide more context, or better incorporate it into your local data model. The transformation process discussed earlier is the first step, ensuring that the information has been adjusted and converted into a structure better suited to the local application. Before processing the data, however, we will seek to augment it to aid in future analytics.
The primary enrichment performed in the Mesh Scope system is the augmentation of raw data received from the MQTT Meshtastic network with additional geospatial reference information, such as country, city, and other details obtained by reverse geocoding. This information gives Lemuridae Labs the ability to aggregate data regionally for comparing activities, data flow, or other metrics. It also lays the foundation for additional features and activities discussed in a later post.
Geohash Coordinate Groupings
The processing of the geospatial data within the Meshtastic messages starts with a bucketing of coordinates into higher-level positions. For the Mesh Scope application, detailed coordinates are first rolled up to a grid coordinate model using the Geohash algorithm. This rollup allows Mesh Scope to process the geospatial characteristics of a position once for a given area, with the details of the Geohash encoding determining the size of the area.
Geohash is an encoding algorithm for converting precise latitude and longitude combinations into a string representation, where the length of the string determines the size of the area covered. A Geohash string of 4 characters covers an area of roughly ±20 km, meaning all positions in that area generate the same 4-character hash; essentially, the hash becomes a geospatial bucket for those positions. As characters are added, 5 characters covers roughly ±2.4 km, and 6 characters roughly ±0.61 km. A useful property of Geohash is that it allows an application to group coordinates and optimize data processing activities.
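To make the bucketing concrete, here is a minimal Geohash encoder sketch in Python using only the standard library. The function name is illustrative, not part of Mesh Scope's actual code; in practice a library such as pygeohash provides the same capability.

```python
# Standard Geohash base32 alphabet (omits a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    """Encode a lat/lon pair into a Geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    even = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid   # narrow to the upper half
        else:
            bits.append(0)
            rng[1] = mid   # narrow to the lower half
        even = not even
    # Pack each group of 5 bits into one base32 character.
    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for b in bits[i:i + 5]:
            value = value * 2 + b
        chars.append(BASE32[value])
    return "".join(chars)
```

Using the well-known reference point from the Geohash documentation, `geohash_encode(57.64911, 10.40744, 6)` yields `"u4pruy"`, and truncating the precision to 4 widens the bucket to `"u4pr"`: nearby coordinates collapse into the same hash.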
A reference overview of Geohash, its implementation, and its goals is found at:
https://en.wikipedia.org/wiki/Geohash
Note that there are a variety of other algorithms, such as H3 (https://h3geo.org/), with other very interesting and powerful properties. Although Geohash was well suited to the requirements of Mesh Scope, alternatives should be considered when evaluating global grid systems.
Now, why go through a bucketing process? The goal is simply to reduce repeated geospatial data retrievals and other activities for positions that are essentially at the same logical location. Two coordinates may be tens of meters apart, but for our purposes they are equivalent: they are likely in the same country, city, and other features we use to augment the data.
The other reason for this process is that position reports received via the Meshtastic MQTT broker are already blurred: the broker enforces a reduced position resolution on the public channels. Working with a very precise position is therefore somewhat wasted effort when the source position already has a reduction in accuracy built in.
Geospatial Normalization
An example of the conversion and normalization of various points in a small area into a Geohash is shown below. In this case, the points are all logically similar: as Mesh Scope encodes the various positions into a 6-character Geohash, they all map to the same record. The system then reverses the Geohash encoding to obtain the coordinate at the center of the cell, and uses this for the geospatial data lookups.
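The "reverse" step described above can be sketched as a Geohash decoder that returns the center of the cell. This is a self-contained illustration of the standard decoding algorithm, not Mesh Scope's actual code.

```python
# Standard Geohash base32 alphabet (omits a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_decode_center(gh):
    """Decode a Geohash string and return the (lat, lon) of the cell center."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    even = True  # bits alternate, starting with longitude
    for ch in gh:
        idx = BASE32.index(ch)
        for shift in range(4, -1, -1):       # 5 bits per character, MSB first
            bit = (idx >> shift) & 1
            rng = lon_range if even else lat_range
            mid = (rng[0] + rng[1]) / 2
            if bit:
                rng[0] = mid   # upper half
            else:
                rng[1] = mid   # lower half
            even = not even
    # The cell center is the midpoint of the final lat/lon ranges.
    return ((lat_range[0] + lat_range[1]) / 2,
            (lon_range[0] + lon_range[1]) / 2)
```

Decoding `"u4pruy"` returns a point within a few hundred meters of any coordinate that encoded to that hash, which is exactly the precision needed for a city/country-level lookup.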
With this process, Mesh Scope is able to reduce the data lookups and storage required to augment the received coordinates. Although technically each individual position could be processed and evaluated, the underlying system requirements should be understood: for Mesh Scope, this would dramatically increase processing work and overhead without adding any value to the augmentation being performed. It is always important to align data processes and activities with the fundamental goals, and here the goal is simply enough precision to identify city, state/province, and country.
Geospatial Lookup
Next, using the normalized coordinates from the prior step, Mesh Scope uses an internal service to determine if additional data is available. If no information is yet defined for the coordinates in the system, the application will enqueue a lookup from an external data provider. Lemuridae Labs works with various geospatial data providers or uses internal systems depending on requirements, and in this case leverages data processed from OpenStreetMap (OSM). The OSM data is high quality and meets the needs of this process.
It is possible that some positions may not resolve to the full position detail desired, such as a point in an ocean or some other remote area. The application's threshold is simply that an accurate reverse geocoding has been performed and a successful response has been processed.
When the reverse geocoding process has completed, the result is stored in the Mesh Scope database and merged into the appropriate records.
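The check-cache-or-enqueue flow described above can be sketched as follows. The class and method names are illustrative assumptions, not Mesh Scope's internal API; the point is the pattern of serving cached geodata per Geohash cell and deferring misses to a queue.

```python
import queue

class GeoEnrichmentCache:
    """Sketch of a lazy per-cell lookup: serve cached geodata for a
    Geohash cell, or enqueue a reverse-geocode job on a miss."""

    def __init__(self):
        self._cache = {}                # geohash cell -> geodata dict
        self._pending = queue.Queue()   # cells awaiting external lookup

    def lookup(self, cell):
        """Return cached geodata for a cell, or None after enqueueing it."""
        if cell in self._cache:
            return self._cache[cell]
        self._pending.put(cell)         # lazy: resolve later, off the hot path
        return None

    def resolve_pending(self, reverse_geocode):
        """Drain the queue, calling the provided reverse-geocode function
        once per unresolved cell and storing the results."""
        while not self._pending.empty():
            cell = self._pending.get()
            if cell not in self._cache:
                self._cache[cell] = reverse_geocode(cell)
```

Because every position in a cell shares one hash, the external provider is queried at most once per cell, no matter how many position reports land inside it.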
Data Enrichment
In many applications, the enriched data is needed at the moment the data is received and still active in the processing pipelines. For Mesh Scope this is not the case, and the data can be lazily retrieved and processed. The data aggregates and enriched results are not a real-time component of the system, so waiting a few seconds for the geospatial process to complete is not a bottleneck in the pipeline.
In other systems, this augmentation and enrichment must be completed before processing continues. This can lead to back pressure, and a more complex queue processing strategy with concurrent processors needs to be considered to avoid blocking the system while performing lookups.
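One common mitigation for such blocking lookups, sketched below with Python's standard thread pool, is to fan a batch of cell lookups out across a small set of concurrent workers. The function and parameter names are illustrative, not taken from Mesh Scope.

```python
from concurrent.futures import ThreadPoolExecutor

def enrich_batch(cells, reverse_geocode, max_workers=4):
    """Resolve a batch of Geohash cells concurrently so that slow
    external lookups overlap rather than serialize (illustrative sketch).

    Returns a dict mapping each cell to its reverse-geocode result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so zip pairs each cell
        # with its own result.
        return dict(zip(cells, pool.map(reverse_geocode, cells)))
```

This keeps a handful of I/O-bound lookups in flight at once; sizing the pool against the provider's rate limits is the main design consideration.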
Summary
This article discussed the goals of data enrichment and considerations when designing the process. It highlighted some benefits of a well-thought-out enrichment and augmentation process, and discussed targeted optimizations to minimize the work required. Techniques such as geospatial bucketing with various algorithms can benefit these workflows, and future articles will discuss some of the many other uses of technologies such as Geohash and H3.
The next article in this series will discuss the processing of this data within the Mesh Scope application, and how Lemuridae Labs builds on it to deliver interesting new capabilities.