In a previous post on Meshtastic data processing, we explored data transformation: how information received by the Lemuridae Labs systems is converted into clean data for processing. Extending this, we will investigate the data filtering performed to reduce the impact that improper data can have on processing and analytics. The prior post is available at:
https://www.lemuridaelabs.com/post/meshtastic-mqtt-series---data-transformation
Within the Lemuridae Labs processing pipeline, several filtering actions are taken to ensure that the system is properly receiving and evaluating information. Although not complex, these steps are essential to avoid negative impacts on a system, especially one receiving data from uncontrolled external sources. This post reviews several filtering techniques related to data content validation, message flood identification, and blocklist checking.
The goal of this article is to provide insight into these filtering techniques. By understanding them, developers working with MQTT-based IoT networks, and Meshtastic in particular, can ensure that their applications are not unduly impacted by poor-quality devices, bugs, or malicious behavior.
Data Filtering Overview
As noted, when information is received by the Lemuridae Labs services, some level of filtering and validation is performed prior to processing. Data can be syntactically valid but semantically invalid: for example, a field defined as an age could receive a well-formed integer whose value is negative. This filtering process ensures that only reasonably valid information is processed, and that uncontrolled data streams or floods don't impact the system.
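As a brief illustration of the syntactic/semantic distinction, here is a minimal sketch; the field and the bounds chosen are hypothetical examples, not part of the Lemuridae Labs pipeline:

```python
def validate_age(raw_value: str) -> int | None:
    """Return a usable age, or None if the value should be discarded."""
    try:
        age = int(raw_value)  # syntactic check: is the value an integer at all?
    except ValueError:
        return None
    if not 0 <= age <= 130:  # semantic check: is the integer a plausible age?
        return None
    return age
```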
Flood Detection
Before the Mesh Scope application processes information in depth, it checks whether the sender is well behaved and acting within expected system limits. For this validation, Lemuridae Labs keeps a sliding-window count of data reports from each source and consults it when new messages are received. Through this sliding-window process, the system looks for surges in messages that indicate a poorly configured or malicious host, and will automatically block further processing of that sender's messages.
In the Mesh Scope implementation, the application is configured with 60-second windows and sums the counts across the prior 5 windows to determine whether a specific source has exceeded the configured limits. If so, the source is placed on a temporary block list for a defined duration. Once the duration expires, the block is automatically removed; if the sender has continued flooding the system, however, the block is automatically re-added and the sender blocked once again.
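A minimal sketch of this sliding-window approach follows. The 60-second windows and 5-window history come from the description above, while MAX_MESSAGES and BLOCK_SECONDS are assumed placeholder values, not Lemuridae Labs' actual limits:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60          # width of each counting window (from the post)
WINDOW_COUNT = 5             # number of recent windows considered (from the post)
MAX_MESSAGES = 500           # per-source limit across the windows (assumed value)
BLOCK_SECONDS = 15 * 60      # temporary block duration (assumed value)

# node_id -> deque of (window_start, count) pairs
counters = defaultdict(deque)
# node_id -> unix time at which the temporary block expires
blocked_until = {}

def is_flooding(node_id: str, now: float | None = None) -> bool:
    """Count a message from node_id and report whether it should be blocked."""
    now = now if now is not None else time.time()

    # An unexpired block short-circuits all further processing.
    if blocked_until.get(node_id, 0) > now:
        return True

    windows = counters[node_id]
    window_start = now - (now % WINDOW_SECONDS)

    # Start a new window or increment the current one.
    if windows and windows[-1][0] == window_start:
        windows[-1] = (window_start, windows[-1][1] + 1)
    else:
        windows.append((window_start, 1))

    # Drop windows that have slid out of the evaluation range.
    oldest_allowed = window_start - (WINDOW_COUNT - 1) * WINDOW_SECONDS
    while windows and windows[0][0] < oldest_allowed:
        windows.popleft()

    # Over the limit: place the sender on the temporary block list.
    if sum(count for _, count in windows) > MAX_MESSAGES:
        blocked_until[node_id] = now + BLOCK_SECONDS
        return True
    return False
```

Bucketed windows like these trade a little precision at window boundaries for constant memory per sender, which matters when tracking thousands of nodes.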
This process works well to handle the dynamic nature of clients, and protects a client from being blocked long-term due to a misconfiguration or other problem.
In practice, several nodes on the Meshtastic MQTT broker have been observed to behave rather poorly, sending thousands of packets a minute and repeating the same information over and over. Such a flood affects every other user within that mesh topic and places an unnecessary burden on other consumers of the Meshtastic MQTT messages.
The image above shows a portion of the time series data captured within Mesh Scope: a single node publishing more than 5,000 messages per minute while all other sources average much closer to zero.
Block List Checking
When a client is placed on the block list, whether temporarily or permanently, the system discards its packets and ceases processing them. Depending on the configuration, the messages may be captured for archival purposes or simply dropped and ignored.
Lemuridae Labs makes use of two types of block lists: one is dynamic, managed via the flood detection process described above, while the second is a static list. Although this second list is not frequently used, if a node were abusive or problematic on the mesh, the system has the ability to ban the sending node from any processing, mapping, and other data analysis activities. A sketch of this gating step is shown below.
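Continuing the earlier flood-detection sketch, a block-list gate might look like the following; STATIC_BLOCKLIST, ARCHIVE_BLOCKED, archive_packet, and the example node ID are all hypothetical names, not the actual implementation:

```python
# In practice this would be the same map the flood detector maintains.
blocked_until: dict[str, float] = {}

# Permanent bans loaded from configuration (the node ID is a made-up example).
STATIC_BLOCKLIST = {"!deadbeef"}

ARCHIVE_BLOCKED = True  # capture blocked packets instead of silently dropping them

def archive_packet(node_id: str, packet: bytes) -> None:
    """Placeholder archival hook; a real system might write to cold storage."""

def should_process(node_id: str, packet: bytes, now: float) -> bool:
    """Gate a packet against both block lists before any deeper processing."""
    blocked = node_id in STATIC_BLOCKLIST or blocked_until.get(node_id, 0) > now
    if blocked and ARCHIVE_BLOCKED:
        archive_packet(node_id, packet)
    return not blocked
```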
Data Validity Checking
For the Meshtastic IoT-style data published via the MQTT broker, much of the information is highly structured, which minimizes the opportunity for data to be malformed. Some free-text fields carry no semantic meaning to validate, while other fields are based solely on client claims and have no external validation.
One field that warrants consideration is position, transmitted as integer versions of the latitude and longitude values. Meshtastic devices can report either GNSS-based position information or manually entered positions. When a user enters a location, the system must validate it and guard against integer default values.
When converted from their integer representation, the latitude and longitude values should fall within the standard ranges: -90 to 90 for latitude and -180 to 180 for longitude. An obvious filter is to identify and discard positions that fall outside these ranges.
A secondary position filter to consider is Null Island, the location (0,0), which appears when a position is undefined and the latitude and longitude default to zero. This location is often transmitted by devices starting up before a GNSS fix is received, or when manually entered position data has not been provided. A latitude/longitude of (0,0) should therefore generally be treated as an invalid position. Details on Null Island can be found on the following page:
https://en.wikipedia.org/wiki/Null_Island
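Both checks combine into a short validity test. This sketch assumes the Meshtastic convention of encoding coordinates as signed integers in units of 1e-7 degrees:

```python
SCALE = 1e-7  # Meshtastic encodes latitude/longitude as integers of 1e-7 degrees

def valid_position(latitude_i: int, longitude_i: int) -> bool:
    """Reject out-of-range coordinates and the Null Island default."""
    lat = latitude_i * SCALE
    lon = longitude_i * SCALE

    # Out-of-range values cannot be real positions.
    if not (-90.0 <= lat <= 90.0) or not (-180.0 <= lon <= 180.0):
        return False

    # (0, 0) is almost always an unset default rather than a real fix.
    if latitude_i == 0 and longitude_i == 0:
        return False

    return True
```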
Reviewing the data actively published by nodes, hundreds of nodes are reporting a (0,0) position, while several are reporting out-of-range latitude/longitude values. In this case the nodes are still processed; in production system implementations, however, Lemuridae Labs would typically filter and flag these devices to be addressed.
Logical Data Validation
For position reports that pass the first validation steps, a logical quality check can be applied if desired. In this case, the Mesh Scope application does not filter packets that fail this logical validation; in stricter applications, however, it is worth considering. In essence, location reports that are not physically possible, or are unreasonable, should be considered suspect and potentially discarded.
This process compares the prior position and its timestamp against the newly reported location and its timestamp, to determine whether the movement implied by the sequence of reports is reasonable.
An example is a device reporting a position in London, England and, 30 seconds later, a position in Sydney, Australia. Given the speed required to move between these locations, such a report could be flagged or rejected depending on the application's data quality needs.
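A simple way to sketch this check is great-circle distance divided by elapsed time, compared against a speed ceiling; MAX_SPEED_MPS is an assumed threshold for illustration, not a value from Mesh Scope:

```python
import math

MAX_SPEED_MPS = 350.0  # assumed ceiling, roughly commercial-aircraft speed

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two lat/lon points, in meters."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def plausible_move(prev_lat: float, prev_lon: float, prev_ts: float,
                   lat: float, lon: float, ts: float) -> bool:
    """Flag reports that imply an impossible speed between consecutive fixes."""
    elapsed = ts - prev_ts
    if elapsed <= 0:
        return False  # out-of-order or duplicate timestamps are suspect too
    speed = haversine_m(prev_lat, prev_lon, lat, lon) / elapsed
    return speed <= MAX_SPEED_MPS
```

For the London-to-Sydney example, roughly 17,000 km in 30 seconds implies a speed of hundreds of kilometers per second, so the report fails any reasonable threshold.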
Summary
This article reviewed the filtering that needs to occur when dealing with external data, along with considerations for processing, flagging, or rejecting information as it is received. Different systems have different requirements, and there is no universal answer. However, steps should be taken to recognize that information may be improper, or part of a flood, as such data can skew processing and analytics if not properly identified. If nothing else, a system should actively identify data issues and make a deliberate decision on how to handle questionable information, rather than simply processing it blindly.
The next article in this series will discuss data enrichment and subsequent processing of information through different facets of the Mesh Scope application. We are always interested in feedback and questions, and are happy to provide additional details and insights in future articles. Please get in touch for any assistance we can provide!