IoT streaming data can grow quite large over time, and a system must decide what information to process, transform, and save based on business and operational requirements. Much of the data is essentially ephemeral and can be discarded after processing; other data is short lived, so long-term archiving is not a concern. Sometimes, however, a clear archive of key data is required, and maintaining that information in a way that is both cost effective and performant is a challenge. Simply dumping unbounded data into a standard relational database can work initially, but over time performance degrades while operational costs increase. Alternate strategies of exporting data into static file formats can be more cost effective, but they introduce a challenge when the information actually needs to be used again. The burden tends to shift between data availability and data management, making the total cost of ownership difficult to control.
Iceberg
Apache Iceberg is an interesting solution that can address both of these issues at once. Iceberg is an open table format, meaning it provides a capability to describe and index data stored on secondary storage, such as disks and cloud object storage. Storing information on this secondary storage, rather than in a primary relational database, is much cheaper and avoids the operational overhead of managing very large relational databases. It also provides the ability to segment the "hot" transactional data found in a system database from the archival data stored in an Iceberg table.
Although not exhaustive, the quoted summary below provides a quick, high-level overview of some of the interesting features of Apache Iceberg. We will then apply a subset of these features to the IoT data streams being processed.
Apache Iceberg is an open-source table format designed to bring the reliability and simplicity of SQL tables to large-scale analytic data lakes. It enables users familiar with SQL to build and manage data lakes without needing new languages, supporting ACID transactions, schema evolution, and time travel capabilities. Iceberg organizes data files and metadata through snapshots that capture the full state of a table at any point in time, allowing multiple query engines to work on the same datasets in a consistent and isolated manner. This architecture supports features like hidden partitioning, partition evolution, and fast query planning, making it highly efficient for handling huge datasets with complex, evolving schemas.
From a data management and total cost of ownership (TCO) perspective, Apache Iceberg offers significant advantages. Its support for ACID transactions and snapshot isolation ensures data consistency and reliability, reducing errors and operational overhead. Schema and partition evolution allow organizations to adapt their data models quickly without costly rewrites or downtime, accelerating development cycles. Iceberg’s metadata management optimizes query performance by pruning unnecessary data, which minimizes compute resource consumption and lowers cloud storage costs. Furthermore, its open architecture and compatibility with multiple file formats and query engines provide flexibility to avoid vendor lock-in and leverage existing infrastructure. These features collectively reduce complexity, improve governance with versioning and rollback capabilities, and enable scalable, cost-effective management of large analytic datasets.
As noted above, Iceberg manages the metadata that allows stored data to be queried and processed, making information easy to store and also easy to retrieve when needed. This effectively addresses both ends of the cost challenge described earlier: storage remains reasonably low cost, while the data stays readily accessible.
With Iceberg providing the table format and data indexing capability, a mechanism for actually storing the data is still required. A traditional database performs all of these functions at once, but Iceberg offers additional flexibility in this area. It is common to use Apache Parquet as the on-disk structure for the data itself. Parquet files are efficient and easily read by a variety of tools, and together the pairing of Iceberg and Parquet forms a complete data archive and management capability.
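To make the Parquet data layer concrete, here is a minimal pyarrow sketch that writes a small batch of position reports to a Parquet file and reads it back. The field names (node_id, latitude, longitude, received_at) are illustrative placeholders, not MeshScope's actual message model.

```python
# Minimal sketch: write a batch of (hypothetical) position reports to a
# Parquet file with pyarrow, then read it back.
import pyarrow as pa
import pyarrow.parquet as pq

reports = pa.table({
    "node_id": ["!a1b2c3d4", "!e5f6a7b8"],
    "latitude": [47.6062, 47.6205],
    "longitude": [-122.3321, -122.3493],
    "received_at": pa.array([1717000000, 1717000060], type=pa.int64()),
})

# Columnar and compressed on disk; readable by DuckDB, Spark, pandas, etc.
pq.write_table(reports, "position_reports.parquet", compression="zstd")

print(pq.read_table("position_reports.parquet").to_pandas())
```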
Recent updates to the Parquet data format and Iceberg data indexing have added support for geospatial processing, indexing, and queries, which opens up a whole new range of interesting features and capabilities relevant to the MeshScope domain.
An IoT Iceberg
For our MeshScope application, we incorporated Iceberg to demonstrate the ability to process and archive fast, unbounded event streams in a cost-effective manner. We consider two types of data messages for Iceberg archiving: position (map) reports and text messages. We have no interest in complete long-term archiving of the Meshtastic network, and such an effort would ultimately raise privacy considerations, but we do want to review the technology as applied to this challenge. Many similar data streams, such as GPS device reports, asset positions, and other updates, present the same challenges.
The initial Iceberg implementation in MeshScope uses a local filesystem for simplicity in testing and evaluation, storing the Parquet data files and index management files in a local directory structure. This works well for iterative local development, but it ultimately introduces a scaling challenge: the machine will run out of space when receiving unbounded data streams of thousands of messages a minute.
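As a rough sketch of what this local setup looks like in code, the following uses the pyiceberg library with a filesystem-backed SqlCatalog (SQLite for catalog state, a local directory as the warehouse). The paths, namespace, and table name are placeholders; MeshScope's actual implementation differs in its details.

```python
# Sketch: a local, filesystem-backed Iceberg table via pyiceberg.
import os
import pyarrow as pa
from pyiceberg.catalog.sql import SqlCatalog

os.makedirs("/tmp/iceberg", exist_ok=True)
catalog = SqlCatalog(
    "local",
    uri="sqlite:////tmp/iceberg/catalog.db",    # catalog/index state
    warehouse="file:///tmp/iceberg/warehouse",  # Parquet data + manifests
)
catalog.create_namespace("meshscope")

reports = pa.table({
    "node_id": ["!a1b2c3d4"],
    "latitude": [47.6062],
    "longitude": [-122.3321],
})

# pyiceberg derives the Iceberg schema from the Arrow schema, writes the
# batch as a Parquet data file, and commits a new table snapshot.
table = catalog.create_table("meshscope.position_reports", schema=reports.schema)
table.append(reports)
```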
This approach avoided adding other software components to the system, although those components may bring additional benefits in a full system implementation. The Iceberg and Parquet code is incorporated directly into the MeshScope application; using Apache Spark or Flink to manage the Iceberg data storage process offers other benefits, at additional management cost.
The benefit of this local implementation was that it allowed MeshScope to test and validate capturing data in an Iceberg index, and it allowed the individual Parquet files to be tested and queried. Using tools such as DuckDB, both the Iceberg metadata and the Parquet files can be viewed and evaluated, as in the sketch below.
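For instance, with DuckDB's iceberg extension the table can be scanned through its Iceberg metadata, or the raw Parquet data files can be read directly. The paths below assume the local warehouse layout from the earlier sketch.

```python
import duckdb
from pyiceberg.catalog.sql import SqlCatalog

duckdb.sql("INSTALL iceberg; LOAD iceberg;")

# iceberg_scan needs the table's current metadata JSON; ask pyiceberg for
# its location rather than guessing the generated file name.
catalog = SqlCatalog(
    "local",
    uri="sqlite:////tmp/iceberg/catalog.db",
    warehouse="file:///tmp/iceberg/warehouse",
)
meta = catalog.load_table("meshscope.position_reports").metadata_location

# Scan through the Iceberg metadata (snapshot-aware).
duckdb.sql(f"SELECT count(*) AS reports FROM iceberg_scan('{meta}')").show()

# Or bypass Iceberg and read the Parquet data files directly.
duckdb.sql("""
    SELECT node_id, latitude, longitude
    FROM read_parquet('/tmp/iceberg/warehouse/meshscope.db/position_reports/data/*.parquet')
""").show()
```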
With the initial implementation validated, the system was adjusted to use AWS S3 for storing the Parquet data files and AWS Glue for the index metadata. Although this implementation is cloud-provider specific, other options may be considered for storing data in alternate locations. Beyond the benefit of unbounded storage for the IoT data streams, the AWS Glue metadata store allows the Parquet files to be periodically compacted. Because a Parquet file in Iceberg is immutable once created, streaming IoT data causes the creation of many small files, and that large number of files creates management overhead and impacts query performance.
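With pyiceberg, that storage switch is mostly configuration: load a Glue-backed catalog instead of the local one, as in the sketch below. The bucket name and region are placeholders, AWS credentials (and the Glue region) are assumed to come from the standard AWS environment, and this is not MeshScope's exact configuration.

```python
from pyiceberg.catalog import load_catalog

# Glue holds the table/index metadata; S3 holds the Parquet data files.
catalog = load_catalog(
    "glue",
    **{
        "type": "glue",
        "s3.region": "us-west-2",                       # placeholder region
        "warehouse": "s3://meshscope-archive/iceberg",  # placeholder bucket
    },
)

table = catalog.load_table("meshscope.position_reports")
print(table.current_snapshot())
```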
A periodic compaction process was implemented to aggregate smaller Parquet files into larger ones while ensuring the Iceberg metadata tracked these updates properly. This provided an effective balance between write speed when receiving streaming data and query and retrieval performance.
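One naive way to express such a pass with pyiceberg, assuming the table is small enough to rewrite in memory, is to read all live rows and overwrite the table in a single atomic commit; larger deployments would typically use an engine-side procedure such as Spark's rewrite_data_files. This is a sketch rather than MeshScope's actual compaction job.

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("glue", **{"type": "glue"})
table = catalog.load_table("meshscope.position_reports")

# Read every live data file, then rewrite the table contents in one commit.
# The new snapshot references far fewer, larger Parquet files; the old small
# files remain reachable only through prior snapshots (and can be cleaned up
# when those snapshots are expired).
everything = table.scan().to_arrow()
table.overwrite(everything)
```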
Results
The incorporation of Apache Iceberg into MeshScope was not intended to provide a critical data store, but rather to explore capturing, processing, and retrieving unbounded IoT data streams in a flexible way. With its flexibility in storage and metadata management, Iceberg has become a logical choice to consider for such solutions. Query performance can be quite good, although generally not at the full speed of a traditional RDBMS, making it a valuable tool in many circumstances.
The net result is that MeshScope can retain more data for longer with less system overhead. The primary database supporting the system contains less online data and can focus on processing updates and real-time activity, while Iceberg provides the ability to retrieve historical events and activity very quickly. Although the primary database could be scaled up to meet MeshScope's requirements, other IoT data streams are multiple orders of magnitude larger, and the cost and management overhead becomes a challenge for teams. A hybrid data management process is a better way to manage the cost (operational and infrastructure) and performance tradeoffs for this type of scenario.
The Iceberg implementation within MeshScope is foundational and will continue to expand over time. Using the Iceberg-based data within the AI model processing, for both chat and agentic event processing, is a planned feature.
Let us know what you would be interested in hearing about!