IoT data processing poses unique challenges, and data streaming with Apache Flink has proven a reliable way to address them.
Way back when, our first client at Freeport Metrics asked us to help them build software for smart power meters. That was around 10 years ago, before the term “Internet of Things” (IoT) went mainstream. Since then, we have expanded into many other verticals beyond energy, including fintech, healthcare, and foodtech. But the ‘Metrics’ part of our name has remained important to our identity, because no matter the sector, every software product we built relied on data.
Data Streaming + Flink + IoT = A Good Idea
Over the years, we’ve helped our clients build a solar energy monitoring and billing platform, optimize wind farm data processing, and create a large-scale, real-time RFID asset tracking platform. In all these cases we’ve seen similar challenges:
- Devices produce far more data than users do - traditional databases were not built for this kind of volume
- IoT users expect real-time information they can act on immediately - ETL or batch data operations are not good enough
- Connectivity is never guaranteed when you send data from edge devices over cellular networks
Reading Tyler Akidau’s post about data streaming was the turning point for us. Every single argument in the article resonated with our experience and provided an elegant abstraction to replace custom solutions that had to be built before. (Check out Tyler’s just-released book on Streaming Systems.)
Two years after adopting data streaming at FM and battle-testing it on the asset tracking project mentioned above, we are strongly convinced that it is often the right way to go for IoT data processing. We also recommend Apache Flink as the state-of-the-art tool for that purpose. Here is why:
1. Real-Time Data Processing Is a Game Changer
When it’s windy outside but your wind turbine produces no energy, you’d rather know sooner than later. When a precious asset is about to leave your facility for an unknown reason, you want to know immediately. Many IoT use cases require immediate information and an action to follow.
Operating on streams instead of batches changes the programming paradigm fundamentally and allows for triggering calculations immediately when data is available, setting timely alerts, or detecting event patterns in a continuous manner.
Additionally, processing data as it arrives may be optimal for performance in some cases. For example, no computations are needed if no new event occurred, which means you don’t have to recompute the whole data set periodically to get fresh results.
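The incremental pattern described above can be sketched in a few lines. This is a toy simulation of the idea, not Flink's API; the `RunningAverage` class and `on_event` method are hypothetical names for illustration:

```python
# Toy sketch (not Flink code): update an aggregate per event instead of
# periodically recomputing over the whole data set.
class RunningAverage:
    """Keeps a running average that is updated once per incoming event."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_event(self, value: float) -> float:
        # Only this O(1) update runs when an event arrives; if no event
        # occurs, no computation happens at all.
        self.count += 1
        self.total += value
        return self.total / self.count

avg = RunningAverage()
for reading in [10.0, 20.0, 30.0]:
    current = avg.on_event(reading)
# current == 20.0 after the three readings
```

The same shape generalizes to any aggregate that can be updated incrementally (counts, sums, min/max), which is exactly what makes per-event processing cheap.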
2. Event Time is Usually the Right Way to Order IoT Data
When data from your devices travels through a cellular network, you should take latency and network failures as a given. Even if you send it over a more stable connection, you cannot beat the laws of physics: latency will increase with distance from your data center.
On a smaller scale, imagine a factory and a part quickly moving through a production line with sensors along that line. There is no guarantee that readings from sensors will arrive over the network in the order they were captured.
More often than not, it makes sense to process events ordered by the time at which they occurred (event time), not by when they arrived at the data center (ingestion time) or when they are actually processed (processing time). So it’s convenient to use a framework that provides event time support out of the box.
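The difference is easy to see with a toy example (plain Python, not Flink): three sensor readings arrive over the network out of order, and sorting by their embedded event timestamps recovers the order in which they were captured.

```python
# Each reading is (event_time, value); the list order is the arrival order.
readings = [
    (3, "C"),  # captured last, arrived first
    (1, "A"),  # captured first, delayed by the network
    (2, "B"),
]

# Processing in arrival order vs. ordering by the event timestamp.
processing_order = [value for _, value in readings]
event_time_order = [value for _, value in sorted(readings)]

# processing_order == ["C", "A", "B"]
# event_time_order == ["A", "B", "C"]
```

In a real stream you cannot simply sort, because the stream never ends; that is what watermarks (discussed below) are for.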
3. Tools for Dealing with Messy Data
Data pre-processing is usually the hardest part of the process. It is even harder when you don’t fully control the source, as is often true in the IoT world. You can end up with a serious portion of glue/clean-up code and bizarre conditional logic.
Of course, data streaming doesn’t fix your data for you, but it offers a couple of nice tools. The most useful, in our opinion, is windowing. Windowing is the concept of grouping data from a certain time period together for further processing. Let me give you a couple of examples.
We worked with power meters that sent data over the fast but failure-prone Modbus protocol during the day and produced precise batch files at the end of it. In this case, you could use the Modbus data to calculate approximate live statistics to display to end users and then replace them with the batch file data later so it can be used for accurate billing. This can be achieved by writing a trigger which produces partial results as new data comes in and closes the window when the batch data arrives.
Sometimes there is no way to tell whether all your data has arrived. In this case, you would trigger windows according to a heuristic watermark you can calculate empirically, or assume some maximum latency in the system. (Think of a watermark as an event that pushes time forward in the system.) Flink also lets you specify the allowed lateness of elements and provides side outputs to handle events that arrive later.
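To make the watermark idea concrete, here is a minimal toy simulation (not Flink code; the window size, lateness bound, and function names are assumptions for illustration) of 10-second tumbling windows closed by a heuristic watermark that allows 5 seconds of lateness:

```python
from collections import defaultdict

WINDOW = 10        # tumbling window length, in seconds (assumed)
MAX_LATENESS = 5   # heuristic bound on how late events can be (assumed)

windows = defaultdict(list)   # window start time -> buffered values
closed = {}                   # window start time -> final result
watermark = 0

def on_event(event_time, value):
    global watermark
    start = (event_time // WINDOW) * WINDOW
    windows[start].append(value)
    # Heuristic watermark: highest event time seen, minus allowed lateness.
    watermark = max(watermark, event_time - MAX_LATENESS)
    # Close (and aggregate) every window whose end the watermark has passed.
    for s in [s for s in windows if s + WINDOW <= watermark]:
        closed[s] = sum(windows.pop(s))

for t, v in [(1, 5), (4, 7), (11, 2), (18, 3)]:
    on_event(t, v)
# The [0, 10) window closed with 5 + 7 = 12; the [10, 20) window is still open.
```

Flink's built-in watermark generators follow the same basic idea: they track observed event times and emit a watermark lagging behind by a bounded amount.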
4. Segmentation Allows for Parallel Processing
Oftentimes, different users of an IoT solution are interested in calculations performed only on a subset of the data.
Let’s say you created a platform that allows cat owners to see where their beloved pets are wandering. Each owner needs data only from the GPS tracker of their own cat. Flink introduces the concept of grouping by key for exactly that purpose.
Once a stream is partitioned by key, it can be processed in parallel, so you can scale horizontally. Of course, a key doesn’t have to be bound to a single IoT device or location. For example, in fleet management you may want to group different signals related to a single vehicle together (GPS, hardware sensors, license plate scans at parking gates, etc.).
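The essence of keyed partitioning can be shown in a few lines of plain Python (a toy sketch of the idea behind Flink's `keyBy`, not its API; the key names and hashing scheme are made up):

```python
# Route events to parallel "partitions" by key: all events for one key
# always land in the same partition, so one worker sees a cat's whole stream.
PARALLELISM = 2
partitions = [[] for _ in range(PARALLELISM)]

def key_by(event):
    cat_id, position = event
    # Deterministic hash of the key picks the partition.
    index = sum(ord(c) for c in cat_id) % PARALLELISM
    partitions[index].append(event)

for event in [("whiskers", (1, 2)), ("felix", (3, 4)), ("whiskers", (5, 6))]:
    key_by(event)

# No single cat's events are ever split across partitions, so per-cat
# state and computations stay local to one worker.
```

Because each partition holds complete streams for its keys, workers can maintain per-key state independently and the job scales by adding partitions.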
One addition to Flink that we would gladly welcome is per-key watermarks. It would help in cases of network partitioning, when some devices reconnect after a longer period of time and want to catch up with buffered data. Right now, time progress is effectively global in the system, and missing data from one device can block processing for other devices (if you need a watermark that waits for data from all the devices before progressing time).
We also recommend exploring the latest Streaming Ledger functionality which allows for distributed transactions between parallel streams (a paid feature of dA Platform as of the time of writing).
5. Local State is Crucial to Performance
While the “latency numbers every programmer should know” change each year thanks to progress in hardware and infrastructure, some fundamental rules remain constant:
- The closer data is to you the faster you can process it
- Disk I/O is bad for performance
Don’t be misled into thinking that local state is just another form of read-only local cache. It truly shines when you update it with event data. For example, you could store historical values of sensor readings in local state and update it with new data to calculate live statistics.
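That update-in-place pattern looks roughly like this toy sketch (plain Python, not Flink's keyed-state API; the sensor IDs and the min/max/last statistics are arbitrary examples):

```python
# Per-sensor local state updated in place with each reading, instead of
# reading from and writing back to an external database per event.
state = {}  # sensor id -> (min, max, last) kept locally by the worker

def on_reading(sensor_id, value):
    lo, hi, _ = state.get(sensor_id, (value, value, value))
    # The event both reads and updates the state; no external round trip.
    state[sensor_id] = (min(lo, value), max(hi, value), value)
    return state[sensor_id]

for sensor_id, value in [("t1", 20.0), ("t1", 25.0), ("t1", 18.0)]:
    stats = on_reading(sensor_id, value)
# stats == (18.0, 25.0, 18.0) after the three readings
```

In Flink, state like this lives with the operator instance that owns the key, is checkpointed for fault tolerance, and never forces a per-event trip to remote storage.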
Some people even question the need for a separate data store and propose using Flink as the single source of truth. We weren’t that adventurous ourselves, and most of the examples from the community use some external storage for further offline processing or other purposes.
If you want to get really philosophical, see this talk about the convergence of streaming and microservice architecture by Viktor Klang from Lightbend.
6. Flink Loves Kafka
You say “big data message broker,” you think Kafka. Nowadays, Kafka is the de facto standard for large volume event ingestion. It’s fast and battle tested.
Flink provides first-class support for Kafka, both for consuming and producing events. Flink’s distributed nature plays nicely with Kafka partitioning. It also offers exactly-once end-to-end processing with Kafka if your use case requires it.
7. Data Streaming is Conceptually Simple (Once You Get Used to It)
Last but not least, once you adopt a data streaming approach, it just feels natural. It would be an overstatement to say adopting a streaming approach is trivial. It’s not. It took us some time to wrap our heads around how to use state properly in Flink and how operator parallelism works. That said, once you get used to it, you can focus mostly on the core logic of your application, as most of the ‘dirty’ work is handled by the framework for you.
We felt the same way when we switched from old school JSF to Spring for Java development, used Ruby on Rails for the first time or saw that Angular and React can handle HTML DOM updates for you so you don’t need to manipulate it manually. At some point, you just know that it’s the right tool for the job and you wonder how you ever lived without it.