High-Velocity Data – The Data Fire Hose
What is High-Velocity Data?
Computer systems are creating ever more data at increasing speeds, and a growing number of consumers, in both operations and analytics, want that data. Hadoop-style batch processing has awakened engineers to the value of big data, but they increasingly demand access to the data earlier. In essence, people not only want all of the data, they want it as soon as possible; this is driving the trend toward high-velocity data. High-velocity data, or fast data, can mean millions of rows of data per second, so we are talking about massive volume as well. One of the use cases for high-velocity data is real-time analytics.
What is Driving the Explosion in High-Velocity Data?
Data generated by humans has been growing exponentially for quite some time, fueling the growth of companies like EMC and NetApp. In fact, by one widely cited estimate, 90% of the world’s data was created in the last 2 years. This really demonstrates how the world has embraced big data. However, the data generated by devices and by the actions of humans (such as log files, website click-stream data, and Twitter feeds) wasn’t tracked or collected until recently, because state-of-the-art technology couldn’t handle that data velocity.
Big Data, driven largely by Hadoop, provided a mechanism for running analytics across massive volumes of data using a batch process. This gave people a reason to store these huge amounts of data. As people began deriving value from big data, they started wanting more. They began to ask why they couldn’t process these large volumes of data in real time. This extreme level of data velocity requires new high-velocity data technologies.
What are the Sources of High-Velocity Data?
This is a list of some of the popular sources of high-velocity data today:
- Log Files: Devices, websites, databases, and any number of other technologies log events. Log mining applications like Splunk and Loggly opened people’s eyes to the value in these log files. This resulted in an increase in logging and in the richness of the data collected in these log files.
- IT Devices: Networking devices (routers, switches, etc.), firewalls, and printers all generate valuable data, as does nearly every device these days, assuming you can collect and process it at scale.
- User Devices: One of the largest sources of high-velocity data is the use of smartphones. Everything you do on your smartphone is logged, providing valuable data.
- Social Media: Whether it is Twitter tweets, Facebook posts, Foursquare check-ins, or any number of other social data streams, these sources create massive amounts of real-time data that degrades in value quickly.
- Online Gaming: Another source of real-time data based on user interactions, not just with the game but also with other users. This group includes Massively Multiplayer Online Games (MMOGs) like World of Warcraft as well as 1:1 games, many played on mobile phones, like Words with Friends.
- SaaS Applications: SaaS applications typically start with a limited set of functionality. As they mature, the functionality grows and user relationships and interactions also grow, creating a massive flow of real-time data. LinkedIn is a perfect example of this trend. This high-velocity stream of events led LinkedIn to create Kafka, a publish-and-subscribe messaging system (grouped in this article with Complex Event Processing, or CEP, tools) that handles the routing and delivery of high-velocity event data.
There are many more sources of high-velocity data, including vertical sources, like the flood of GIS data found in oil and gas companies. As technologies come online to extract value from this high-velocity data, it is transforming many industries.
Managing the Flow of High-Velocity Data
The flood of high-velocity data can quickly overwhelm systems, especially during peak loads.
Furthermore, most applications need certain quality guarantees (guaranteed delivery, exactly-once delivery, etc.). To coordinate the flow of high-velocity data, some companies use Complex Event Processing (CEP) solutions based on a publish-and-subscribe approach. Examples include Java Message Service (JMS) and Apache Kafka, which came out of LinkedIn; strictly speaking, both are messaging systems rather than full event-pattern engines. If you only need to manage the flow of data, CEP can help coordinate the flood.
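To make the publish-and-subscribe idea concrete, here is a minimal in-memory sketch in Python. It is a toy, not the Kafka or JMS API: the `ToyBroker` class, the topic names, and the event fields are all hypothetical, and it offers none of the delivery guarantees or persistence a real system provides.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class ToyBroker:
    """In-memory publish/subscribe broker (illustrative only, no persistence)."""

    def __init__(self) -> None:
        # Maps a topic name to the callbacks subscribed to it.
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber of the topic; return the delivery count."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


broker = ToyBroker()
operations_inbox: List[dict] = []
analytics_inbox: List[dict] = []

# Two independent consumers (operations and analytics) read the same topic.
broker.subscribe("clicks", operations_inbox.append)
broker.subscribe("clicks", analytics_inbox.append)

delivered = broker.publish("clicks", {"user": "u1", "page": "/home"})
ignored = broker.publish("logins", {"user": "u1"})  # no subscribers on this topic
```

The point of the pattern is that the producer publishes once and never needs to know how many consumers exist; each subscriber to a topic receives its own copy of the event.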
Processing High-Velocity Data
The desire to extract real-time insight from high-velocity data led to the creation of stream processing engines. These engines include Twitter’s Storm, Yahoo’s S4, and LinkedIn’s Samza (built on top of Kafka, mentioned above). These engines can route, transform, and analyze a stream of data at high velocity. However, they do not persist the data themselves; instead, they provide a brief sliding window on the data. For example, they might maintain a 2-minute or 10-minute view of the data, but the amount, or time window, is limited by the velocity of the data and the size of their memory. These engines can hand the data off to a database for persistence, giving you a comprehensive view of the historical data. This assumes that your chosen database can handle the data velocity.
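The sliding window described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Storm, S4, or Samza implement it; the `SlidingWindowCounter` class is hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import Counter, deque
from typing import Deque, Tuple


class SlidingWindowCounter:
    """Counts events per key over the most recent `window_seconds` of a stream."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, str]] = deque()  # (timestamp, key), oldest first
        self._counts: Counter = Counter()

    def add(self, timestamp: float, key: str) -> None:
        self._events.append((timestamp, key))
        self._counts[key] += 1
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window; they are gone for good.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, old_key = self._events.popleft()
            self._counts[old_key] -= 1
            if self._counts[old_key] == 0:
                del self._counts[old_key]

    def count(self, key: str) -> int:
        return self._counts[key]


window = SlidingWindowCounter(window_seconds=120)  # a 2-minute view
for ts, color in [(0, "green"), (30, "red"), (100, "green"), (130, "green")]:
    window.add(ts, color)

green_now = window.count("green")  # the event at t=0 has already been evicted
```

Memory use grows with the number of events inside the window, which is why the window length is bounded by both data velocity and available memory. Anything evicted is unrecoverable unless it was also written to a database.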
Persisting High-Velocity Data…the Database
Traditional Database Management Systems (DBMS) simply cannot handle the high-velocity data coming from modern applications. This is a data ingestion problem; think of a human sipping from a firehose and you’ll get the idea. Hadoop provides batch processing of high-volume data, but when dealing with high-velocity data you need real-time processing. This has led to a few innovations.
Add a SQL Interface to Hadoop
The demand for persisting and querying high-velocity data in real time has led a number of companies to add limited SQL interfaces to Hadoop. Examples of this approach include Apache Tez (Hortonworks), Impala (Cloudera), Hadapt (Hadapt) and Apache HBase. Hadoop and HDFS weren’t designed for database requirements (in fact, their storage is based on large files, not small blocks), but corporate demand for a solution to the high-velocity data ingestion problem is certainly strong. Hadoop is really optimized for data volume, not data velocity.
NoSQL is one solution to the high-velocity data ingest problem. The challenge NoSQL faces is the same challenge faced by Hadoop: corporations have standardized on SQL and built expertise and tools around it, and those skills and tools do not carry over to NoSQL databases.
In-memory databases eliminate the slowest piece of the traditional database, the disk, enabling them to ingest data at a much higher rate than traditional databases. The two big contenders in the in-memory database world are HANA (SAP) and TimesTen (Oracle). However, in-memory databases are ill-suited to high-velocity data because their data size is limited to memory; they simply cannot handle the volume of data created by a high-velocity data source.
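A back-of-the-envelope calculation shows how quickly a sustained stream can exhaust memory. The figures below are illustrative assumptions, not measurements: 1 TB of RAM, one million rows per second (the low end of "millions of rows per second"), and 100 bytes per row.

```python
def seconds_until_full(memory_bytes: int, rows_per_second: int, bytes_per_row: int) -> float:
    """Time for a sustained ingest stream to fill a fixed amount of memory."""
    return memory_bytes / (rows_per_second * bytes_per_row)


# All three figures are illustrative assumptions.
MEMORY_BYTES = 10**12          # 1 TB of RAM
ROWS_PER_SECOND = 1_000_000    # low end of "millions of rows per second"
BYTES_PER_ROW = 100            # a small log-style row

hours_until_full = seconds_until_full(MEMORY_BYTES, ROWS_PER_SECOND, BYTES_PER_ROW) / 3600
print(f"{hours_until_full:.1f} hours")  # prints "2.8 hours"
```

Under these assumptions the stream writes 100 MB per second, so even a terabyte of memory holds less than three hours of data, ignoring indexes and other overhead.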
Extending MySQL to Handle High-Velocity Data: ScaleDB
Traditional databases, like MySQL, do not deliver sufficiently high data ingest rates to persist high-velocity data. ScaleDB changes all of that. ScaleDB extends MySQL without changing a single line of MySQL code, so the entire ecosystem (tools, applications, etc.) works with ScaleDB. ScaleDB’s new Streaming Table™ technology enables a small cluster of MySQL databases to ingest millions of rows of data per second. This data is then available for real-time manipulation using the rich tools that are already part of the MySQL ecosystem, such as Tableau Software, QlikView and Logi Analytics.
In addition to running leading analytics tools, persisting the data in a database gives you the ability to query the data in an ad hoc fashion. If we use the example of a flow of colored balls, a stream processor can count green balls, or it can transform all data about red balls into orange balls. However, if you want to ask questions of the data across a time series, you need database functionality. For example, using a database you can ask how many red balls were preceded by green balls, or how many orange balls were processed in the last hour, or any number of questions at whatever level of detail you need, all in an interactive fashion.
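The colored-ball questions above translate directly into SQL once the stream has been persisted. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database; the `balls` table, its columns, and the sample rows are hypothetical, and it says nothing about how ScaleDB or MySQL would store the stream.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balls (seq INTEGER PRIMARY KEY, color TEXT, ts INTEGER)")

# Hypothetical stream: (arrival order, color, arrival time in seconds).
stream = [
    (1, "green", 0),
    (2, "red", 600),
    (3, "orange", 1200),
    (4, "red", 4000),
    (5, "green", 5000),
    (6, "red", 6000),
    (7, "orange", 7000),
]
conn.executemany("INSERT INTO balls VALUES (?, ?, ?)", stream)

# How many red balls were immediately preceded by a green ball?
red_after_green = conn.execute(
    """
    SELECT COUNT(*)
    FROM balls AS b
    JOIN balls AS prev ON prev.seq = b.seq - 1
    WHERE b.color = 'red' AND prev.color = 'green'
    """
).fetchone()[0]

# How many orange balls arrived in the last hour, measured from the latest arrival?
orange_last_hour = conn.execute(
    """
    SELECT COUNT(*)
    FROM balls
    WHERE color = 'orange'
      AND ts > (SELECT MAX(ts) FROM balls) - 3600
    """
).fetchone()[0]

print(red_after_green, orange_last_hour)  # prints "2 1"
```

Neither question had to be known when the data was collected, which is the practical difference between an ad hoc query over persisted data and a counter fixed inside a stream processor.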
Selecting the Right High-Velocity Data Tool for Your Needs
High-Velocity Data, over time, accumulates to create Big Data. Think of high-velocity data as the firehose, pumping out water that forms into a pond that represents big data. Hadoop has gained popularity for providing batch-oriented processing of big data. But batch processing is deficient in that it does not provide real-time processing or ad hoc queries.
Several classes of applications are generating high-velocity data for which Hadoop-style batch processing is insufficient. For example, a Massively Multiplayer Online Game (MMOG) might require a high-velocity data solution that serves multiple use cases: (1) maintaining player state during and between sessions; (2) generating real-time analytics as a mechanism for modifying game play or informing operations; (3) supporting ad hoc queries from customer support; (4) providing real-time action-based billing; and more. In this case, the brief moving window of time provided by stream processing engines is insufficient. The application requires high-velocity streaming persistence with an ad hoc, ideally SQL-based, interface.
Hadoop opened up whole new possibilities for extracting value from big data, or high-volume data. This led more and more companies to start collecting massive amounts of data, because they could extract value from it. The new wave of high-velocity data tools enables companies to extract real-time value from high-velocity data, instead of waiting for it to pile up and then running a batch process on it. Look for more companies to recognize this opportunity to drink upstream from their competition, using high-velocity data to become more agile, more responsive, and ultimately more competitive.