What is Big Data?
Big Data describes the process of extracting actionable intelligence from disparate, and often non-traditional, data sources. These data sources may include structured data such as database records, sensor readings, click-stream data and location data, as well as unstructured data like email, HTML, social data and images. The actionable data may be represented visually (e.g. in a graph), but it is often distilled down to a structured format, which is then stored in a database for further manipulation.
The sheer volume of data being collected is more than traditional compute infrastructures can handle; it exceeds the capacities of databases, storage, networks and everything in between. Extracting actionable intelligence from Big Data requires handling large amounts of disparate data and processing it very quickly. Finally, the data inputs and the actionable intelligence must be correct; the data must be consistent and clean. As the saying goes, garbage in, garbage out. All of these demands are overwhelming traditional computing infrastructure. IBM describes these new demands across four dimensions: Volume, Velocity, Variety and Veracity. I would add Richly Linked Data to this list, since processing Big Data uncovers rich relationships within that data, except I cannot think of a “V” word that says richly linked. To deal with the onslaught of Big Data, companies are turning to new tools and new business processes.
Big Data Tools
As Big Data overwhelms traditional databases, storage and more, companies are looking to exploit new tools like Hadoop, SSDs, database virtualization, storage virtualization, network virtualization, and more. The reality is that you want to avoid single-device bottlenecks, since they inhibit scaling. Hadoop uses MapReduce to spread analytical processing across armies of commodity servers. SSDs, while expensive per GB of data capacity, provide the performance necessary to keep up with the velocity of Big Data. Virtualization of the database, storage and networking provides the elasticity and agility needed to scale to address Big Data demands, while delivering a consistent quality of service. These are just some of the tools being brought to bear on the Big Data challenge.
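To make the MapReduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (map_chunk, shuffle, reduce_groups, word_count) are made up for illustration, and each text chunk simply stands in for the slice of data that one commodity server would process.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # On a real cluster each map_chunk call would run on a different server;
    # in this sketch they run one after another in a single process.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_groups(shuffle(mapped))


counts = word_count(["big data begets bigger data", "Big Data tools"])
```

Because each map call only looks at its own chunk, the map work can be spread across as many machines as there are chunks, which is what removes the single-device bottleneck.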
Big Data Business Processes
The benefits of Big Data are quite tantalizing. Big Data can be used to improve efficiency and predictive capabilities in everything from health care to oil drilling. Once businesses get a taste of Big Data, their appetite becomes insatiable. This has spawned new business processes to meet the rising demand. Moving to the cloud is one such business process enabling Big Data. The cloud enables you to process your Big Data using, say, 1,000 machines for just an hour, paying only for the time you use them. This makes Big Data processing cost-effective in terms of both operational expenses (OpEx) and capital expenses (CapEx). Another business process that is becoming popular is cloud bursting. Cloud bursting means running the core process on your own machines, but allowing overflow compute demands to run on a public cloud, typically for a short period of time. Creative companies will use these and other innovative business processes to deal with the growing demands of Big Data.
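The overflow rule behind cloud bursting can be sketched in a few lines of Python. This is only an illustration under simple assumptions: place_jobs is a hypothetical helper, each job is described only by the number of cores it needs, and a real scheduler would also weigh data location, cost and security.

```python
def place_jobs(job_cores, local_capacity):
    """Assign each job to 'local' while capacity remains, otherwise to 'cloud'.

    job_cores is a list of core counts, one per job, in arrival order.
    local_capacity is the number of cores available on your own machines.
    """
    placements = []
    free = local_capacity
    for cores in job_cores:
        if cores <= free:
            placements.append("local")
            free -= cores
        else:
            # Overflow demand bursts to the public cloud.
            placements.append("cloud")
    return placements


placements = place_jobs([4, 8, 2, 6], local_capacity=12)
```

In this run the first two jobs fill the 12 local cores, so the last two jobs are sent to the cloud.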
What Role do Databases Play in Big Data?
Big Data begets Bigger Data. The more a company recognizes the transformative role of Big Data, the more data it seeks to capture and utilize. As a result, more companies are capturing more data. This includes everything from collecting web analytics and click-stream data to expanding the database schema to capture more transactional information. The more you utilize Big Data, the more data you seek to collect.
Databases fall into two classes: analytical and transactional. Transactional databases capture structured information and maintain the relationships between that information. Transactional data is one feedstock for Big Data. Analytical databases then sift through the structured and unstructured data to extract actionable intelligence. Often, this actionable intelligence is then stored back in a transactional database.
How are Transactional Databases Handling Big Data?
Big Data requires decentralization. Because of the volume and velocity of data being processed, centralization is anathema to Big Data. The networking, storage and compute must be decentralized or they will not scale. However, centralization is a core tenet of SQL databases. Traditional databases tightly link computation, caching and storage in a single machine in order to deliver optimal performance. There are two approaches to scaling SQL databases in order to handle Big Data: sharding and shared-data clustering.
One approach to decentralizing transactional databases is sharding. If you have an existing schema, sharding removes the relations between tables and then stores those various tables in separate databases. This forces the application layer to maintain, and in some cases reconstruct, those relationships. One common approach to sharding is to split customers across multiple databases. For example, you might have customers 1-10,000 in one database, then 10,001-20,000 in another database and so on.
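The customer-range example above can be sketched as follows. This is a minimal illustration, not any product's API: SHARD_SIZE, shard_for_customer, save_customer and load_customer are hypothetical names, and three in-memory dictionaries stand in for three separate databases.

```python
SHARD_SIZE = 10_000  # customers per shard, matching the ranges described above


def shard_for_customer(customer_id, shard_size=SHARD_SIZE):
    """Return the zero-based shard index for a 1-based customer ID."""
    if customer_id < 1:
        raise ValueError("customer IDs start at 1")
    return (customer_id - 1) // shard_size


# Three dictionaries stand in for three separate databases,
# so this sketch only covers customers 1 through 30,000.
shards = [{}, {}, {}]


def save_customer(customer_id, record):
    """Application-level routing: pick the shard, then write to it."""
    shards[shard_for_customer(customer_id)][customer_id] = record


def load_customer(customer_id):
    """Application-level routing: pick the shard, then read from it."""
    return shards[shard_for_customer(customer_id)].get(customer_id)


save_customer(42, {"name": "Ada"})
save_customer(10_001, {"name": "Grace"})
```

Note that the routing logic lives in the application, not in the database. Any query that spans customers in different ranges has to visit several shards and combine the results in application code.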
Sharding is one way to scale your data handling needs, but it is very inflexible and doesn’t adhere to the Big Data principle of agility. A sharded database cannot add new data sources, or new ways of processing that data, on the fly. Sharding creates a rigid structure that necessitates a painful re-sharding each time you modify or expand the data or the relationships between the data.
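One way to get a feel for the cost of re-sharding is to count how many rows would have to move when the partitioning rule changes. The Python sketch below assumes the simple range scheme of 10,000 customers per database and a hypothetical decision to halve the range size; the function names are illustrative only.

```python
def shard_for_customer(customer_id, shard_size):
    """Return the zero-based shard index for a 1-based customer ID."""
    return (customer_id - 1) // shard_size


def customers_to_move(total_customers, old_size, new_size):
    """Count customers whose shard index changes when the range size changes."""
    return sum(
        1
        for cid in range(1, total_customers + 1)
        if shard_for_customer(cid, old_size) != shard_for_customer(cid, new_size)
    )


# Hypothetical scenario: 30,000 customers, ranges shrink from 10,000 to 5,000.
moved = customers_to_move(30_000, old_size=10_000, new_size=5_000)
```

With these numbers, 25,000 of the 30,000 customers land in a different shard, so most of the data has to be copied between databases before the new layout can be used.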
Shared-data database clusters, as provided by ScaleDB and Oracle RAC®, deliver the agility required to handle Big Data. Unlike sharded databases, shared-data clusters support elastic scaling. If your database requires more compute, you can add compute nodes. If your database is I/O bound, you can add storage nodes.
In keeping with the Big Data principle of distributing the workload, shared-data clusters parallelize some processing across smart storage nodes, further eliminating bottlenecks, and allowing you to scale to address your Big Data needs.
Unlike sharded databases, shared-data clusters maintain the flexibility to add new tables and relationships on the fly. This flexibility is imperative in order to keep up with the ever-changing data sources and data relationships driven by Big Data.
How Can You Prepare for Big Data?
The most important first step in preparing for Big Data is to consider scale, parallelization, and agility. These issues must be considered when choosing your computing tools and your business processes. Maintain agility or flexibility, because your data and your processing needs will change and that change may be rapid and disruptive. Scale and parallelization go hand-in-hand. The only way you can scale to handle Big Data is by leveraging parallelization. This means you must distribute processing, data and networking so as to avoid bottlenecks.
These same principles apply to your business processes. This may involve exploiting elastic cloud computing either directly or through cloud bursting. Consider that, as you plan your infrastructure and your schemas today, things will change relatively quickly. Big Data begets Bigger Data, so prepare for future scale and agility today.
For more information about Big Data and how databases are adapting to the demands of Big Data, see our Big Data White Paper.