- What is Database Virtualization?
- Database Virtualization Challenges
- Database Virtualization Advantages
- Degrees of Database Virtualization
What is Database Virtualization?
Virtualization is a shift from managing dedicated physical servers to managing a pool of logical compute resources. It sacrifices some degree of compute efficiency in exchange for flexibility, mobility, manageability, scalability/elasticity and high availability.

Database virtualization means different things to different people: from simply running the database executable in a virtual machine, or using virtualized storage, to a fully virtualized elastic database cluster composed of modular compute and storage components that are assembled on the fly to accommodate your database needs. We will look at each type of virtualization and the benefits it provides.

Virtualization is anathema to traditional OLTP databases. These databases tightly integrate compute, caching and storage in order to: (a) optimize performance; (b) coordinate locking so that the database remains consistent; and (c) provide “copies” for fail-over or high-availability. Fully virtualizing an existing database management system (DBMS) is analogous to replacing the entire engine and drivetrain in a car: it involves replacing all of the hard technology.

This article looks at some of the challenges of virtualizing databases and the benefits of database virtualization, and then examines the approaches to database virtualization and the vendor solutions implementing those approaches.
Database Virtualization Challenges
Traditional databases operate on a single physical computer. They tightly integrate the compute, caching and storage functions, operating as a single unit on a single server. Because they are limited to a single dedicated server, traditional databases display certain constraints:
- Scaling: A database running on a single server is limited by the number of users, transactions, data files, etc. that it can handle. These are physical limitations imposed by the hardware that is running the database. Scaling up means moving your database to a more powerful machine. Scaling out means running your database across multiple machines. Elasticity is a much higher hurdle for databases. Elasticity means the ability to add and remove compute or storage capabilities on-the-fly, without downtime.
- Quality of Service: You can experience poor Quality of Service (QoS) as a result of heavy processing on a shared server—consuming CPU, cache, disk—or heavy network usage—on a shared network card or sharing the same switch. Because databases are tied to a physical server, and the application tier addresses that specific server, you are unable to easily move the database to other servers or other areas of the network to address QoS problems. Such a move requires shutting down the application and database, copying the files to another server and restarting both the application and the database.
- High-Availability: Operating databases contain both data and “state”. State is the current collection of information stored in memory, plus the processes—such as transactions—in flight at the moment. Simply copying the data files to another server does not copy the state, and to make matters worse, the state of a database is constantly changing. As a result, almost all databases require shutting down the database and application in order to move them to another server. Fail-over means that you can restart the database with little or no loss of data, but it does require shutting down the application. High-availability means that the database handles the failure of one or more servers without interrupting the database or application; it is a high hurdle. At other layers of the software stack, virtualization provides inherent high-availability. However, because of the importance of state in maintaining a consistent database, high-availability must be addressed inside the DBMS itself.
- Pooling/Exploiting Unused Resources: Database servers are sized for peak load, plus growth, plus a buffer, so that the database doesn’t run out of resources and fail. Because you must size your database for peak usage, it is not uncommon for the database to average only about 10% of the server’s capacity. With a dedicated server, the remaining capacity (compute and storage) goes unused. Virtualization can pool and repurpose this unused capacity.
Database Virtualization Advantages
“Our focus has been that many databases are accessible and manageable as if they were a single database. Virtualization provides a common framework for better availability, scalability, manageability and security.” – Noel Yuhanna, Principal Analyst, Forrester Research
Server and storage virtualization are the critical technologies that enable cloud computing. Prior to the advent of virtualization, each application resided on a physical server. While average utilization might run only 10-20% of that physical server, it had to have sufficient excess capacity to handle usage spikes and future growth. In other words, most of the time you are paying for 80-90% more computing than you need. Virtualization enables those unused resources to be pooled and shared: you pay only for what you use, the 10-20% utilization, and usage spikes are spread across the unused or excess compute power in the cloud. From the perspective of the public cloud, this is the equivalent of renting out three times more servers than you physically have, thanks to efficient pooling and sharing of resources. This ability to rent out more servers than you own is what drives the economics of cloud computing.

Unfortunately, this model of virtualization simply doesn’t work with SQL databases. The database tightly integrates data and compute, so sharing the peak load across additional machines simply doesn’t work. In order to support load sharing across additional servers or instances, the database would need to add distributed locking and queuing, distributed cache coherency, distributed indexing and failure-recovery capabilities. There are workarounds that provide some of the advantages of database virtualization, but to enjoy all of those advantages, the database itself must provide these capabilities.

There are 10 core advantages to database virtualization. We use the following 10 icons to represent each advantage. These icons provide a quick visual representation of the advantages delivered by each approach and the associated products.
|Pooling Unused Resources: When a database is not fully utilizing the server it is running on, can those resources be pooled and made available to other applications? This is one of the virtualization triumvirate of “Virtualize, Pool, Automate”.|
|Scalable Compute: Can the database add more compute power—from additional servers—on the fly? This is ideal for CPU bound databases.|
|Elastic Compute: Can the database add and remove compute power—from a pool of servers—on the fly? This is a superset of Scalable Compute.|
|Scalable Storage: Can the database add storage/caching capacity—from additional servers or storage devices—on the fly? This is ideal for I/O bound databases.|
|Elastic Storage: Can the database add and remove storage/caching capacity—from a pool of servers or storage devices—on the fly?|
|Highly Available Storage: Does the database continue running if a storage device or storage node fails?|
|Highly Available Compute: Does the database continue running if a compute node fails?|
|Node Virtualization: Can the application address any database node, regardless of the request? This is often referred to as multi-master.|
|Compute Mobility: Can a database compute node be moved to another server, without shutting down the database?|
|Data Mobility: Can database files be moved to another device or storage node, without shutting down the database?|
These criteria provide a convenient and visual way of representing advantages provided by each approach to virtualization and therefore the degrees of virtualization offered by each approach.
Degrees of Database Virtualization
There are a variety of ways to virtualize the database, each with its own collection of benefits. The database function comprises compute (the CPU and RAM involved in processing database requests) and storage (the files or blocks containing the data), and the two can be virtualized together or independently of each other. The different approaches to virtualization are discussed below, and each is awarded the icons that represent the database virtualization advantages it delivers.
Running the Database in a Virtual Machine:
This simply means running a single instance of the database executable on top of a virtual machine. When the database is not fully utilizing its compute and storage, this lets you release the unused resources to the pool of virtual resources. On a dedicated server, which is sized for peak usage, it is not uncommon for the database to consume an average of only 10% of the server’s capacity, so the ability to pool and repurpose these resources is a good first step toward database virtualization. The inverse of this perspective is allocating pooled resources to the database as needed, for example increasing the memory, disk or CPU available to it. This is nothing more than pooling unused resources seen from the other side, two sides of the same coin, but it is a perspective often used to describe the benefits of virtualization, earning it a mention here.
Virtualization and Sharding:
Sharding is one approach to partitioning the database, resulting in multiple identical images of the database, each storing a different unique piece of the data. For example, if you have a million users, you might have ten shards containing 100,000 users each. Since each database has an identical schema, exceeding 1,000,000 users means you simply spin up an eleventh image of the database, bring down the application, update the database abstraction layer to accommodate the additional server, and then restart the application. Virtualization and sharding are very complementary. The other advantage of combining them is that you can allocate resources to the virtual machines powering the various shards according to their needs. For example, if the shard containing users 400,001 – 500,000 uses fewer resources, you can reduce its allocation accordingly; similarly, if users 600,001 – 700,000 are heavy users of the database, you can allocate more resources to that shard.

While sharding and virtualization enjoy some synergy, sharding does not fully exploit the advantages of database virtualization. Sharding relies on a tight integration between the compute and the data files, so you cannot separate the two and scale them independently. While each shard has a common schema, each is limited to its unique piece of the database, and the application tier must know which piece (or shard) of the data is on which machine, so it can route database requests accordingly. This creates physical silos of data that are tied to machines. A sharded and virtualized database does not, by itself, provide the ability to add compute and storage on the fly (as represented by the icons above); that requires load-balancing tools and additional code in the application tier. The downside of sharding is that the shards all act as independent databases.
This means that any function that operates across these shards must be moved from the database, where it is normally handled, into the application. For example, if you want to do joins, counts, range scans, aggregates, etc. that span more than a single shard, you have to code that capability in the application tier. This creates more work, introduces more potential bugs and is less efficient than simply processing requests in the database itself. Additional Information: Wikipedia. Tools: SQL Azure Federations, CodeFutures’ dbshards, ScaleBase, various NoSQL databases.
Storage Virtualization:
Databases typically address data as blocks, versus data files. In the past, Network Attached Storage (NAS) stored data as files, while a Storage Area Network (SAN) stored data as blocks, but now both NAS and SAN provide block storage. Leading vendors of both types of storage devices provide storage virtualization, effectively mapping logical requests for data to the physical location of the data. They also implement tiered storage plans, where the most frequently used data is stored in memory, then on solid state disks (SSD or flash), and finally on rotating disks. By virtualizing the data, the most popular data can be moved, on the fly, from slower to faster storage media. Additional Information: Wikipedia. Products/Services: Virtualized storage products are available from EMC, NetApp, HP, IBM, Hitachi, Dell, Oracle, Amazon and others.
Database Storage Virtualization/Replication:
The actual files or blocks of data stored by a database do not capture all of the “state” information of the database; there are also the transactions in process. Simply copying the files or blocks to another instance of the database fails to capture this state information and yields an inconsistent copy of the data. In order to capture a consistent copy of the data for use by another instance of the database, you must maintain the links between the databases. This can be accomplished by uni-directional replication (master-slave) or bi-directional replication (master-master).

Database replication, which is built into most databases, sends information—typically the log file—which is then “played” or executed by a slave database as if the commands were sent directly to that database. Multi-master replication is a bi-directional replication system that allows writes on more than a single master and then replicates those changes across a collection of servers. This is supported by various commercial databases; Galera also provides a multi-master configuration for MySQL. More information about multi-master replication is available here. While multi-master sounds great, it is often a retro-fit or after-market addition to the database and can result in additional problems, including database inconsistency.

Data replication does not cause the database to process a log; it creates and maintains the file- or block-level synchronization itself. Examples include Linbit’s DRBD and Delphix. These approaches can be used to maintain a fail-over copy of the data, or to create copies for use in functions such as reporting/analytics, QA, test and development.
Replicated In-Memory Databases:
In an effort to provide more mobility of database instances, some companies have created in-memory databases. By maintaining all of the data and state in RAM, the database is well contained, fast and more mobile. In-memory DBMS have long been projected to replace disk-based databases, but have perennially fallen short. While working entirely in memory provides performance advantages relative to disk-based solutions, these systems have faced challenges with growing data sizes, durability and recoverability. One of the earlier in-memory-only databases was TimesTen, its name indicating a 10X performance advantage from remaining in memory. The most recent entrant to the in-memory database market is SAP’s HANA. VMware has implemented in-memory databases in an effort to make them easier to virtualize [1, 2, 3].
Sharded Databases with SQL Routing:
When using a partitioned, or sharded, database, you can run a SQL-aware load-balancing process above the various shards to route database requests to the appropriate server. This SQL-aware load balancer can operate as a separate tool that works with various underlying databases (e.g. ScaleArc), or it can all be handled under the covers by extending the database itself (e.g. MySQL Cluster and Xeround). By putting a router in front of a sharded or shared-nothing database, every database request requires two additional network hops: to the actual data and back again. If you have a single routing node, like ScaleArc, you can include caching a la Memcached. However, this approach limits throughput, since you have only one node handling all routing in order to avoid cache incoherency. Using a SQL-aware router in front of a standard database does not enhance availability of the database; it merely handles routing and/or caching. The alternate approach, used by MySQL Cluster and Xeround, sacrifices the performance boost of local caching in exchange for higher throughput by using multiple routing nodes. In order to achieve reasonable routing performance, all indexes must be maintained in memory; otherwise, database requests could incur an additional disk look-up, which has a huge impact on performance.
Shared-Data Clustered Databases:
Oracle Real Application Clusters (RAC) and ScaleDB are shared-data clusters. Shared-data databases inherently separate the compute from the storage, enabling both pieces to be virtualized and to scale independently of each other. Traditional databases suffer from a tight integration between the data and the compute nodes. By separating these two functions, and turning them into virtual building blocks, clusters can be assembled and modified on the fly, all without a single point of failure. While some of the shared-data advantages mirror those provided by sharded databases with SQL routing, there are significant differences under the covers that manifest themselves in a superior performance profile. The performance advantages are primarily in the following areas:
- Performance improvement from local caching on the first database node contacted, which frequently avoids additional network hops
- Faster cross-database functions (e.g. joins, range scans, counts, aggregates, etc.), since each database node sees the entire database and can process any function by itself, without involving other nodes.
- Indices are not limited to memory and can overflow onto disk, while still delivering excellent performance
- The ability to distribute queries to smart storage, in parallel, to further boost performance. This is analogous to MapReduce, where each smart storage node processes its portion of the data, and then the database node combines the results from the various smart storage nodes. For example, if the database gets a request for all sales in the past week, this request can be sent to the storage nodes, which process the data locally and send only the results. This both parallelizes the processing and reduces network traffic, since only the results travel over the network, not the entire table.
By separating and virtualizing the compute and storage functions of the database, shared-data DBMS deliver more of the benefits of database virtualization than any other approach. For more information on database virtualization, read this white paper.
|Running the Database in a Virtual Machine||
|Virtualization and Sharding||
|Database Storage Virtualization/Replication||
|Replicated In-Memory Databases||
|Sharded Databases with SQL Routing||
|Shared-Data Clustered Databases||