ScaleDB - Enterprise Scalability for Open Source Databases  
   
     

 
Technology

The Index
The ScaleDB technology is based on a compressed form of tries called Patricia tries. A Patricia trie differs from a standard trie in that nodes with one child are compressed into their parent node, so that all nodes have at least two children. Patricia tries achieve a high level of compression as the size of the Patricia trie does not depend on the length of inserted keys. Rather, each new key adds at most a single link and node to the index, regardless of the actual key length. Furthermore, unlike B-trees, Patricia tries grow slowly even as large numbers of strings are inserted because of the aggressive (lossy) compression inherent in the structure.

Although researchers have long known about Patricia tries, such structures have rarely been used to manage large amounts of data, especially disk-based data, because they are unbalanced and best suited for usage in main memory. What is needed is a structure that has the graceful scaling properties of Patricia tries, but is also balanced and optimized for disk-based access like B-trees. The ScaleDB is such a structure. It uses a novel layered trie approach to transform a Patricia trie into a disk-based index: extra layers of Patricia tries allow a search to proceed directly to a portion of the index that can answer a query. Every query accesses the same number of layers, providing balanced access to the index. We can think of these extra layers as a horizontal index complementing the vertical structure of the original Patricia trie (which resides in layer 0).

The search process examines one block per horizontal layer. In a disk-based index this means that the search could require one I/O per layer, unless the needed block is in the cache. The vast majority of space required by the index is for the original, vertical Patricia trie (layer 0), the other horizontal layers (layer 1,2,…n) are significantly smaller. In practice, this means that an index storing a large number of keys (e.g. a billion) requires three layers; layer 0 must be stored on disk but layers 1 and 2 can be easily stored in main memory. Key lookups require at most one I/O, for the leaf index layer (in addition to data I/O’s).

Designators
ScaleDB introduces the concept of designators to solve the problem of intermixing of unrelated search paths. Designators are special symbols or strings of symbols that are placed within key strings for the express purpose of distinguishing between key strings belonging to different search paths, while at the same time grouping together related key strings. In other words, designators are simply parts of key strings that are generated based upon some rule associated with the corresponding key, and not based upon the particular key value for the corresponding object. For the purpose of constructing the trie structure, designators are like any other part of key strings and do not require a change in any of the algorithms that operate on the structure.

As an example of the use of designators, using an example of employee and customer data, the employee name and customer name keys might use the designators “E” and “C”, respectively. Thus, the key string for employee “John Smith” would be “EJohn Smith”, while the key string for customer “John Smith” would be “CJohn Smith”. The figure below shows a simple example with four key strings, two using the designator “E” and two using the designator “C”. When searching for employee “John Smith”, we would use the query string “EJohn Smith”, and therefore would only encounter the key string for employee “John Smith” during the search traversal and not the key string for customer “John Smith”.

As demonstrated by the example, designators ensure that related key strings fall into the same subtrie. For example, both customer key strings, designated with “C”, fall into the left subtrie of to root node. At the same time, no employee key string can exist in the left subtrie, since such key strings are prepended by the designator “E”. Hence, we see that to be useful, designators must occur in the very beginning of key strings, and that this is a direct consequence of the fact that tries essentially index based on prefixes of strings. Actually, as shown below, it is often meaningful to use multiple designators in a single key string. In such cases, the key strings are composed of several components, each of which starts with a designator. We use the term designated key component to refer to components of keys whose values map into portions of key strings prepended with a designator. For example, for a composite “name-address” key for a “person” collection, we might designate the “name” key component with “N” and the “address” key component with “A”, in which case the key strings would have format “N<name>A<address>”.


Relationships Between Key Types and Designators
A key type essentially refers to a rule for constructing a key string for an object belonging to a particular collection, namely the designated key components and designators that are included in the key string, e.g., as for the above “name-address” key example. Thus, each key type is related to one or more designators. However, each designator may be shared by more than one key type, in cases where we want the same search path to potentially lead to objects in different collections, so the relationship is really “many-to-many”. For example, if we want a search on “name” to lead to both “employee” objects and “customer” objects, we might use the “N” designator for both the “employee.name” and “customer.name” keys. Perhaps the most important use of this feature would be for data integration, e.g., for allowing search access to two customer databases with very different schemas without actually merging the two.


Indexing Relationships
The ability to index long strings can be exploited to index relationships between objects, simply by concatenating the keys of the objects. For example, say we have a “customer” collection and an “invoice” collection, where each invoice is related to a single customer. This relationship can be captured by indexing the composite key “customer.ID-invoice.ID”, where the “ID” key in each collection is some unique identifier. This key would provide a search path through which we can, for example, find all invoices by a certain customer.

The example above is of a “one-to-many” nature, since one customer can relate to many invoices. “One-to-one” and “many-to-many” relationships can be represented in a similar manner. For example, a “product” collection and the “invoice” collection would have a “many-to-many” relationship, since each product can occur in multiple invoices and each invoice can contain multiple products. We might want to capture this relationship through both a “product.ID-invoice.ID” and a “invoice.ID-product.ID” composite keys, so that we can locate all invoices containing a particular product and all products in a particular invoice, respectively.

Relationships such as those mentioned above can be represented in various data representations, e.g., based on the relational data model. However, the most natural representation is one similar to the hierarchical or object-oriented data models, wherein all objects participating in a relationships can be located without further index lookups. Thus, in such a representation, when an object o1 is “subordinate” to an object o2 in a “one-to-one” or “one-to-many” relationship, o1 essentially incorporates a reference to o2 that allows accessing it. For example, in the case of “customer” and “invoice” example, each “invoice” object would contain a reference to the corresponding “customer” object.

For “many-to-many” relationships, such as the “product-invoice” relationship, it is not sufficient to store references in the objects themselves, since each participating object can be related to multiple objects. Instead, we must use some sort of “join” objects that contain references to all objects participating in a relationship. Thus, for example, for the “product-invoice” relationship, we would have “join” objects that represent a product occurring in a particular invoice. This “join” object can then be indexed according to both composite keys mentioned above, and would allow accessing the associated “product” and “invoice” objects.
 

[ Back to top ]

 

Home        About Us        Terms of Use        Privacy Policy        FAQ        Contact Us

 

© Copyright 2006-2008 - ScaleDB - All Rights Reserved.