In the following section we provide a brief technical overview of the HDT internals. Read “What is HDT” for a basic introduction to HDT.
|1. Introduction |
5. Querying HDT-encoded datasets
6. HDT in Numbers
The current Web of Data is producing increasingly large RDF datasets. The need to exchange large datasets has unveiled the drawbacks of traditional RDF representations: high levels of verbosity/redundancy and weak machine-processable capabilities. Similar problems arise with smaller datasets in smartphones and embedded devices. This scenario calls for efficient formats for publication, exchange and consumption of RDF.
RDF HDT is a data-centric format which reduces verbosity in favor of machine-understandability and data management. The specification defines a compact binary serialization format optimized for storage or transmission over a network.
An HDT-encoded dataset is composed by three logical components (Header, Dictionary, and Triples), carefully designed to address RDF peculiarities. They are shown in Figure 1.
- Header: The Header holds metadata describing an HDT semantic dataset using plain RDF. It acts as an entry point for the consumer, who can have an initial idea of key properties of the content even before retrieving the whole dataset.
- Dictionary: The Dictionary is a catalog comprising all the different terms used in the dataset, such as URIs, literals and blank nodes. A unique identifier (ID) is assigned to each term, enabling triples to be represented as tuples of three IDs, which reference their respective subject/predicate/object term from the dictionary. This is a first step toward compression, since it avoids long terms to be repeated again and again. Moreover, similar strings are now stored together inside the dictionary, fact that can be exploited to improve compression even more.
- Triples: As stated before, the RDF triples can now be seen as tuples of three IDs. Therefore, the Triples section models the graph of relationships among the dataset terms. By understanding the typical properties of RDF graphs, we can come up with more efficient ways of representing this information, both to reduce the overall size, but also to provide efficient search/traversal operations.
Figure 1: HDT Components: Header-Dictionary-Triples
The main strengths of the HDT format are:
- Compactness: Use less space to encode the same information, therefore saving storage, communication bandwidth and transfer times. Designed taking into account the peculiarities of RDF, improves the figures of general compressors, such as GZIP, BZIP2.
- Clean Publication and exchange: Keeps the metadata easily accessible in the Header, with information such as provenance as statistics. The content separates the terms (Dictionary) from the structure of relationships among them (Triples).
- On-demand indexed access: permits fast search operations to parts of the file without having to decompress it as a whole.
In summary, the binary HDT file can be directly loaded into the memory of a computational system and accessed as a data structure, therefore avoiding the expensive indexing operations. Thanks to the compression techniques, more data is readily available at the higher levels of the memory hierarchy. Data in faster memory always means faster retrieval operations.
The Header component is responsible for providing metadata about an RDF dataset. Whereas current other serialization formats do not provide any means or even best practices on how to publish metadata along with datasets, in HDT metadata is a first-class citizen with a dedicated place as part of the header information. We consider the Header as a flexible component in which the data provider may include a desired set of features:
- Publication metadata: Collects the metadata about the publication act such as the site of publication, dates of creation and modification, version of the dataset (which could be useful for updating notifications), language, encoding, namespaces, etc. It also includes all kind of authority information about the source (or sources) of data. Many properties of this type are described using the popular Dublin Core Vocabulary.
- Statistical metadata: provides statistical information about the dataset, such as the number of triples, the number of different subjects, predicates, objects, or even histograms. For instance, this class of metadata is very valuable for visualization software or federated query evaluation engines.
- Format metadata: describes how Dictionary and Triples components are encoded. This allows to have different implementations or representations of the same data in different ways. These metadata enable the consumer for checking how an HDT-encoded dataset can be accessed in the data structure.
- Additional metadata: allows specific dataset/application metadata to be described. For instance, in life sciences a publisher might want to describe, in the Header, that the dataset describes a specific class of proteins.
2.1 Header implementation overview
The Header component is expressed in plain RDF, and the use of standard vocabularies, such as VoiD, is strongly recommended. VoiD is an RDF Schema vocabulary for expressing metadata about RDF datasets, a mechanism for publishers to report their data, and an entrance point for consumers to discover and retrieve datasets.
We refer to hdt as the vocabulary extension, with the namespace http://purl.org/HDT/hdt#hdt. The HDT Vocabulary lists the full set of elements.
The structure of the proposed Header is represented in Figure 2. A Header must include at least one resource of type hdt:Dataset, described by four top-level statements (containers), which correspond to the different types of metadata described before. Exactly one format metadata description must be present in order to manage the dataset, and hdt:dictionary and hdt:triples definition (or subproperties of them) are required.
Figure 2: The structure of the proposed Header of HDT
The following example is a Header in Turtle syntax for an RDF graph.
Example 1. A Header in HDT
@prefix void: <http://rdfs.org/ns/void#>. @prefix dc: <http://purl.org/dc/terms/>. @prefix foaf: <http://xmlns.com/foaf/0.1/>. @prefix hdt: <http://purl.org/HDT/hdt#>. @prefix xsd: <http://www.w3.org/2001/XMLSchema#>. @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>. @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>. @prefix swp: <http://www.w3.org/2004/03/trix/swp-2/>. <http://example.org/ex/DBpediaEN> a hdt:Dataset ; a void:Dataset ; hdt:publicationInformation _:publication ; hdt:statisticalInformation _:statistics ; hdt:formatInformation _:format ; hdt:additionalInformation _:additional; void:triples "431440396" ; void:properties "57986" ; void:distinctSubjects "24791728" ; void:distinctObjects "108927201" . _:publication dc:issued "2012-11-23T23:17:50+0000" ; dc:license <http://www.gnu.org/copyleft/fdl.html> ; dc:publisher [ a foaf:Organization ; foaf:homepage <http://www.dbpedia.org>] ; dc:source <http://downloads.dbpedia.org/3.8/en> ; dc:title "DBpediaEN" ; void:sparqlEndpoint <http://www.dbpedia.org/sparql> . _:statistics hdt:originalSize "110630364018" ; hdt:hdtSize "3082795954" . _:format hdt:dictionary _:dictionary ; hdt:triplesBitmap _:triples . _:dictionary dc:format hdt:dictionaryFour ; hdt:dictionaryNamespaces [hdt:namespace [hdt:prefixLabel "dbpedia" ; hdt:prefixURI "http://dbpedia.org/resource/"]] ; hdt:dictionarynumSharedSubjectObject "22762644" ; hdt:dictionarysizeStrings "1026354060" ; hdt:dictionaryBlockSize "8" . _:triples dc:format hdt:triplesBitmap ; hdt:triplesOrder "SPO ; hdt:triplesnumTriples "431440396" . _:additional swp:signature "AZ8QWE..." ; swp:signatureMethod "DSA" .
The main goal of the Dictionary is to contribute to compactness by the assignation of a unique ID to each element in the dataset. Thus, the use of the dictionary implies two important and minimum operations:
locate(element): returns a unique identifier for the given element, if it appears in the dictionary.
extract(id): returns the element with identifier id in the dictionary, if it exists.
The dictionary is divided into sections depending on whether the term plays subject, predicate, or object roles. Nevertheless, in semantic data it is quite common that a URI appears both as a subject in one triple and as object on another. To avoid repeating those terms twice in the subjects and in the objects sections, we can gather them into a fourth section called shared Subject-Object.
Figure 3 depicts the 4-section dictionary organization and how IDs are assigned to the corresponding terms. Each section is sorted lexicographically and then correlative IDs are assigned to each term from 1 to n. It is worth noting that, for subjects and objects, the shared Subject-Object section uses the lower range of IDs; e.g. if there are m terms playing interchangeably as subject and object, all IDs x such that x <= m belong to this shared section.
Figure 3: HDT dictionary organization into four sections.
HDT allows to use different techniques of dictionary representation. Each one can manage its catalog of terms in different ways, as long as they implement the mandatory locate/extract operations. In addition they can implement additional enhanced operations, such as prefix, full-text or even regular expression searches.
3.1 Dictionary implementation overview
We implement a Front-Coding (Witten, Moffat, and Bell 1999) based representation as the most simple way of dictionary encoding. It has been successfully used in many WWW-based applications involving URL management. It is a very simple yet effective technique based on differential compression. This technique is applied to lexicographically sorted dictionaries by dividing them into buckets of b terms. The first term in the bucket is explicitly stored and the remaining b – 1 ones are encoded based on their precedent: first the length of common prefix of the current term compared to the previous, and then the remaining suffix is appended in plain format. By tweaking the bucket size, different space/time tradeoffs can be achieved.
HDT proposes a Triples encoding called BitmapTriples (BT) that organizes the information in a way that exploits graph redundancy to keep the information compact. Moreover, this encoding can be easily mapped into a data structure that allows basic retrieval operations to be performed efficiently.
BT needs the triples to be previously sorted in a specific order, such as subject-predicate-object (SPO). BT is able to handle all possible triple orderings, but for the explanation we will focus on the more intuitive SPO order. Basically, BT transforms the graph into a forest containing as many trees as different subjects are used in the dataset, and these trees are then ordered by subject ID. This way, the first tree represents all triples rooted by the subject identified as 1, the second tree represents all triples rooted by the subject identified as 2, and so on. Each tree comprises three levels: the root represents the subject, the second level lists all predicates related to the subject, and finally the leaves represent all objects for their path (subject, predicate). Predicate and object levels are also sorted, as shown in Figure 4 (middle).
Figure 4: Description of Bitmap Triples.
Then, the BT forest model is physicaly encoded layer by layer. The top layer (subjects) is always a correlative list from 1 to the total number of subjects, due to the correlative assignation of IDs in the dictionary and the SPO order of the triples. Therefore the subjects do not need to be encoded and is left implicit. This is an first obvious spatial saving.
The predicate layer is stored using i) a sequence of predicate IDs (Sp); ii) a bitsequence (Bp) containing one bit per element in Sp: A 1 bit indicates that the predicate is the first child of its parent subject in the tree, the remaining siblings are marked with 0. This allows knowing to which subject is each predicate associated by counting the number of 1’s in the bitmap up to that position, or even locating the range of predicates associated to the n-th subject by locating the n-th and (n+1)-th 1 in the bitmap. The object layer is stored likewise, using a sequence of Object IDs (So) together with a bit sequence (Bo). Each 1 bit indicates the first object of each parent predicate, therefore allowing traversals in the tree upwards and downwards.
5. Querying HDT-encoded datasets
Triple patterns are the SPARQL query atoms for basic RDF retrieval. That is, all triples matching a template (s; p; o) (where s, p and o may be variable) must be directly retrieved from the Triples encoding. An HDT-encoded dataset can be directly accessed once its mapped into main memory. Nevertheless, it does not support all kind of triple patterns directly on that structure, it is restricted to SPO, SP?, S?? and ??? queries. Thus, once the HDT-encoded dataset is loaded into the memory hierarchy, we slightly enrich the representation with additional succinct data structures to support the remaining triple patterns to be solved efficiently. The final fully queryable representation is called HDT-FoQ: HDT Focused on Querying. In the following we briefly present the main extensions. The technical specification of the structures can be found in our publication (Martínez-Prieto, Arias, and Fernández 2012).
The encoded Front-Coding dictionary already has enough information to resolve the previously described locate/extract operations. Thus, this component follows the idea of “the data are the index”. We invite interested readers to review the paper of (Brisaboa, Canovas, Claude, Martínez-Prieto, and Navarro 2011) for a more detailed description on how encoded dictionaries provide indexed access capabilities.
The previously described BitmapTriples approach is easy to map due to the simplicity of its encoding. Sequences Sp and So are loaded into two integer arrays using respectively log(|P|) and log(|O|) bits per element. Bitsequences can also be mapped directly, but in this case they are enhanced with an additional small structure (Gonzalez, Grabowski, Makinen, and Navarro 2005) that ensures constant time resolution for some basic bit-operations, such as locating a certain 1-bit or 0-bit and count the number of 1-bits or 0-bits up to a given position. All fine-grain details about these data structures are explained in (Martínez-Prieto, Arias, and Fernández 2012).
Thus, a SPO BT structure provides fast subject-based retrieval when mapped into memory, but is access by predicate and object is very inefficient. To paliate this issue we enhance the original data structure with an additional index built at loading time to complement the HDT. Without going into too much detail, the original Sp is now loaded as a wavelet tree (Grossi, Gupta, and Vitter 2003), instead of an array. A wavelet tree is a succinct structure which organizes a sequence of integers, in a range [1; n], to provide some access and seek operations to the data in logarithmic time. Finally, we enhance the triples component with an additional index (called O-Index), that is responsible for solving accesses by object. This index basically gathers the positions in where each object appears in the original So. Again, all details are published in (Martínez-Prieto, Arias, and Fernández 2012).
All this infrastructure enables basic triple patterns to be resolved very efficiently in compressed space. This kind of queries are massively used in practice, but the SPARQL core is defined around the concept of Basic Graph Pattern (BGP) involving more than a single triple pattern, and additional semantics to build conjunctions, disjunctions, and more operators. Thus, HDT-FoQ must be extended with more advanced query resolution techniques to reach a full SPARQL coverage. At this moment, it is able to resolve conjunctive queries by using the well-known merge and index join algorithms.
6. HDT in numbers
Once we described HDT, it is time to compare it against existing solutions. Here we focus on the Publish-Exchange-Consume scenario, where a publisher makes the data available on the Web for consumers to download and use. We understand that consuming means doing something useful with the data, i.e. issuing queries to extract knowledge, task that requires an index to traverse the data efficiently. So we first analyze the times required for each stage, and then the overall time. Finally we compare the query performance against the state-of-the-art RDF Stores RDF-3x and Virtuoso Open Source. The experiments were run on a server machine Intel Xeon E5645@2.4GHz with 96GB of DDR3 Ram and a modest consumer machine AMD-PhenomTM-II X4 firstname.lastname@example.orgGHz, quad-core with 8GB of DDR2 RAM. All the details are in our paper (Martínez-Prieto, Arias, and Fernández 2012).
The first question is, how much can HDT compress the data? As seen in Figure 5, HDT takes between 6% and 11% space compared to the original NTriples, which is a significant reduction. Compared to NTriples+GZIP, HDT is slightly bigger in some cases, because HDT must keep direct access to the data, so it cannot apply very aggressive compression techniques that would otherwise hamper query performance. However, we can complement the HDT with GZIP or LZMA. Given that HDT keeps the data packed and sorted internally, the generic compression algorithms have more chances to find repetitions, which results in smaller sizes. Therefore, for exchanging HDT files over a network, we recommend using GZIP in addition to HDT, which results in roughly half the size of NTriples+GZIP.
Figure 5: Compression ratio of HDT.
Download and Decompress
Now, on the client side, the user needs to download and decompress the file. Due to the HDT file being smaller, the download time is reduced to a half, but also the decompression time is smaller because the workload of the general decompression algorithm is now lightweight, having smaller input and output files.
Figure 6: Time to Download a Decompress a dataset, assuming a Network connection of 2MByte/s.
The most costly operation when dealing with RDF datasets is indexing, especially for big datasets whose index does not fit in main memory. In HDT, most of the burden is done already in the server, so only the complementary index (the Wavelet and the O-Index) need to be generated in client side. As seen in Figure 7, the difference here is huge. HDT is able to have the data ready to start querying in just a few seconds, or in a few minutes even for really big datasets such as DBPedia and Freebase.
|LinkedMDB||1 min 51 sec||1.91 sec|
|DBLP||23 min 7 sec||16.7 sec|
|Geonames||44 min 51 sec||44.9 sec|
|DBPedia 3.7||2 hour 11 min||2 min 9 sec|
|Freebase||7 hour 5 min||4 min 46 sec|
Table 1. Client-side indexing Time.
Overall Client Time
If we measure the time since the user starts downloading until she is ready to start querying/browsing, HDT reduces the waiting time up to 20 times. As you can see, in just over 3 minutes you can have Geonames to complement your semantic application, or in 17 minutes you can have a local queryable copy of DBPedia, including download time!
Figure 7: Comparison of the total time required since the start to download a dataset until it is ready to query. Less is better.
|LinkedMDB||4.1 min||9.21 sec|
|DBLP||27 min||2.02 min|
|Geonames||49.2 min||3.04 min|
|DBPedia 3.7||3 hour 9 min||17.3 min|
|Freebase||8 hour 43 min||23.4 min|
Table 2. Total time to start querying a dataset.
One strength of HDT is its size. Thanks to the compression, most of the data can be kept in main memory, which is several orders of magnitude faster than disks. As you can see in Table 3, the size of an HDT together with its additional indices (HDT-FoQ) is up to 5 times smaller than typical RDF Stores, such as Virtuoso / RDF-3x. In a commodity machine with 8 GB of RAM, you can have almost all DBPedia or Freebase in main memory. This means less disk access during querying and therefore better performance.
Table 3. Total index size.
To test the speed we tried the most basic access operations on RDF; triple patterns. We randomly generated batches of queries of each type for two datasets and we ran the batch of queries on the C++ version of HDT and other RDF-Stores using just one thread of execution. Figure 8 shows that HDT is up to 10 times faster in certain cases, especially when the subject or object is fixed, which are very common operations. For the queries ?P? and ?PO that return thousands of results HDT is in the same order of magnitude as their competitors.
Figure 8: Time HDT is faster solving simple triple patterns, compared to state-of-the-art solutions. Higher means HDT is better.
Based on the above figures, HDT is very interesting for applications where performance is a major requirement. With current RAM prices, it is getting affordable to have the whole dataset in main memory and dispatch millions of queries in parallel.
[Brisaboa, Canovas, Claude, Martínez-Prieto, and Navarro 2011]
Brisaboa, N., R. Canovas, F. Claude, M. Martínez-Prieto, and G. Navarro (2011). Compressed String Dictionaries. In Proc. of 10th International Symposium on Experimental Algorithms (SEA), pp. 136-147.
[González, Grabowski, Mäkinen, and Navarro 2005]
González, R., S. Grabowski, V. Mäkinen, and G. Navarro (2005). Practical implementation of rank and select queries. In Proc. of 4th International Workshop Experimental and Efficient Algorithms (WEA), pp. 27-38.
[Grossi, Gupta, and Vitter 2003]
Grossi, R., A. Gupta, and J. Vitter (2003). High-order entropy-compressed text indexes. In Proc. of 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 841-850.
[Martínez-Prieto, Arias, and Fernández 2012]
Martínez-Prieto, M., M. Arias, and J. Fernández (2012). Exchange and Consumption of Huge RDF Data. In Proc. of the 9th Extended Semantic Web Conference (ESWC), pp. 437-452.
[Witten, Moffat, and Bell 1999]
Witten, I. H., A. Moffat, and T. C. Bell (1999). Managing Gigabytes : Compressing and Indexing Documents and Images. Morgan Kaufmann.