HDT Binary Format
DRAFT VERSION 03/07/2015
In the following we provide a technical description of the HDT binary serialization format (v.1.0).
Important note: Initial HDT specifications were published on 30 March 2011 as a W3C Member Submission. Since then, we have continued working to improve the format. This document in aimed at superseding the binary format details from the W3C submission and it is the format followed by all the tools and libraries provided in this site.
2. Control Information
HDT is a binary format that represents RDF Data in a compact manner. This document specifies the structure of an HDT file, including the headers, subsections and all details to implement an HDT reader/generator in any language / platform.
The HDT file is composed by three parts:
- The Header, that provides metadata about the dataset in plain RDF format.
- The Dictionary, that provides a mapping between strings and unique identifiers (IDs).
- The Triples, that encodes the RDF graph using IDs.
Each HDT file starts with a global “control information” section, and then one header, one dictionary and one triples section. Each section has another “control information” at the beginning, followed by section-specific internal data structures.
All multibyte fields are little endian, unless specified otherwise.
We use CRC (Cyclic Redundancy Code) to detect corrupted data within the file. We use:
- CRC8-CCIT. With polynom x^8 + x^2 + x + 1, in hex: 0×07
- CRC16-ANSI. With polynom x^16 + x^15 + x^2 + 1, in hex: 0×8005
- CRC32C. With polynom x^32 + x^28 + x^27 + x^26 + x^25 + x^23+ x^22 + x^20 + x^19 + x^18 + x^14 + x^13+ x^11 + x^10 + x^9 + x^8 + x^6 + 1, in hex: 0x1EDC6F41.
2. Control Information
It is a preamble that describes a chunk of information.
|HDT Cookie||Control Type||Format||[Properties]||CRC|
- HDT Cookie: The magic keyword ‘$HDT’, as four ASCII characters (No Null terminated).
- Control Type. Byte. One of: ( Unknown, Global, Header, Dictionary, Triples, Index ) as integer [0-5].
- Format: Null-Terminated String. URI identifier of the implementation of the following section.
- List of properties: List of key-value entries separated by semicolon, using an equal symbol to separate the key and value, and finished by a NULL character. Only ASCII characters are permitted, both ‘=’ and ‘;’ are prohibited. Please note that the last key value pair must also be terminated by a ‘;’ symbol. Example:
- CRC16 of the ControlInformation. Calculated since the first symbol of the magic keyword to the null terminator of the list of properties (both included).
2.1 Global Control Information
The HDT file starts with a global control information. It is a ControlInformation with the following specific properties:
- Type: Set to Global.
- Format: Set to the URI of the HDT Container specification. i.e. http://purl.org/HDT/hdt#HDTv1
- BaseURI (optional). Base URI for all the format sections of the file. If they specify a relative URI it will be concatenated to this BaseURI. If they specify a full URI, then BaseURI is ignored in that case. If this field is omitted, all format URIs must be absolute.
- Software (optional). Software that generated this file (URI).
The Header component is responsible for providing metadata about an RDF dataset. The concrete metadata (authors, issue dates, statistics of the data, a summary of the content, etc.) is optional and thus is not a subject of the format. Nonetheless, it is stated that the metadata should be an RDF graph. This allows expressing metadata about the data set (originally in RDF) with an RDF syntax, which can be discovered and used through well-known mechanisms, such as SPARQL Endpoints. The use of VoID as the main vocabulary of the Header and/or an extension of VoID for publishing HDT are strongly recommended.
As an example, we show the plain Header implemented by default in all the tools and libraries provided in this site:
First, the Header starts with a ControlInformation with the following specific properties:
- Type: Set to Header.
- Format: Set to the format of the header data, which is “ntriples” in plain header.
- length. The number of bytes of the header data.
Then, the header data is listed in ntriples (See example).
|<http://data.semanticweb.org/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/HDT/hdt#Dataset> .
<http://data.semanticweb.org/> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdfs.org/ns/void#Dataset> .
<http://data.semanticweb.org/> <http://rdfs.org/ns/void#triples> “242256″ .
<http://data.semanticweb.org/> <http://rdfs.org/ns/void#properties> “170″ .
<http://data.semanticweb.org/> <http://rdfs.org/ns/void#distinctSubjects> “23310″ .
<http://data.semanticweb.org/> <http://rdfs.org/ns/void#distinctObjects> “76529″ .
<http://data.semanticweb.org/> <http://purl.org/HDT/hdt#statisticalInformation> _:statistics .
<http://data.semanticweb.org/> <http://purl.org/HDT/hdt#publicationInformation> _:publicationInformation .
<http://data.semanticweb.org/> <http://purl.org/HDT/hdt#formatInformation> _:format .
_:format <http://purl.org/HDT/hdt#dictionary> _:dictionary .
_:format <http://purl.org/HDT/hdt#triples> _:triples .
_:dictionary <http://purl.org/dc/terms/format> <http://purl.org/HDT/hdt#dictionaryFour> .
_:dictionary <http://purl.org/HDT/hdt#dictionarynumSharedSubjectObject> “23128″ .
_:dictionary <http://purl.org/HDT/hdt#dictionarymapping> “1″ .
_:dictionary <http://purl.org/HDT/hdt#dictionarysizeStrings> “5158239″ .
_:dictionary <http://purl.org/HDT/hdt#dictionaryblockSize> “8″ .
_:triples <http://purl.org/dc/terms/format> <http://purl.org/HDT/hdt#triplesBitmap> .
_:triples <http://purl.org/HDT/hdt#triplesnumTriples> “242256″ .
_:triples <http://purl.org/HDT/hdt#triplesOrder> “SPO” .
_:statistics <http://purl.org/HDT/hdt#originalSize> “47009552″ .
_:statistics <http://purl.org/HDT/hdt#hdtSize> “5945520″ .
_:publicationInformation <http://purl.org/dc/terms/issued> “2012-11-29T16:14:18+0000″ .
The dictionary is composed by an initial ControlInformation followed by type-dependent data.
The control information must have the following fields:
- Type: Set to Dictionary.
- Format. The URI of the dictionary’s specific type
- mapping. Type of numbering for the object section regarding to the shared. Mapping 1 means that IDs of objects go just after shared section, Mapping 2 that object IDs should follow subjects IDS. For instance if we had 2 shared, 1 subject and 3 objects, the objects with mapping 1 would be [ 1, 2, 3, 4 ], with mapping 2: [ 1, 3, 4, 5].
- elements. Total number of different strings in the dictionary.
Note that the field “format” determines the implementation of dictionary, which could be application specific. In the following, we show the FourSectionDictionary implemented by default in all the tools and libraries provided in this site. The format for this particular dictionary is:
FourSectionDictionary is an implementation of Dictionary that splits the dictionary into four sections:
- Shared subject-objects. Strings that appear at least one as a subject, and at least once as an object.
- Subjects. Strings that appear only as subject.
- Predicates. Strings that appear as predicate.
- Objects. Strings that appear only as object.
As stated, this dictionary is specified by denoting the format to
<http://purl.org/HDT/hdt#dictionaryPlain> in the Control information. After that, the dictionary is encoded by DictionarySection:
<DictionarySection shared> <DictionarySection subjects> <DictionarySection predicates> <DictionarySection objects>
A dictionary section is a data structure that maps a list of strings to their numerical IDs and vice-versa. The IDs must be correlative starting from 1.
A section starts with an unsigned 32bit value preamble denoting the type of dictionary implementation:
- 1: DictionarySectionPlain. Stores all the strings in plain format, one per line. Given the large data output that this codification produces, this is deprecated.
- 2 (by default): DictionarySectionPlainFrontCoding. Stores the strings in blocks. On each block we use front coding to pack the strings.
- 3-6 (reserved): This range is reserved for specific string dictionaries (3:HTFC, 4:FMINDEX, 5: REPAIRDAC, 6:HASHHUFF), already implemented in our C++ library.
This DictionarySection implementation stores strings in blocks of “BlockSize” strings, using plain front coding to encode the strings internally on each block.
It contains a buffer binary area that packs all the blocks consecutively. The first string of each block is saved in plain format. The rest are represented using the number of common prefix characters, followed by a null-terminated string with each suffix. The last block may have less than “BlockSize” strings.
The data is always preceded by a metadata preamble with the following information:
VByte:<Total number of strings stored> VByte:<Length of the buffer containing the packed strings> <CRC8 of the previous fields> <LogArray Position, in number of characters, where each block starts> Note: The number of saved blocks can be calculated by the number of entries of the LogArray block pointers. <Buffer of Blocks> <First String of Block> <VByte Number of Common prefix characters> <Suffix String> <NULL Character> … as many times as necessary to fit “BlockSize” strings. <CRC32 of Buffer of Blocks>
A triples section stores the RDF graph structure using IDs. It contains a ControlInformation with the following required fields:
- Type: Set to Triples.
- Format. The URI of the triple’s specific type
- order: An unsigned 32bit value denoting the order of the triples: Unknown, SPO, SOP, PSO, POS, OSP, OPS (as [0-6]).
- numTriples: An unsigned 32bit value counting the number of total triples.
Then it is followed by the type-specific data structures to encode the triples. Similar to the dictionary, note that the field “format” determines the concrete implementation of the triples, which could be application specific. In the following, we show two variants, the TriplesList and BitmapTriples, implemented by default in all the tools and libraries provided in this site. The format for these particular variants is
Represents the graph structure as a list of Triples as IDs. Each tuple is represented as three unsigned 32bit values for subject, predicate, and object respectively. It has as many entries, as specified by the “numTriples” field of the ControlInformation.
<ControlInformation> <ArrayofTriples> <CRC32 of ArrayofTriples>
Represents the RDF graph as a three level tree (X, Y, Z), one for each component (S, P, O) in the order specified by the order field of the ControlInformation section. See example (in SPO order).
The bitmapTriples uses two arrays, Array Y and Array Z, to encode the two lower levels, and two bitmaps (sequences of bits), BitmapZ and BitmapY, to encode the parent/children relationships: in particular, a 1-bit denote that the corresponding element in the array is the last of the siblings.
The top level is left implicit.
<ControlInformation> <BitmapY> <BitmapZ> <ArrayY> <ArrayZ>
See example (in SPO order).
BitmapY and BitmapZ bits are always preceded by a metadata preamble with the following information:
Byte:<Format of bitmap (by default, and reserved, is "1")> VByte:<Total number of bits stored> <CRC8 of the previous fields> <Bits (byte aligned, little endian) <CRC32 of the previous bits>
Likewise, ArrayY and ArrayZ are also preceded by a metadata preamble, with the following information:
Byte:<Format of array: Three reserved implementations: Log64, uint32 and uint64, as [1,3] > VByte:<Total number of entries stored> <CRC8 of the previous fields> <Entries, codified in the specified format: In Log64, logbits of the number of entries in uint64 (byte aligned, little endian), or uint32 or uint64 entries. <CRC32 of the previous entries>
[VByte] Hugh E. Williams and Justin Zobel. Compressing integers for fast file access. The Computer Journal, 42:193–201, 1999.
[PlainFrontCoding] Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, and Gonzalo Navarro. Compressed String Dictionaries. In Proc. of SEA, pages 136–147, 2011.