Wikidata entity dumps
The Wikidata dataset is a dump of all the entries of the Wikidata website. Wikidata serves as the central storage for the “wiki” sites: Wikipedia, Wiktionary, etc. The data is semi-structured.
Description of the data set
The Wikimedia Foundation publishes the datasets organized by year and file format. We have chosen the one below, from March 2015:
https://ia800305.us.archive.org/15/items/wikidata-json-20150323/20150323.json.gz
- The uncompressed data takes about 40.8 GB in JSON format.
- The storage path of the data in HDFS is /data/wikimedia/wikimedia.json.
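As a minimal sketch of how a file of this size can be read without loading it in memory, the Python snippet below streams the compressed dump line by line. It assumes the layout commonly used by Wikidata JSON dumps, a single JSON array with one entity per line, which is not stated on this page. The function name iter_entities and the local file name in the comment are hypothetical.

```python
import gzip
import json
from typing import Iterator


def iter_entities(path: str) -> Iterator[dict]:
    """Yield one entity dict per line of a gzip-compressed JSON dump.

    Assumption: the dump is a JSON array with one entity per line,
    each line ending with a comma except the last one.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            # Remove whitespace and the comma separating array elements.
            line = line.strip().rstrip(",")
            # Skip blank lines and the "[" / "]" lines that wrap the array.
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)


# Example with a hypothetical local copy of the dump:
# for entity in iter_entities("20150323.json.gz"):
#     print(entity["id"], entity["type"])
```

Because the function is a generator, only one entity is held in memory at a time.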
Schema of the dataset
The data is semi-structured into items, each one having a label, a description and any number of aliases. Items are uniquely identified by a Q followed by a number. The MediaWiki site gives more information about the schema.
The top-level structure of each entry in the dataset is described below.
{
"id": "Q60" \\ String \\ The canonical ID of the entity.
"type": "item" \\ String \\ The entity type identifier. “item” for data items, and “property” for properties.
"labels": {} \\ Struct \\ It contains the labels in different languages
"descriptions": {} \\ Struct \\ It contains the descriptions in different languages
"aliases": {} \\ Struct \\ It contains aliases in different languages
"claims": {} \\ Struct \\ It contains any number of statements, grouped by property
"sitelinks": {} \\ Struct \\ It contains site links to pages on different sites describing the item
"lastrevid": 195301613 \\ String \\ The JSON document's version (this is a MediaWiki revision ID)
"modified": "2015-02-10T12:42:02Z" \\ String \\ The JSON document's publication date (this is a MediaWiki revision timestamp)
}
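To illustrate how these top-level fields might be consumed, here is a small Python sketch that extracts a few facts from one parsed entity. The function name summarize and the keys of the returned dict are hypothetical. The sample values are the ones used on this page, with the structs reduced to a minimum.

```python
from datetime import datetime, timezone


def summarize(entity: dict) -> dict:
    """Collect a few top-level facts about one parsed entity."""
    # "modified" is a UTC timestamp such as "2015-02-10T12:42:02Z".
    modified = datetime.strptime(
        entity["modified"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "id": entity["id"],
        "is_item": entity["type"] == "item",
        # Keys of "labels" are language codes, keys of "claims" are property IDs.
        "label_languages": sorted(entity.get("labels", {})),
        "claim_properties": sorted(entity.get("claims", {})),
        "modified": modified,
    }


sample = {
    "id": "Q60",
    "type": "item",
    "labels": {"en": {"language": "en", "value": "New York City"}},
    "descriptions": {},
    "aliases": {},
    "claims": {"P17": []},
    "sitelinks": {},
    "lastrevid": 195301613,
    "modified": "2015-02-10T12:42:02Z",
}

summary = summarize(sample)
print(summary["id"], summary["is_item"], summary["modified"].isoformat())
# prints: Q60 True 2015-02-10T12:42:02+00:00
```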
For example, the previous entry (Q60) can be found on the Wikidata website.
Labels, descriptions and aliases are represented by the same basic data structure. For each language, there is a record with the fields language and value. Labels and descriptions hold one record per language, while aliases hold a list of records per language, as follows:
"labels": { "en": { "language": "en", "value": "New York City" },
"ar": { "language": "ar", "value": "\u0645\u062f\u064a\u0646\u0629 \u0646\u064a\u0648 \u064a\u0648\u0631\u0643" }
},
"descriptions": { "en": { "language": "en", "value": "largest city in New York and the United States of America" },
"it": { "language": "it", "value": "citt\u00e0 degli Stati Uniti d'America" }
},
"aliases": { "en": [{ "language": "en", "value": "NYC" },
{ "language": "en", "value": "New York" }
],
"fr": [{ "language": "fr", "value": "New York City"},
{ "language": "fr", "value": "NYC" },
{ "language": "fr", "value": "The City" },
{ "language": "fr", "value": "City of New York" },
{ "language": "fr", "value": "La grosse pomme" }
]
}
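The structure above can be read with two small helpers, sketched here in Python. The names get_term and get_aliases are hypothetical. The sample entity reuses a subset of the values shown above.

```python
from typing import List, Optional


def get_term(entity: dict, field: str, language: str) -> Optional[str]:
    """Return the label or description text for one language, or None.

    `field` is expected to be "labels" or "descriptions".
    """
    record = entity.get(field, {}).get(language)
    return record["value"] if record else None


def get_aliases(entity: dict, language: str) -> List[str]:
    """Return all alias texts for one language (empty list if none)."""
    records = entity.get("aliases", {}).get(language, [])
    return [record["value"] for record in records]


nyc = {
    "labels": {"en": {"language": "en", "value": "New York City"}},
    "descriptions": {
        "it": {"language": "it", "value": "citt\u00e0 degli Stati Uniti d'America"}
    },
    "aliases": {
        "en": [
            {"language": "en", "value": "NYC"},
            {"language": "en", "value": "New York"},
        ]
    },
}

print(get_term(nyc, "labels", "en"))        # prints: New York City
print(get_term(nyc, "descriptions", "en"))  # prints: None
print(get_aliases(nyc, "en"))               # prints: ['NYC', 'New York']
```

Returning None or an empty list for a missing language avoids a KeyError on entities that are not translated everywhere.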
Claims are grouped by property. Each property ID maps to a list of statements, and the fields present in a statement vary across the dataset. The main ones are listed below.
"claims": {
"P17": [
{
"id": "q60$5083E43C-228B-4E3E-B82A-4CB20A22A3FB",
"mainsnak": {},
"type": "statement",
"rank": "normal",
"qualifiers": {
"P580": [],
"P5436": []
},
"references": [
{
"hash": "d103e3541cc531fa54adcaffebde6bef28d87d32",
"snaks": []
}
]
}
]
}
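As a rough sketch of how this nesting can be traversed, the Python snippet below flattens the claims of one entity into one row per statement. The function names and the row keys are hypothetical. The sample mirrors the structure shown above, with the same empty placeholders.

```python
from typing import Iterator, List, Tuple


def iter_statements(entity: dict) -> Iterator[Tuple[str, dict]]:
    """Yield (property ID, statement) pairs for every claim of an entity."""
    for property_id, statements in entity.get("claims", {}).items():
        for statement in statements:
            yield property_id, statement


def summarize_statements(entity: dict) -> List[dict]:
    """Build one flat row per statement, tolerating missing optional fields."""
    return [
        {
            "property": property_id,
            "rank": statement.get("rank"),
            # Qualifiers are themselves grouped by property ID.
            "qualifiers": sorted(statement.get("qualifiers", {})),
            "references": len(statement.get("references", [])),
        }
        for property_id, statement in iter_statements(entity)
    ]


nyc = {
    "claims": {
        "P17": [
            {
                "id": "q60$5083E43C-228B-4E3E-B82A-4CB20A22A3FB",
                "mainsnak": {},
                "type": "statement",
                "rank": "normal",
                "qualifiers": {"P580": [], "P5436": []},
                "references": [
                    {"hash": "d103e3541cc531fa54adcaffebde6bef28d87d32", "snaks": []}
                ],
            }
        ]
    }
}

rows = summarize_statements(nyc)
for row in rows:
    print(row)
```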
Sitelinks are given as records for each site identifier. Each of these records contains site, title, badges and optionally url.
"sitelinks": { "afwiki": { "site": "afwiki", "title": "New York Stad", "badges": [] },
"frwiki": { "site": "frwiki", "title": "New York City", "badges": [] },
"nlwiki": { "site": "nlwiki", "title": "New York City","badges": ["Q17437796"]},
"enwiki": { "site": "enwiki", "title": "New York City", "badges": [] },
"dewiki": { "site": "dewiki", "title": "New York City", "badges": ["Q17437798"]}
}
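The sitelink records can be queried in the same way. The Python sketch below, with the hypothetical helper names titles_by_site and badged_sites, lists the page title per site and keeps only the sites whose page carries at least one badge. The sample reuses three of the records above.

```python
from typing import Dict, List


def titles_by_site(entity: dict) -> Dict[str, str]:
    """Map each site identifier to the title of the linked page."""
    return {
        site: link["title"]
        for site, link in entity.get("sitelinks", {}).items()
    }


def badged_sites(entity: dict) -> Dict[str, List[str]]:
    """Map each site identifier to its badges, skipping sites with none."""
    return {
        site: link["badges"]
        for site, link in entity.get("sitelinks", {}).items()
        if link.get("badges")
    }


nyc = {
    "sitelinks": {
        "afwiki": {"site": "afwiki", "title": "New York Stad", "badges": []},
        "nlwiki": {"site": "nlwiki", "title": "New York City", "badges": ["Q17437796"]},
        "dewiki": {"site": "dewiki", "title": "New York City", "badges": ["Q17437798"]},
    }
}

print(titles_by_site(nyc)["afwiki"])  # prints: New York Stad
print(badged_sites(nyc))
```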
Hive Table
Hive tables have been created from this dataset. To optimize storage space and the reusability of the data in different contexts (BI, batch, streaming), ORC and Avro were chosen as the storage formats.
The Hive database datasets contains the following tables:
- wikimedia_orc
- wikimedia_avro
The Avro table has been created with the Avro schema described in the /data/wikimedia/wikimedia_schema_avro file.
Example query
0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z> select count(*) as tot from datasets.wikimedia_orc;
+-----------+
| tot |
+-----------+
| 17529258 |
+-----------+