Wikidata entity dumps

The Wikidata data set is a dump of all the entries of the Wikidata website. Wikidata serves as central storage for the “wiki” sites: Wikipedia, Wiktionary, etc. The data is semi-structured.

Description of the data set

The dumps published by the Wikimedia Foundation are organized by date and by file format. We have chosen the JSON dump below, from March 2015:

https://ia800305.us.archive.org/15/items/wikidata-json-20150323/20150323.json.gz

The uncompressed data takes about 40.8 GB in JSON format.
The storage path of the data in HDFS is /data/wikimedia/wikimedia.json.
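A file of this size is usually read as a stream rather than parsed in one call. The sketch below assumes the dump follows the usual layout of the Wikidata JSON dumps, that is, a single JSON array with one entity per line. The function names iter_entities and iter_dump are illustrative and not part of the dataset.

```python
import gzip
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity per line, assuming one JSON object per line."""
    for line in lines:
        # Each entity line ends with a comma, except the last one.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip blank lines and the enclosing array brackets
        yield json.loads(line)


def iter_dump(path: str) -> Iterator[dict]:
    """Stream entities from a gzip-compressed dump without loading it whole."""
    # Text mode decompresses and decodes on the fly.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        yield from iter_entities(handle)
```

With a local copy of the archive, `for entity in iter_dump("20150323.json.gz")` would then visit the entities one at a time with a small, constant memory footprint.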

Schema of the dataset

The data is semi-structured into items, each one having a label, a description and any number of aliases. Items are uniquely identified by a Q followed by a number. The MediaWiki site gives more information about the schema.

The top-level structure of an entity is described below.

{
  "id": "Q60",                         // String   // The canonical ID of the entity.
  "type": "item",                      // String   // The entity type identifier: "item" for data items, "property" for properties.
  "labels": {},                        // Struct   // The labels in different languages
  "descriptions": {},                  // Struct   // The descriptions in different languages
  "aliases": {},                       // Struct   // The aliases in different languages
  "claims": {},                        // Struct   // Any number of statements, grouped by property
  "sitelinks": {},                     // Struct   // Site links to pages on different sites describing the item
  "lastrevid": 195301613,              // Integer  // The JSON document's version (this is a MediaWiki revision ID)
  "modified": "2015-02-10T12:42:02Z"   // String   // The JSON document's publication date (this is a MediaWiki revision timestamp)
}

For example, the entry above corresponds to item Q60 on the Wikidata website.

Labels, descriptions and aliases are represented by the same basic data structure. For each language, there is a record with the fields language and value, as follows (aliases hold a list of such records per language):

  "labels": { "en": { "language": "en", "value": "New York City" },
              "ar": { "language": "ar", "value": "\u0645\u062f\u064a\u0646\u0629 \u0646\u064a\u0648 \u064a\u0648\u0631\u0643" }
            },
            
  "descriptions": { "en": { "language": "en", "value": "largest city in New York and the United States of America" },
                    "it": { "language": "it", "value": "citt\u00e0 degli Stati Uniti d'America" }
                  },
  "aliases":  { "en": [{ "language": "en", "value": "NYC" },
                      { "language": "en", "value": "New York" }
                     ],
              "fr": [{ "language": "fr", "value": "New York City"},
                     { "language": "fr", "value": "NYC" },
                     { "language": "fr", "value": "The City" },
                     { "language": "fr", "value": "City of New York" },
                     { "language": "fr", "value": "La grosse pomme" }
                    ]
              }
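As an illustration of how these per-language records can be read, here is a minimal sketch that looks up a label with a fallback language and collects the aliases for one language. The helper names get_label and get_aliases, and the choice of English as the fallback, are assumptions made for this example.

```python
from typing import List, Optional


def get_label(entity: dict, lang: str, fallback: str = "en") -> Optional[str]:
    """Return the label in `lang`, else in `fallback`, else None."""
    labels = entity.get("labels", {})
    for code in (lang, fallback):
        record = labels.get(code)
        if record is not None:
            return record["value"]
    return None


def get_aliases(entity: dict, lang: str) -> List[str]:
    """Return all alias strings for `lang` (empty list if there are none)."""
    return [record["value"] for record in entity.get("aliases", {}).get(lang, [])]


# Trimmed copy of the sample entity shown above.
entity = {
    "labels": {"en": {"language": "en", "value": "New York City"}},
    "aliases": {
        "en": [
            {"language": "en", "value": "NYC"},
            {"language": "en", "value": "New York"},
        ]
    },
}

print(get_label(entity, "it"))    # no Italian label here, falls back to English
print(get_aliases(entity, "en"))
```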

Claims are grouped by property. Each property maps to a list of statements, whose fields vary from one entity to another. The main ones are listed below.

"claims": {
    "P17": [
      {
        "id": "q60$5083E43C-228B-4E3E-B82A-4CB20A22A3FB",
        "mainsnak": {},
        "type": "statement",
        "rank": "normal",
        "qualifiers": {
          "P580": [],
          "P5436": []
        },
        "references": [
           {
             "hash": "d103e3541cc531fa54adcaffebde6bef28d87d32",
             "snaks": []
           }
         ]
      }
    ]
  }
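To make the grouping concrete, the following sketch walks the claims of one entity and summarizes each property. It relies only on the fields shown above (qualifiers and references), and the function name summarize_claims is hypothetical.

```python
def summarize_claims(entity: dict) -> dict:
    """Count statements, qualifier properties and references per property."""
    summary = {}
    for prop, statements in entity.get("claims", {}).items():
        # Qualifiers are themselves keyed by property ID.
        qualifier_props = {
            qualifier
            for statement in statements
            for qualifier in statement.get("qualifiers", {})
        }
        summary[prop] = {
            "statements": len(statements),
            "qualifier_properties": sorted(qualifier_props),
            "references": sum(len(s.get("references", [])) for s in statements),
        }
    return summary


# Trimmed copy of the sample claims shown above.
entity = {
    "claims": {
        "P17": [
            {
                "id": "q60$5083E43C-228B-4E3E-B82A-4CB20A22A3FB",
                "mainsnak": {},
                "type": "statement",
                "rank": "normal",
                "qualifiers": {"P580": [], "P5436": []},
                "references": [
                    {"hash": "d103e3541cc531fa54adcaffebde6bef28d87d32", "snaks": []}
                ],
            }
        ]
    }
}

print(summarize_claims(entity))
```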

Sitelinks are given as one record per site identifier. Each of these records contains site, title, badges and, optionally, url.

"sitelinks": { "afwiki": { "site": "afwiki", "title": "New York Stad", "badges": [] },
               "frwiki": { "site": "frwiki", "title": "New York City", "badges": [] },
               "nlwiki": { "site": "nlwiki", "title": "New York City", "badges": ["Q17437796"] },
               "enwiki": { "site": "enwiki", "title": "New York City", "badges": [] },
               "dewiki": { "site": "dewiki", "title": "New York City", "badges": ["Q17437798"] }
             }
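A short sketch of how the sitelink records might be consumed: it maps each site identifier to its page title and lists the sites whose link carries at least one badge. The helper names sitelink_titles and badged_sites are illustrative only.

```python
from typing import Dict, List


def sitelink_titles(entity: dict) -> Dict[str, str]:
    """Map each site identifier to the title of the linked page."""
    return {site: link["title"] for site, link in entity.get("sitelinks", {}).items()}


def badged_sites(entity: dict) -> List[str]:
    """Return the site identifiers whose sitelink has at least one badge."""
    return sorted(
        site
        for site, link in entity.get("sitelinks", {}).items()
        if link.get("badges")
    )


# Trimmed copy of the sample sitelinks shown above.
entity = {
    "sitelinks": {
        "afwiki": {"site": "afwiki", "title": "New York Stad", "badges": []},
        "nlwiki": {"site": "nlwiki", "title": "New York City", "badges": ["Q17437796"]},
        "dewiki": {"site": "dewiki", "title": "New York City", "badges": ["Q17437798"]},
    }
}

print(sitelink_titles(entity)["afwiki"])
print(badged_sites(entity))
```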

Hive Table

Hive tables have been created from this dataset. In order to optimize the storage space and the re-usability of the data in different contexts (BI, batch, streaming), ORC and Avro were chosen as the storage formats.

The Hive database datasets contains the following tables:

wikimedia_orc
wikimedia_avro

The Avro table has been created with the Avro schema described in the /data/wikimedia/wikimedia_schema_avro file.

Example query

0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z> select count(*) as tot from datasets.wikimedia_orc;

+-----------+
|    tot    |
+-----------+
| 17529258  |
+-----------+