Common Datasets

The cluster provides a few common datasets in HDFS, ready for use. These pages describe them.

What kind of data?

Several types of data have been loaded so that the datasets cover a range of structures and can be reused in different kinds of analysis. We focus here on three dataset types:

Relational dataset: A collection of information organized with defined relationships so that records can be linked easily. It is the basic structure of a relational database (RDBMS). In our case, the IMDb datasets have been used.
Time series dataset: A series of timestamped data, mainly used in forecasting. For this type, we chose the well-known NYC Taxi dataset.
Semi-structured dataset: Semi-structured data is a form of structured data in which the information is not stored in tabular form but is self-describing. We chose to import the Wikidata datasets.
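
To make the difference between the three dataset types listed above more concrete, here is a minimal Python sketch using only the standard library. All records, field names, and helper functions in it are invented for illustration; they are not the actual schemas of the IMDb, NYC Taxi, or Wikidata datasets, which are documented on each dataset's own page.

```python
import json
from datetime import datetime

# Relational: rows live in separate tables and are linked by a shared key.
# (Hypothetical columns, not the real IMDb schema.)
titles = [{"title_id": "t1", "name": "Example Movie"}]
ratings = [{"title_id": "t1", "rating": 7.5}]


def join_on_key(left, right, key):
    """Join two lists of row dicts on a shared key column."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


# Time series: each value is attached to a timestamp and ordered by it.
# (Hypothetical hourly trip counts, deliberately out of order.)
trips = [
    ("2020-01-01T01:00:00", 12),
    ("2020-01-01T00:00:00", 30),
]


def sort_by_time(points):
    """Return the (timestamp, value) pairs in chronological order."""
    return sorted(points, key=lambda p: datetime.fromisoformat(p[0]))


# Semi-structured: each record carries its own field names, and records
# may be nested rather than flat. (Hypothetical JSON record.)
record_json = '{"id": "Q1", "labels": {"en": "example"}, "claims": []}'


def field_names(raw):
    """List the top-level fields that a JSON record describes itself with."""
    return sorted(json.loads(raw).keys())
```

In this sketch, the relational rows only become useful once joined on their key, the time series is only meaningful once ordered by timestamp, and the JSON record can be inspected without knowing its schema in advance.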

Sources of Data

See each dataset’s documentation page for details about data size and schemas.

IMDb datasets

IMDb provides open data consisting of multiple tables in compressed format. All the tables are described in detail here.

NYC Taxi datasets

The New York City Taxi & Limousine Commission provides a public dataset of trip records for many taxi rides. All information about the selected datasets is described here.

Wikipedia datasets

The Wikimedia Foundation provides multiple datasets in JSON format. We imported the Wikidata dataset, which is described here.

Additional information

Awesome Public Datasets