Common Datasets
The cluster provides a few common datasets in HDFS, ready for consumption. These pages describe those datasets.
What kind of data?
Different types of data have been loaded to cover a variety of datasets and to be reusable across different kinds of analysis. We focus here on three dataset types:
- Relational dataset: A collection of information organized with defined relationships so that items can be linked easily. It is the base structure of a relational database management system (RDBMS). For this type, we used the IMDb datasets.
- Time series dataset: A series of timestamped data points, mainly used for forecasting. For this type, we chose the well-known NYC Taxi dataset.
- Semi-structured dataset: A form of structured data in which the information is not stored in a tabular way but is self-describing. For this type, we imported the Wikidata dataset.
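To make the three data types more concrete, here is a minimal Python sketch using only the standard library. All records, field names (`title_id`, `pickup`, `fare`, `labels`, `aliases`) and values are invented for illustration; they are not taken from the actual IMDb, NYC Taxi, or Wikidata files, whose real schemas are given on each dataset's documentation page.

```python
import json
from datetime import datetime

# Relational: two "tables" linked through a shared key (toy rows, hypothetical fields).
titles = [
    {"title_id": "t1", "name": "Example Movie"},
    {"title_id": "t2", "name": "Another Movie"},
]
ratings = [
    {"title_id": "t1", "rating": 7.5},
    {"title_id": "t2", "rating": 6.0},
]
ratings_by_id = {r["title_id"]: r["rating"] for r in ratings}
# Join the two tables on title_id.
joined = [(t["name"], ratings_by_id[t["title_id"]]) for t in titles]

# Time series: timestamped records, ordered and aggregated by time (toy rows).
trips = [
    {"pickup": "2020-01-01T08:30:00", "fare": 12.5},
    {"pickup": "2020-01-01T08:05:00", "fare": 8.0},
    {"pickup": "2020-01-01T09:10:00", "fare": 20.0},
]
trips_sorted = sorted(trips, key=lambda t: datetime.fromisoformat(t["pickup"]))
fares_per_hour = {}
for trip in trips_sorted:
    hour = datetime.fromisoformat(trip["pickup"]).hour
    fares_per_hour[hour] = fares_per_hour.get(hour, 0.0) + trip["fare"]

# Semi-structured: a self-describing JSON document with nested, optional fields.
raw = '{"id": "X1", "labels": {"en": "example"}, "aliases": {"en": ["sample", "specimen"]}}'
entity = json.loads(raw)
english_label = entity["labels"]["en"]
# Fields may be absent from a given document, so read them defensively.
alias_count = len(entity.get("aliases", {}).get("en", []))
```

The relational part relies on a key shared between tables, the time series part relies on ordering and grouping by timestamp, and the semi-structured part reads nested fields whose presence can vary from one document to the next.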
Sources of Data
Please navigate through each dataset’s documentation page to get more details about the data size and schemas.
IMDb datasets
The IMDb interface provides open data consisting of multiple tables in compressed format. All the tables are described in detail here.
NYC Taxi datasets
The New York City Taxi & Limousine Commission provides a public dataset of trip records covering many taxi rides. All information related to the datasets we isolated is described here.
Wikidata datasets
The Wikimedia Foundation provides multiple datasets in JSON format. We imported the Wikidata dataset, which is described here.