IMDb

IMDb provides open datasets about its titles and the people involved in them. The tables reference each other through shared ids, so the data is best used as a relational dataset.

Description of data

The datasets are organized as seven tables in TSV format. They are stored uncompressed in HDFS.

The raw size of all tables is about 4.1 GB
The storage path in HDFS is /data/imdb_data

Schema of the IMDb datasets

The schema of each dataset is described below.

Schema of imdb_title.basics.tsv

“Basics” contains the video titles, such as movies, documentaries, TV series, episodes, etc.

Fields	Types	Descriptions
tconst	String	a unique id for each video title
titletype	String	the type/format of the title (e.g. movie, short, TV series, TV episode, video, etc)
primarytitle	String	the more popular title / the title used by the filmmakers on promotional materials at the point of release
originaltitle	String	original title, in the original language
isadult	String	0: non-adult title; 1: adult title
startyear	Short	represents the release year of a title. In the case of TV Series, it is the series start year
endyear	Short	TV Series end year. ‘\N’ for all other title types
runtimeminutes	Short	primary runtime of the title, in minutes
genres	Array	the genres associated with the title (e.g. documentary, action, …)
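
The conventions in this table (‘\N’ for missing values, an array-valued genres field) can be handled when reading the raw TSV directly. The following is a minimal sketch, not a reference loader: the sample row is made up, and it assumes that the file starts with a header row and that array fields are comma-separated in the raw file.

```python
import csv
import io

# Hypothetical sample mimicking imdb_title.basics.tsv (header row + one title).
SAMPLE = (
    "tconst\ttitletype\tprimarytitle\toriginaltitle\tisadult\t"
    "startyear\tendyear\truntimeminutes\tgenres\n"
    "tt0000001\tshort\tExample Title\tExample Title\t0\t"
    "1994\t\\N\t12\tDocumentary,Short\n"
)

INT_FIELDS = {"startyear", "endyear", "runtimeminutes"}


def parse_basics(stream):
    """Yield one dict per title, mapping the '\\N' marker to None."""
    # QUOTE_NONE keeps double quotes inside titles as ordinary characters.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        record = {}
        for field, value in row.items():
            if value == r"\N":
                record[field] = None          # missing value
            elif field in INT_FIELDS:
                record[field] = int(value)    # Short columns
            elif field == "genres":
                record[field] = value.split(",")  # assumed comma-separated
            else:
                record[field] = value
        yield record


titles = list(parse_basics(io.StringIO(SAMPLE)))
print(titles[0]["endyear"], titles[0]["genres"])
```

The same null handling applies to the other tables that use ‘\N’, such as deathyear in imdb_name.basics.tsv.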

Schema of imdb_name.basics.tsv

This dataset provides information about people in the entertainment business.

Fields	Types	Descriptions
nconst	String	a unique id of the name/person
primaryname	String	name by which the person is most often credited
birthyear	Short	in YYYY format
deathyear	Short	in YYYY format if applicable, else ‘\N’
primaryprofession	Array	the top-3 professions of the person
knownfortitles	Array	titles the person is known for

Schema of imdb_title.akas.tsv

This dataset lists the alternative names of titles in different languages.

Fields	Types	Descriptions
titleId	String	a unique id for the title in question
ordering	Short	a number to uniquely identify rows for a given titleId (between 1 and 142)
title	String	the localized title
region	String	the region for this version of the title
language	String	the language of the title
types	String	an enumerated set of attributes for this alternative title, e.g. ‘dvdvideo’, ‘festival’, ‘tv’
attributes	String	additional, non-enumerated terms describing this alternative title, e.g. ‘copyright title’, ‘Berlin film festival title’, ‘expansion title’, …
isOriginalTitle	Short	0: not original title; 1: original title

Schema of imdb_title.principals.tsv

“Principals” is a mapping of who participated in which title (movie / show).

Fields	Types	Descriptions
tconst	String	a unique id of the title
ordering	Short	a number to uniquely identify rows for a given tconst
nconst	String	a unique id of the name/person
category	String	the category of job that person was in
job	String	the specific job title if applicable, else ‘\N’
characters	String	the name of the character played if applicable, else ‘\N’
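
Since each principals row pairs a title id (tconst) with a person id (nconst), it can be resolved against the two “basics” tables. The sketch below illustrates that lookup on in-memory data; all ids, names and rows are hypothetical, and only four of the principals columns are kept for brevity.

```python
# Hypothetical sample rows; ids and names are made up for illustration.
TITLES = {"tt0000001": "Example Title"}  # tconst -> primarytitle
NAMES = {  # nconst -> primaryname
    "nm0000001": "Jane Doe",
    "nm0000002": "John Roe",
}
PRINCIPALS = [
    # (tconst, ordering, nconst, category)
    ("tt0000001", 2, "nm0000002", "actor"),
    ("tt0000001", 1, "nm0000001", "director"),
]


def credits_for(tconst):
    """Return (ordering, primaryname, category) tuples for one title."""
    rows = [p for p in PRINCIPALS if p[0] == tconst]
    rows.sort(key=lambda p: p[1])  # ordering is unique within a title
    return [
        (ordering, NAMES.get(nconst), category)
        for _, ordering, nconst, category in rows
    ]


print(TITLES["tt0000001"], credits_for("tt0000001"))
```

In Hive the same result would come from joining the principals table with the name and title tables on nconst and tconst.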

Schema of imdb_title.episode.tsv

This dataset contains the season and episode numbers for episodes of shows.

Fields	Types	Descriptions
tconst	String	a unique id of the title
parentTconst	String	the id of the parent TV series; references tconst in imdb_title_basics (foreign key)
seasonNumber	Short	the season number the episode belongs to
episodeNumber	Short	the episode number within that season

Schema of imdb_title.ratings.tsv

This dataset contains the current rating and vote count for the titles.

Fields	Types	Descriptions
tconst	String	a unique id of the title
averagerating	Double	weighted average of all the individual user ratings
numvotes	Short	number of votes the title has received

Schema of imdb_title.crew.tsv

This dataset contains the director and writer information for all the titles in IMDb.

Fields	Types	Descriptions
tconst	String	a unique id of the title
director	Array	director(s) of the given title (array of nconsts)
writers	Array	writer(s) of the given title (array of nconsts)

Hive Tables

The Hive database datasets contains the following tables:

imdb_name_basics
imdb_title_akas
imdb_title_basics
imdb_title_crew
imdb_title_episode
imdb_title_principal
imdb_title_ratings

These are external tables that point to the original TSV files in /data/imdb_data.

Example query

0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z> select * from datasets.imdb_title_ratings order by averagerating desc limit 5;

+----------------------------+-----------------------------------+------------------------------+
| imdb_title_ratings.tconst  | imdb_title_ratings.averagerating  | imdb_title_ratings.numvotes  |
+----------------------------+-----------------------------------+------------------------------+
| tt0050536                  | 10.0                              | 6                            |
| tt9332910                  | 10.0                              | 6                            |
| tt0127236                  | 10.0                              | 7                            |
| tt0061689                  | 10.0                              | 15                           |
| tt0061857                  | 10.0                              | 6                            |
+----------------------------+-----------------------------------+------------------------------+
5 rows selected (7.117 seconds)
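
The five titles returned above all have a perfect rating but very few votes, so ordering by averagerating alone mostly surfaces rarely rated titles. The sketch below replays the same ordering on the five returned rows and adds a vote threshold. The min_votes parameter and the tie-break on vote count are illustrative assumptions, not part of the original query.

```python
# Rows copied from the query output above: (tconst, averagerating, numvotes).
RATINGS = [
    ("tt0050536", 10.0, 6),
    ("tt9332910", 10.0, 6),
    ("tt0127236", 10.0, 7),
    ("tt0061689", 10.0, 15),
    ("tt0061857", 10.0, 6),
]


def top_rated(rows, limit=5, min_votes=0):
    """Return up to `limit` rows, best rating first, skipping rarely voted titles."""
    eligible = [row for row in rows if row[2] >= min_votes]
    # Sort by rating, then by vote count, both descending.
    eligible.sort(key=lambda row: (row[1], row[2]), reverse=True)
    return eligible[:limit]


for tconst, rating, votes in top_rated(RATINGS, min_votes=7):
    print(tconst, rating, votes)
```

In HiveQL the equivalent change would be a where clause on numvotes placed before the order by.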