IMDb
The IMDb interface provides open datasets about IMDb titles such as movies and TV series. The tables reference each other through shared ids, so the data is best used as a relational dataset.
Description of data
The datasets are organized in seven tables in TSV format and are stored uncompressed in HDFS.
- The raw size of all tables is about 4.1 GB
- The storage path in HDFS is /data/imdb_data
Schema of imdb datasets
The schema of each dataset is described below.
Schema of imdb_title.basics.tsv
“Basics” contains the video titles, such as movies, documentaries, TV series and episodes.
Fields | Types | Descriptions |
---|---|---|
tconst | String | a unique id for each video title |
titletype | String | the type/format of the title (e.g. movie, short, TV series, TV episode, video, etc) |
primarytitle | String | the more popular title / the title used by the filmmakers on promotional materials at the point of release |
originaltitle | String | original title, in the original language |
isadult | String | 0: non-adult title; 1: adult title |
startyear | Short | represents the release year of a title. In the case of TV Series, it is the series start year |
endyear | Short | TV Series end year. ‘\N’ for all other title types |
runtimeminutes | Short | primary runtime of the title, in minutes |
genres | Array | the genres associated with the title (e.g. documentary, action, …) |
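As a minimal sketch of how a row of this file might be read outside Hive, the Python snippet below parses one tab-separated line, maps the ‘\N’ placeholder to None, and splits genres into a list. The column order, the comma separator inside genres, the helper name parse_title_basics and the sample row are all assumptions made for illustration.

```python
import csv
import io

# Column order assumed to follow the schema table (hypothetical helper).
COLUMNS = ["tconst", "titletype", "primarytitle", "originaltitle", "isadult",
           "startyear", "endyear", "runtimeminutes", "genres"]
INT_COLUMNS = {"startyear", "endyear", "runtimeminutes"}


def parse_title_basics(line):
    """Parse one TSV line into a dict, mapping the null placeholder to None."""
    # QUOTE_NONE keeps quote characters inside titles as plain text.
    reader = csv.reader(io.StringIO(line), delimiter="\t", quoting=csv.QUOTE_NONE)
    values = next(reader)
    row = {}
    for name, value in zip(COLUMNS, values):
        if value == r"\N":
            row[name] = None              # missing value
        elif name in INT_COLUMNS:
            row[name] = int(value)
        elif name == "genres":
            row[name] = value.split(",")  # assumed comma-separated list
        else:
            row[name] = value
    return row


# Illustrative sample row (assumed values).
sample = "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short"
parsed = parse_title_basics(sample)
print(parsed["endyear"], parsed["genres"])
```

The same null handling would apply to any other column that uses ‘\N’, such as deathyear in the name table.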
Schema of imdb_name.basics.tsv
This dataset provides information about people in the entertainment business.
Fields | Types | Descriptions |
---|---|---|
nconst | String | a unique id of the name/person |
primaryname | String | name by which the person is most often credited |
birthyear | Short | in YYYY format |
deathyear | Short | in YYYY format if applicable, else ‘\N’ |
primaryprofession | Array | the top-3 professions of the person |
knownfortitles | Array | titles the person is known for |
Schema of imdb_title.akas.tsv
This dataset lists the alternative names of titles in different languages.
Fields | Types | Descriptions |
---|---|---|
titleId | String | a unique id for the title in question |
ordering | Short | a number to uniquely identify rows for a given titleId (between 1 and 142) |
title | String | the localized title |
region | String | the region for this version of the title |
language | String | the language of the title |
types | String | an enumerated set of attributes for this alternative title, e.g. ‘dvdvideo’, ‘festival’, ‘tv’ |
attributes | String | additional, non-enumerated terms to describe this alternative title, e.g. ‘copyright title’, ‘Berlin film festival title’, ‘expansion title’, … |
isOriginalTitle | Short | 0: not original title; 1: original title |
Schema of imdb_title.principals.tsv
“Principals” is a mapping of who participated in which title (movie / show).
Fields | Types | Descriptions |
---|---|---|
tconst | String | a unique id of the title |
ordering | Short | a number to uniquely identify rows for a given tconst |
nconst | String | a unique id of the name/person |
category | String | the category of job that person was in |
job | String | the specific job title if applicable, else ‘\N’ |
characters | String | the name of the character played if applicable, else ‘\N’ |
Schema of imdb_title.episode.tsv
This dataset contains the season and episode numbers for episodes of shows.
Fields | Types | Descriptions |
---|---|---|
tconst | String | a unique id of the episode title |
parentTconst | String | the id of the parent series, referencing tconst in imdb_title_basics (foreign key) |
seasonNumber | Short | season number the episode belongs to |
episodeNumber | Short | episode number of the title within its season |
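Because parentTconst points back to a series in imdb_title_basics, episodes can be grouped per series. The sketch below does this in plain Python on a few invented rows. The ids, the helper name episodes_by_series, and the use of None for missing season or episode numbers are assumptions, not values taken from the dataset.

```python
from collections import defaultdict


def episodes_by_series(rows):
    """Group (tconst, parentTconst, seasonNumber, episodeNumber) rows by series."""
    grouped = defaultdict(list)
    for tconst, parent, season, episode in rows:
        grouped[parent].append((season, episode, tconst))
    # Sort by season, then episode; entries with unknown numbers (None) go last.
    for episodes in grouped.values():
        episodes.sort(key=lambda e: (e[0] is None, e[0] or 0, e[1] is None, e[1] or 0))
    return dict(grouped)


# Invented sample rows; None stands for a missing value.
rows = [
    ("tt9000003", "tt9000000", 1, 2),
    ("tt9000002", "tt9000000", 1, 1),
    ("tt9000004", "tt9000000", None, None),
    ("tt9000011", "tt9000010", 2, 1),
]
result = episodes_by_series(rows)
print(result["tt9000000"])
```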
Schema of imdb_title.ratings.tsv
This dataset contains the current rating and vote count for the titles.
Fields | Types | Descriptions |
---|---|---|
tconst | String | a unique id of the title |
averagerating | Double | weighted average of all the individual user ratings |
numvotes | Short | number of votes the title has received |
Schema of imdb_title.crew.tsv
This dataset contains the director and writer information for all the titles in IMDb.
Fields | Types | Descriptions |
---|---|---|
tconst | String | a unique id of the title |
director | Array | director(s) of the given title |
writers | Array | writer(s) of the given title |
Hive Tables
The Hive database datasets contains the following tables:
imdb_name_basics
imdb_title_akas
imdb_title_basics
imdb_title_crew
imdb_title_episode
imdb_title_principal
imdb_title_ratings
These tables are external and point to the original TSV files in /data/imdb_data.
Example query
0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z>select * from datasets.imdb_title_ratings order by averagerating desc limit 5;
+----------------------------+-----------------------------------+------------------------------+
| imdb_title_ratings.tconst | imdb_title_ratings.averagerating | imdb_title_ratings.numvotes |
+----------------------------+-----------------------------------+------------------------------+
| tt0050536 | 10.0 | 6 |
| tt9332910 | 10.0 | 6 |
| tt0127236 | 10.0 | 7 |
| tt0061689 | 10.0 | 15 |
| tt0061857 | 10.0 | 6 |
+----------------------------+-----------------------------------+------------------------------+
5 rows selected (7.117 seconds)
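The result above is dominated by titles rated 10.0 with only a handful of votes. One way to get a more meaningful ranking is to join the ratings with imdb_title_basics and keep only titles above a minimum vote count. The sketch below illustrates that join logic locally with Python's sqlite3 module on invented rows. The sample data, the MIN_VOTES threshold and the reduced column set are assumptions, and the query has not been run against the Hive tables.

```python
import sqlite3

# In-memory stand-ins for two of the Hive tables (reduced column set).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE imdb_title_basics (tconst TEXT PRIMARY KEY, primarytitle TEXT);
    CREATE TABLE imdb_title_ratings (tconst TEXT, averagerating REAL, numvotes INTEGER);
""")
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO imdb_title_basics VALUES (?, ?)",
    [("tt9000001", "Title A"), ("tt9000002", "Title B"), ("tt9000003", "Title C")],
)
conn.executemany(
    "INSERT INTO imdb_title_ratings VALUES (?, ?, ?)",
    [("tt9000001", 10.0, 6), ("tt9000002", 9.1, 25000), ("tt9000003", 8.4, 12000)],
)

MIN_VOTES = 1000  # hypothetical threshold

# Join ratings to titles, drop sparsely voted titles, rank by rating.
top = conn.execute(
    """
    SELECT b.primarytitle, r.averagerating, r.numvotes
    FROM imdb_title_ratings r
    JOIN imdb_title_basics b ON b.tconst = r.tconst
    WHERE r.numvotes >= ?
    ORDER BY r.averagerating DESC
    LIMIT 5
    """,
    (MIN_VOTES,),
).fetchall()
print(top)
```

A similar SELECT should be usable in Hive by prefixing the table names with datasets. and writing the threshold inline instead of the ? placeholder.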