Adaltas Cloud Academy

IMDb

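The IMDb interface provides open datasets about the titles listed on IMDb (movies, TV series, episodes, etc.). The tables reference each other through shared ids, so the set is best used as a relational dataset.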

Description of data

The datasets are organized in seven tables in TSV format. They are stored uncompressed in HDFS.

  • The raw size of all tables is about 4.1 GB
  • The storage path in HDFS is /data/imdb_data

Schema of imdb datasets

The schema of each dataset is described below.

Schema of imdb_title.basics.tsv

“Basics” contains the titles, such as movies, documentaries, TV series, episodes, etc.

Field | Type | Description
tconst | String | a unique id of the title
titletype | String | the type/format of the title (e.g. movie, short, TV series, TV episode, video, etc.)
primarytitle | String | the more popular title / the title used by the filmmakers on promotional materials at the point of release
originaltitle | String | the original title, in the original language
isadult | String | 0: non-adult title; 1: adult title
startyear | Short | the release year of the title. For a TV series, it is the series start year
endyear | Short | the TV series end year. ‘\N’ for all other title types
runtimeminutes | Short | primary runtime of the title, in minutes
genres | Array | the genres associated with the title (e.g. documentary, action, …)
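As an illustration of the schema above, here is a minimal Python sketch that turns one raw TSV line into a typed record. The sample row is hypothetical, and the comma as the separator inside the genres field is an assumption about the raw file, not something stated in this page.

```python
from typing import List, NamedTuple, Optional

NULL_MARKER = r"\N"  # marker used in the TSV files for a missing value


class TitleBasics(NamedTuple):
    tconst: str
    titletype: str
    primarytitle: str
    originaltitle: str
    isadult: str
    startyear: Optional[int]
    endyear: Optional[int]
    runtimeminutes: Optional[int]
    genres: List[str]


def to_int(value: str) -> Optional[int]:
    # Map the missing-value marker to None, otherwise parse the number.
    return None if value == NULL_MARKER else int(value)


def to_list(value: str) -> List[str]:
    # Assumption: array values are comma-separated in the raw file.
    return [] if value == NULL_MARKER else value.split(",")


def parse_title_basics(line: str) -> TitleBasics:
    # Fields are tab-separated and appear in the order of the schema table.
    f = line.rstrip("\n").split("\t")
    return TitleBasics(
        f[0], f[1], f[2], f[3], f[4],
        to_int(f[5]), to_int(f[6]), to_int(f[7]), to_list(f[8]),
    )


# Hypothetical row, for illustration only (not taken from the dataset).
sample = "tt0000000\tmovie\tExample Title\tExample Title\t0\t2001\t\\N\t95\tDrama,Action"
record = parse_title_basics(sample)
print(record)
```

Mapping ‘\N’ to None keeps missing years from being confused with real values when filtering or aggregating.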

Schema of imdb_name.basics.tsv

This dataset provides information about people in the entertainment business.

Field | Type | Description
nconst | String | a unique id of the name/person
primaryname | String | name by which the person is most often credited
birthyear | Short | in YYYY format
deathyear | Short | in YYYY format if applicable, else ‘\N’
primaryprofession | Array | the top-3 professions of the person
knownfortitles | Array | titles the person is known for

Schema of imdb_title.akas.tsv

This dataset lists the alternative names of titles in different regions and languages.

Field | Type | Description
titleId | String | a unique id for the title in question
ordering | Short | a number to uniquely identify rows for a given titleId (between 1 and 142)
title | String | the localized title
region | String | the region for this version of the title
language | String | the language of the title
types | String | an enumerated set of attributes for this alternative title, e.g. ‘dvdvideo’, ‘festival’, ‘tv’
attributes | String | additional, non-enumerated terms to describe this alternative title, e.g. ‘copyright title’, ‘Berlin film festival title’, ‘expansion title’, …
isOriginalTitle | Short | 0: not original title; 1: original title
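To show how the region and isOriginalTitle fields can be combined, the following Python sketch picks the title for a requested region and falls back to the original title when no localized version exists. The rows and the helper name title_for_region are hypothetical, and the ‘\N’ region on the original row is an assumption.

```python
# Hypothetical rows: (titleId, ordering, title, region, isOriginalTitle)
akas = [
    ("tt0000000", 1, "Original Name", r"\N", 1),
    ("tt0000000", 2, "Nom Localisé", "FR", 0),
    ("tt0000000", 3, "Localized Name", "US", 0),
]


def title_for_region(rows, title_id, region):
    # Keep only the rows of the requested title, in their "ordering" order.
    candidates = sorted((r for r in rows if r[0] == title_id), key=lambda r: r[1])
    # First choice: a version of the title for the requested region.
    for row in candidates:
        if row[3] == region:
            return row[2]
    # Fallback: the row flagged as the original title.
    for row in candidates:
        if row[4] == 1:
            return row[2]
    return None


print(title_for_region(akas, "tt0000000", "FR"))
print(title_for_region(akas, "tt0000000", "DE"))
```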

Schema of imdb_title.principals.tsv

“Principals” is a mapping of who participated in which title (movie / show).

Field | Type | Description
tconst | String | a unique id of the title
ordering | Short | a number to uniquely identify rows for a given tconst
nconst | String | a unique id of the name/person
category | String | the category of job that person was in
job | String | the specific job title if applicable, else ‘\N’
characters | String | the name of the character played if applicable, else ‘\N’
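Because this table only holds ids, it is usually joined with the name table through nconst. A minimal Python sketch of that join, using hypothetical rows and a hypothetical helper credits_by_title:

```python
from collections import defaultdict

# Hypothetical rows: (tconst, ordering, nconst, category)
principals = [
    ("tt0000000", 2, "nm0000002", "director"),
    ("tt0000000", 1, "nm0000001", "actor"),
    ("tt0000009", 1, "nm0000001", "producer"),
]

# Hypothetical lookup built from the name table: nconst -> primaryname
names = {"nm0000001": "Person A", "nm0000002": "Person B"}


def credits_by_title(principal_rows, name_lookup):
    # Group the credits per title, ordered by the "ordering" column.
    grouped = defaultdict(list)
    for tconst, _ordering, nconst, category in sorted(
        principal_rows, key=lambda r: (r[0], r[1])
    ):
        # Fall back to the raw nconst when the person is not in the lookup.
        grouped[tconst].append((name_lookup.get(nconst, nconst), category))
    return dict(grouped)


credits = credits_by_title(principals, names)
print(credits)
```

The same person id can appear under several titles, which is why the lookup is keyed on nconst rather than stored per row.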

Schema of imdb_title.episode.tsv

This dataset contains the season and episode numbers for episodes of shows.

Field | Type | Description
tconst | String | a unique id of the title
parentTconst | String | the id of the parent TV series (foreign key to tconst in imdb_title_basics)
seasonNumber | Short | the season number the episode belongs to
episodeNumber | Short | the episode number within the season

Schema of imdb_title.ratings.tsv

This dataset contains the current rating and vote count for the titles.

Field | Type | Description
tconst | String | a unique id of the title
averagerating | Double | weighted average of all the individual user ratings
numvotes | Short | number of votes the title has received

Schema of imdb_title.crew.tsv

This dataset contains the director and writer information for all the titles in IMDb.

Field | Type | Description
tconst | String | a unique id of the title
director | Array | director(s) of the given title (array of nconsts)
writers | Array | writer(s) of the given title (array of nconsts)

Hive Tables

The Hive database “datasets” contains the following tables:

  • imdb_name_basics
  • imdb_title_akas
  • imdb_title_basics
  • imdb_title_crew
  • imdb_title_episode
  • imdb_title_principal
  • imdb_title_ratings

These are external tables that point to the original TSV files in /data/imdb_data.

Example query

0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z>select * from datasets.imdb_title_ratings order by averagerating desc limit 5;

+----------------------------+-----------------------------------+------------------------------+
| imdb_title_ratings.tconst  | imdb_title_ratings.averagerating  | imdb_title_ratings.numvotes  |
+----------------------------+-----------------------------------+------------------------------+
| tt0050536                  | 10.0                              | 6                            |
| tt9332910                  | 10.0                              | 6                            |
| tt0127236                  | 10.0                              | 7                            |
| tt0061689                  | 10.0                              | 15                           |
| tt0061857                  | 10.0                              | 6                            |
+----------------------------+-----------------------------------+------------------------------+
5 rows selected (7.117 seconds)
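All five rows returned above have a rating of 10.0 but only 6 to 15 votes, so ordering by averagerating alone mostly surfaces titles with very few votes. The following Python sketch applies a minimum vote count to the five rows of the output before ranking them. The helper name top_rated and the threshold of 7 are arbitrary choices for illustration.

```python
# Rows copied from the query output above: (tconst, averagerating, numvotes)
ratings = [
    ("tt0050536", 10.0, 6),
    ("tt9332910", 10.0, 6),
    ("tt0127236", 10.0, 7),
    ("tt0061689", 10.0, 15),
    ("tt0061857", 10.0, 6),
]


def top_rated(rows, min_votes, limit=5):
    # Drop titles with too few votes, then sort by rating and by vote count,
    # both in descending order.
    eligible = [r for r in rows if r[2] >= min_votes]
    eligible.sort(key=lambda r: (-r[1], -r[2]))
    return eligible[:limit]


top = top_rated(ratings, min_votes=7)
print(top)
```

In Hive, the equivalent filter would be a WHERE clause on numvotes placed before the ORDER BY.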