IMDb
The IMDb interface provides open datasets about IMDb titles. The tables reference each other through shared identifiers, so the data is best used as a relational dataset.
Description of data
The datasets are organized in seven tables in TSV format. They are stored uncompressed in HDFS.
- The raw size of all tables is about 4.1 GB
- The storage path in HDFS is /data/imdb_data
Schema of imdb datasets
This section describes the schema of each dataset.
Schema of imdb_title.basics.tsv
“Basics” contains the video titles, such as movies, documentaries, TV series, episodes, etc.
| Fields | Types | Descriptions | 
|---|---|---|
| tconst | String | a unique id for each video title | 
| titletype | String | the type/format of the title (e.g. movie, short, TV series, TV episode, video, etc) | 
| primarytitle | String | the more popular title / the title used by the filmmakers on promotional materials at the point of release | 
| originaltitle | String | original title, in the original language | 
| isadult | String | 0: non-adult title; 1: adult title | 
| startyear | Short | represents the release year of a title. In the case of TV Series, it is the series start year | 
| endyear | Short | TV Series end year. ‘\N’ for all other title types | 
| runtimeminutes | Short | primary runtime of the title, in minutes | 
| genres | Array | the genres associated with the title (e.g. documentary, action, …) | 
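
As a minimal sketch of how a row of this file could be decoded outside Hive, the Python snippet below reads tab-separated lines, maps the ‘\N’ placeholder to None, casts the year and runtime columns to integers, and splits genres into a list. The sample rows and the `parse_title_row` helper are hypothetical, and the comma separator inside the genres column is an assumption about the raw file layout.

```python
import csv
import io

# Hypothetical sample in the layout described above (header + two rows).
SAMPLE = (
    "tconst\ttitletype\tprimarytitle\toriginaltitle\tisadult\t"
    "startyear\tendyear\truntimeminutes\tgenres\n"
    "tt0000001\tshort\tExample Short\tExample Short\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000002\ttvSeries\tExample Show\tExample Show\t0\t2001\t2005\t\\N\t\\N\n"
)

NULL = r"\N"  # placeholder used in the TSV files for missing values


def to_int(value):
    """Return None for the null placeholder, otherwise the integer value."""
    return None if value == NULL else int(value)


def parse_title_row(row):
    """Convert one raw TSV row (dict of strings) into typed Python values."""
    return {
        "tconst": row["tconst"],
        "titletype": row["titletype"],
        "primarytitle": row["primarytitle"],
        "startyear": to_int(row["startyear"]),
        "endyear": to_int(row["endyear"]),
        "runtimeminutes": to_int(row["runtimeminutes"]),
        # Assumption: array columns are comma-separated in the raw file.
        "genres": [] if row["genres"] == NULL else row["genres"].split(","),
    }


# QUOTE_NONE keeps quote characters inside titles as plain text.
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
titles = [parse_title_row(r) for r in reader]
for t in titles:
    print(t["tconst"], t["startyear"], t["endyear"], t["genres"])
```

The same null handling applies to every column documented with ‘\N’ in the tables of this page.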
Schema of imdb_name.basics.tsv
This dataset provides information about people in the entertainment business.
| Fields | Types | Descriptions | 
|---|---|---|
| nconst | String | a unique id of the name/person | 
| primaryname | String | name by which the person is most often credited | 
| birthyear | Short | in YYYY format | 
| deathyear | Short | in YYYY format if applicable, else ‘\N’ | 
| primaryprofession | Array | the top-3 professions of the person | 
| knownfortitles | Array | titles the person is known for | 
Schema of imdb_title.akas.tsv
This dataset lists the alternative names of titles in different languages.
| Fields | Types | Descriptions | 
|---|---|---|
| titleId | String | a unique id for the title in question | 
| ordering | Short | a number to uniquely identify rows for a given titleId (between 1 and 142) | 
| title | String | the localized title | 
| region | String | the region for this version of the title | 
| language | String | the language of the title | 
| types | String | an enumerated set of attributes for this alternative title, e.g. ‘dvdvideo’, ‘festival’, ‘tv’ | 
| attributes | String | additional, non-enumerated terms describing this alternative title, e.g. ‘copyright title’, ‘Berlin film festival title’, ‘expansion title’, … | 
| isOriginalTitle | Short | 0: not original title; 1: original title | 
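
To illustrate how these columns fit together, here is a small Python sketch that groups localized titles by region and picks the row flagged as the original title. The sample rows, the ‘\N’ values in the region column, and both helper functions are hypothetical and only mirror the schema above.

```python
from collections import defaultdict

NULL = r"\N"

# Hypothetical rows:
# (titleId, ordering, title, region, language, types, attributes, isOriginalTitle)
AKAS = [
    ("tt0000001", 1, "Example Title", NULL, NULL, "original", NULL, 1),
    ("tt0000001", 2, "Titre d'exemple", "FR", "fr", "imdbDisplay", NULL, 0),
    ("tt0000001", 3, "Beispieltitel", "DE", "de", "imdbDisplay", NULL, 0),
]


def titles_by_region(rows):
    """Map titleId -> {region: localized title}, skipping rows without a region."""
    result = defaultdict(dict)
    for title_id, _ordering, title, region, *_rest in rows:
        if region != NULL:
            result[title_id][region] = title
    return dict(result)


def original_titles(rows):
    """Map titleId -> title for the rows where isOriginalTitle is 1."""
    return {row[0]: row[2] for row in rows if row[7] == 1}


print(titles_by_region(AKAS))
print(original_titles(AKAS))
```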
Schema of imdb_title.principals.tsv
“Principals” is a mapping of who participated in which title (movie / show).
| Fields | Types | Descriptions | 
|---|---|---|
| tconst | String | a unique id of the title | 
| ordering | Short | a number to uniquely identify rows for a given tconst | 
| nconst | String | a unique id of the name/person | 
| category | String | the category of job that person was in | 
| job | String | the specific job title if applicable, else ‘\N’ | 
| characters | String | the name of the character played if applicable, else ‘\N’ | 
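
Because “Principals” only stores identifiers, it is usually combined with the name dataset. The Python sketch below shows one possible way to resolve nconst values to names and list the credits of a title in ordering sequence. The sample data and the `credits_for` helper are hypothetical.

```python
# Hypothetical extract of imdb_name.basics.tsv: nconst -> primaryname
NAMES = {
    "nm0000001": "Person A",
    "nm0000002": "Person B",
}

# Hypothetical extract of imdb_title.principals.tsv:
# (tconst, ordering, nconst, category)
PRINCIPALS = [
    ("tt0000001", 2, "nm0000002", "director"),
    ("tt0000001", 1, "nm0000001", "actor"),
    ("tt0000002", 1, "nm0000001", "producer"),
]


def credits_for(tconst, principals, names):
    """Return (name, category) pairs for one title, sorted by ordering."""
    rows = sorted(
        (p for p in principals if p[0] == tconst),
        key=lambda p: p[1],
    )
    # Fall back to the raw nconst when the person is missing from NAMES.
    return [(names.get(nconst, nconst), category) for _t, _o, nconst, category in rows]


print(credits_for("tt0000001", PRINCIPALS, NAMES))
```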
Schema of imdb_title.episode.tsv
This dataset contains the season and episode numbers for episodes of shows.
| Fields | Types | Descriptions | 
|---|---|---|
| tconst | String | a unique id of the title | 
| parentTconst | String | id of the parent TV series (foreign key to tconst in imdb_title_basics) | 
| seasonNumber | Short | season number the episode belongs to | 
| episodeNumber | Short | episode number within the season | 
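
A typical use of this table is counting episodes per season of one series. The following Python sketch does this on hypothetical rows; the assumption that seasonNumber can be missing (shown here as None) is not stated in the schema above and is included only as a defensive measure.

```python
from collections import Counter

# Hypothetical rows: (tconst, parentTconst, seasonNumber, episodeNumber)
# None stands for a missing value (assumption: encoded as '\N' in the raw file).
EPISODES = [
    ("tt1000001", "tt0000002", 1, 1),
    ("tt1000002", "tt0000002", 1, 2),
    ("tt1000003", "tt0000002", 2, 1),
    ("tt1000004", "tt0000009", None, None),
]


def episodes_per_season(parent, rows):
    """Return {seasonNumber: episode count} for the series with id `parent`."""
    counts = Counter(
        season
        for _tconst, parent_tconst, season, _episode in rows
        if parent_tconst == parent and season is not None
    )
    return dict(counts)


print(episodes_per_season("tt0000002", EPISODES))
```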
Schema of imdb_title.ratings.tsv
This one contains the current rating and vote count for the titles.
| Fields | Types | Descriptions | 
|---|---|---|
| tconst | String | a unique id of the title | 
| averagerating | Double | weighted average of all the individual user ratings | 
| numvotes | Short | number of votes the title has received | 
Schema of imdb_title.crew.tsv
This dataset contains the director and writer information for all the titles in IMDb.
| Fields | Types | Descriptions | 
|---|---|---|
| tconst | String | a unique id of the title | 
| director | Array | director(s) of the given title | 
| writers | Array | writer(s) of the given title | 
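
As an illustration of working with the two array columns, this Python sketch finds the titles where the same person appears as both director and writer. The sample rows and helper names are hypothetical, and both the comma-separated encoding and the use of ‘\N’ for empty arrays are assumptions about the raw file.

```python
NULL = r"\N"

# Hypothetical rows: (tconst, director, writers) as raw TSV strings
CREW = [
    ("tt0000001", "nm0000001", "nm0000001,nm0000002"),
    ("tt0000002", "nm0000003", NULL),
    ("tt0000003", NULL, "nm0000002"),
]


def split_ids(value):
    """Turn a raw array column into a set of nconst ids."""
    return set() if value == NULL else set(value.split(","))


def writer_directors(rows):
    """Map tconst -> sorted ids of people credited as both director and writer."""
    result = {}
    for tconst, director, writers in rows:
        both = split_ids(director) & split_ids(writers)
        if both:
            result[tconst] = sorted(both)
    return result


print(writer_directors(CREW))
```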
Hive Tables
The Hive database datasets contains the following tables:
- imdb_name_basics
- imdb_title_akas
- imdb_title_basics
- imdb_title_crew
- imdb_title_episode
- imdb_title_principal
- imdb_title_ratings
These are external tables that point to the original TSV files in /data/imdb_data.
Example query
0: jdbc:hive2://zoo-1.au.adaltas.cloud:2181,z>select * from datasets.imdb_title_ratings order by averagerating desc limit 5;
+----------------------------+-----------------------------------+------------------------------+
| imdb_title_ratings.tconst  | imdb_title_ratings.averagerating  | imdb_title_ratings.numvotes  |
+----------------------------+-----------------------------------+------------------------------+
| tt0050536                  | 10.0                              | 6                            |
| tt9332910                  | 10.0                              | 6                            |
| tt0127236                  | 10.0                              | 7                            |
| tt0061689                  | 10.0                              | 15                           |
| tt0061857                  | 10.0                              | 6                            |
+----------------------------+-----------------------------------+------------------------------+
5 rows selected (7.117 seconds)
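
All five rows above have a perfect rating based on very few votes, so a ranking by averagerating alone is easily dominated by obscure titles. One common remedy is to require a minimum number of votes before ranking. The Python sketch below applies that idea to the rows returned by the example query; the `top_rated` helper and the threshold of 10 votes are arbitrary choices for illustration.

```python
# The five rows returned by the example query: (tconst, averagerating, numvotes)
ROWS = [
    ("tt0050536", 10.0, 6),
    ("tt9332910", 10.0, 6),
    ("tt0127236", 10.0, 7),
    ("tt0061689", 10.0, 15),
    ("tt0061857", 10.0, 6),
]


def top_rated(rows, min_votes, limit=5):
    """Keep rows with at least `min_votes` votes, best rating first.

    Ties on rating are broken by the higher vote count.
    """
    kept = [r for r in rows if r[2] >= min_votes]
    kept.sort(key=lambda r: (-r[1], -r[2]))
    return kept[:limit]


print(top_rated(ROWS, min_votes=10))
# prints [('tt0061689', 10.0, 15)]
```

In Hive, the equivalent filter would be a condition on numvotes added to the query before the ordering step.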