Investigate TBDb Movie Data
Table of Contents
Introduction
TMDB is a movie metadata database, it contains the details of 5000 hollywood Moves. The data set I have picked from Kaggle and it is cleaned from the original dataset.
I would like to pose the following questions and Investigate them.
- Does the runtime of the movie is a deciding factor on the popularity it receives?
- Which gener of movies have the highest revenue collections on average?
- Are action movies more popular than romance movies?
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline
Data Wrangling
In this section of we will collect the data from https://www.kaggle.com/tmdb/tmdb-movie-metadata and load it and make it use able and according to the need to answer our question
General Properties
# Load your data and print out a few lines. Perform operations to inspect data
# types and look for instances of missing or possibly errant data.
df = pd.read_csv('tmdb-movies.csv')
df.head(3)
id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
3 rows × 21 columns
# check the distribution of various features
df.describe()
id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|---|
count | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 |
mean | 66064.177434 | 0.646441 | 1.462570e+07 | 3.982332e+07 | 102.070863 | 217.389748 | 5.974922 | 2001.322658 | 1.755104e+07 | 5.136436e+07 |
std | 92130.136561 | 1.000185 | 3.091321e+07 | 1.170035e+08 | 31.381405 | 575.619058 | 0.935142 | 12.812941 | 3.430616e+07 | 1.446325e+08 |
min | 5.000000 | 0.000065 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 10.000000 | 1.500000 | 1960.000000 | 0.000000e+00 | 0.000000e+00 |
25% | 10596.250000 | 0.207583 | 0.000000e+00 | 0.000000e+00 | 90.000000 | 17.000000 | 5.400000 | 1995.000000 | 0.000000e+00 | 0.000000e+00 |
50% | 20669.000000 | 0.383856 | 0.000000e+00 | 0.000000e+00 | 99.000000 | 38.000000 | 6.000000 | 2006.000000 | 0.000000e+00 | 0.000000e+00 |
75% | 75610.000000 | 0.713817 | 1.500000e+07 | 2.400000e+07 | 111.000000 | 145.750000 | 6.600000 | 2011.000000 | 2.085325e+07 | 3.369710e+07 |
max | 417859.000000 | 32.985763 | 4.250000e+08 | 2.781506e+09 | 900.000000 | 9767.000000 | 9.200000 | 2015.000000 | 4.250000e+08 | 2.827124e+09 |
Data Cleaning
For cleaning the data I have done the following steps.
- Remove the irrevelent columns form the dataset which won’t be used for the analysis. Namely, imdb_id’, ‘homepage’, ‘keywords’, ‘overview’, ‘director’,’tagline’, ‘cast are dropped form the dataset.
- Find and drop duplicate rows.
- Fill nan values of production_companies and genre with ‘Unknown’ keyword.
- Extract Year from relesed_year and convert it to int.
- Replace the cells with 0 values in budget, revenue, budget_adj, revenue_adj with their mean values.
- Remove outliers from runtime. Remove rows with runtime > 300 minutes as movies with more than 300 min runtime are few and It was making the distribution very skew.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id 10866 non-null int64
imdb_id 10856 non-null object
popularity 10866 non-null float64
budget 10866 non-null int64
revenue 10866 non-null int64
original_title 10866 non-null object
cast 10790 non-null object
homepage 2936 non-null object
director 10822 non-null object
tagline 8042 non-null object
keywords 9373 non-null object
overview 10862 non-null object
runtime 10866 non-null int64
genres 10843 non-null object
production_companies 9836 non-null object
release_date 10866 non-null object
vote_count 10866 non-null int64
vote_average 10866 non-null float64
release_year 10866 non-null int64
budget_adj 10866 non-null float64
revenue_adj 10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
# remove the columns which are not needed
df.drop(['imdb_id', 'homepage', 'keywords', 'overview', 'director','tagline', 'cast'], axis=1, inplace=True)
df.head(1)
id | popularity | budget | revenue | original_title | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 135397 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
# find the duplicates
df.duplicated().sum()
1
Lucikly, we have only one duplicate. We will simply drop it
# drop duplicates
df.drop_duplicates(inplace=True)
#check if any duplicate
df.duplicated().sum()
0
# check for nan values
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 14 columns):
id 10865 non-null int64
popularity 10865 non-null float64
budget 10865 non-null int64
revenue 10865 non-null int64
original_title 10865 non-null object
runtime 10865 non-null int64
genres 10842 non-null object
production_companies 9835 non-null object
release_date 10865 non-null object
vote_count 10865 non-null int64
vote_average 10865 non-null float64
release_year 10865 non-null int64
budget_adj 10865 non-null float64
revenue_adj 10865 non-null float64
dtypes: float64(4), int64(6), object(4)
memory usage: 1.2+ MB
- Here, we can observe that there are unfilled/NaN values in production_companies, genres
- Let’s fill these values with ‘Unknown’
- Only a few (13) movies will have ‘Unknown’ generes
# fill values to nan cells
df.production_companies.fillna('Unknown', inplace=True)
df.genres.fillna('Unknown', inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 14 columns):
id 10865 non-null int64
popularity 10865 non-null float64
budget 10865 non-null int64
revenue 10865 non-null int64
original_title 10865 non-null object
runtime 10865 non-null int64
genres 10865 non-null object
production_companies 10865 non-null object
release_date 10865 non-null object
vote_count 10865 non-null int64
vote_average 10865 non-null float64
release_year 10865 non-null int64
budget_adj 10865 non-null float64
revenue_adj 10865 non-null float64
dtypes: float64(4), int64(6), object(4)
memory usage: 1.2+ MB
Now, all the values filled up and irrevelant colums removed. Now lets fix the data types
df.release_date.isna().sum()
0
# Extract year from the release_date using datetime and convert it to int
df['release_date'] = pd.to_datetime(df['release_date']).dt.year.astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 14 columns):
id 10865 non-null int64
popularity 10865 non-null float64
budget 10865 non-null int64
revenue 10865 non-null int64
original_title 10865 non-null object
runtime 10865 non-null int64
genres 10865 non-null object
production_companies 10865 non-null object
release_date 10865 non-null int64
vote_count 10865 non-null int64
vote_average 10865 non-null float64
release_year 10865 non-null int64
budget_adj 10865 non-null float64
revenue_adj 10865 non-null float64
dtypes: float64(4), int64(7), object(3)
memory usage: 1.2+ MB
# check results
df.head(2)
id | popularity | budget | revenue | original_title | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 135397 | 32.985763 | 150000000 | 1513528810 | Jurassic World | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
1 | 76341 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 2015 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
# lets see if everything is correct
df.describe()
id | popularity | budget | revenue | runtime | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 10865.000000 | 10865.000000 | 1.086500e+04 | 1.086500e+04 | 10865.000000 | 10865.000000 | 10865.000000 | 10865.000000 | 10865.000000 | 1.086500e+04 | 1.086500e+04 |
mean | 66066.374413 | 0.646446 | 1.462429e+07 | 3.982690e+07 | 102.071790 | 2004.653659 | 217.399632 | 5.975012 | 2001.321859 | 1.754989e+07 | 5.136900e+07 |
std | 92134.091971 | 1.000231 | 3.091428e+07 | 1.170083e+08 | 31.382701 | 15.555301 | 575.644627 | 0.935138 | 12.813260 | 3.430753e+07 | 1.446383e+08 |
min | 5.000000 | 0.000065 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 1970.000000 | 10.000000 | 1.500000 | 1960.000000 | 0.000000e+00 | 0.000000e+00 |
25% | 10596.000000 | 0.207575 | 0.000000e+00 | 0.000000e+00 | 90.000000 | 1997.000000 | 17.000000 | 5.400000 | 1995.000000 | 0.000000e+00 | 0.000000e+00 |
50% | 20662.000000 | 0.383831 | 0.000000e+00 | 0.000000e+00 | 99.000000 | 2007.000000 | 38.000000 | 6.000000 | 2006.000000 | 0.000000e+00 | 0.000000e+00 |
75% | 75612.000000 | 0.713857 | 1.500000e+07 | 2.400000e+07 | 111.000000 | 2012.000000 | 146.000000 | 6.600000 | 2011.000000 | 2.085325e+07 | 3.370173e+07 |
max | 417859.000000 | 32.985763 | 4.250000e+08 | 2.781506e+09 | 900.000000 | 2069.000000 | 9767.000000 | 9.200000 | 2015.000000 | 4.250000e+08 | 2.827124e+09 |
We can see that max of release date is 2069. Which can be possible, lets see which movie is this. This part is not related to the analysis but I am curious and want to check
df.query('release_date == 2069').id.count()
31
Finding Limitations
- As here we will observe that the some movies have ZERO budget or revenue or runtime
Let’s find how much of them have this. Whether it is significant or not
# minimum budget
df.budget.min()
0
df.query('budget == 0').id.count()
5696
# minimum revenue
df.revenue.min()
0
df.query('revenue == 0').id.count()
6016
# minimum runtime
df.runtime.min()
0
df.query('runtime == 0').id.count()
31
As we can clearly see that almost half of the dataset has 0 revenue, or budget. It is one of the limitation of our dataset
Some will also say that all the cast were working for free in these movies and revenue generated was donated, just for fun.
So, I will replace these zero values in the column budget, adj_budget, revenue, adj_revenue with the Mean of the respective column.
I would choosen Median because some popular movies will have way more revenue, budget then normal. Therefore, mean is not a good criteria. Since, median will be also 0. so lets take mean
# replacing 0's with median
df.budget.replace(0,df.budget.mean(), inplace=True)
df.budget_adj.replace(0,df.budget_adj.mean(), inplace=True)
df.revenue.replace(0,df.revenue.mean(), inplace=True)
df.revenue_adj.replace(0,df.revenue_adj.mean(), inplace=True)
Removing Outliers
Let’s remove movies with runtime higher than 300 minutes. They are few and only few watches them so, why to keep them and let our distribution suffer. And I am considering them as outliers here
df.query('runtime > 300').id.count()
22
df.drop(df.query('runtime > 300').index, inplace=True)
# Check if any movie > 320 runtime
df.query('runtime > 300').id.count()
0
# save our data
df.to_csv('tmdb-movies_cleaned.csv', index=False)
Exploratory Data Analysis
The the exploratory part I will be exploring the dataset in search of finding the answers to the below questions.
Does the runtime of the movie is a deciding factor on the popularity it receives?
Lets check the plots for our numerical data
And first we will plot histogram for movie runtime
df = pd.read_csv('tmdb-movies_cleaned.csv')
# plot histogram for runtime
plt.title('Runtime Distribution')
plt.xlabel("Runtime")
plt.ylabel('Movies count')
df.runtime.hist(figsize=(10,10));
Now lets plot a scatter plot for popularity and runtime, from where we’ll be able to see it the runtime affetct the popularity
print(df.runtime.mean())
print(df.runtime.median())
101.293830121
99.0
df.plot.scatter(x='runtime', y='popularity', title='Runtime vs Vote_average', color='b')
<matplotlib.axes._subplots.AxesSubplot at 0x7f9eb3d33d68>
It is almost like a normal distribution.
We can observe that the most of the popular movies are between 100 to 150 minutes, with a peak on 125. Popularity decreases as the movie time decreases increases around the peak. While the mean of the Movie runtime is 101 minutes.
So, Yes runtime of the movie is a deciding factor in the popularity of the movie.
Which gener of movies have the highest revenue collections on average?
df.revenue.describe()
count 1.084300e+04
mean 6.191971e+07
std 1.111289e+08
min 2.000000e+00
25% 3.982690e+07
50% 3.982690e+07
75% 3.982690e+07
max 2.781506e+09
Name: revenue, dtype: float64
# lets check the behaviour of revinue year wise
plt.title('Revenue v/s Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
df.groupby('release_year').revenue.mean().plot(color='g')
<matplotlib.axes._subplots.AxesSubplot at 0x7f9eb3c872e8>
We can clearly see that the revenue is increasing each year except for once between 2000 and 2010. That could be because of the 2008 Recessions but it is out of contex and we should not talk about it here.
So, the revenue is increasing Lets see which genre of movie is talking all the money.
def getgenresList():
# Separate the genre based on the pipe (|).
genresList = df['genres'].str.split('|')
# Create list of genres
genres = []
for i in genresList:
genres.extend(i)
return np.asarray(genres)
There are repetition in our data so lets remove it first
genres = getgenresList()
genres = np.unique(genres)
genres.shape
(21,)
genres[:6]
array(['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary'],
dtype='<U15')
# lets plot them
revenue = {}
for i in genres:
revenue[i] = df[df.genres.str.contains(i)].revenue.sum()
keys = list(revenue.keys())
len(keys)
21
plt.figure(figsize=(20,20))
plt.title('Genre vs Revenue')
plt.xlabel('Movie Genre')
plt.ylabel('Revenue')
plt.bar(list(revenue.keys()), revenue.values())
plt.show()
From the above graph we can observe that the ‘Drama’ genre generates the maximum revenue among all the genres. And second highest is comedy genre followed by Action genre.
Are Action movies more popular than Romance movies?
- First we will plot the number of action movies vs romance movies.
- Then we will find out the popularity concerning each genre and plot it on a histogram
genres_cons = {}
genres_cons['Action'] = df.genres.str.contains('Action').sum()
genres_cons['Romance'] = df.genres.str.contains('Romance').sum()
# finding no. of moves of each type
plt.title('No. of movies Action v/s Romance')
plt.bar(list(genres_cons.keys()), genres_cons.values(), alpha=0.8, color=['r', 'b'])
plt.ylabel('Number of Movies')
plt.legend(loc='upper right')
plt.show()
# finding mean popularity of each type
action_romance = {}
action_romance['Action'] = df[df.genres.str.contains('Action')].popularity.mean()
action_romance['Romance'] = df[df.genres.str.contains('Romance')].popularity.mean()
action_romance
{'Action': 0.92650863979848852, 'Romance': 0.59262124795321625}
plt.bar(list(action_romance.keys()), action_romance.values(), alpha=0.8, label='Popularity of Action, Romance movies')
plt.ylabel('Popularity')
plt.title('Action v/s Romance')
plt.legend(loc='upper right')
plt.show()
Here, we can observe that the Action movies are much more popular than the Romance movies. as the Action movies have an avg popularity of 0.9265 whereas the Romance movies have an avg popularity of 0.5926
Conclusions
After analysing the TBDb dataset I can conclude the following:
- Action movies are much more popular than Romance movies.
- Runtime is deciding factor in the popularity of the movie. Popular movies have almost 125 minutes as runtime. Either less or more will make the movie less popular.
- Drama genre generates the maximum revenue on avg when compared to other genres.
Limitations
After analysing the TBDb dataset I can conclude the following:
- Almost half of the dataset has 0 revenue, budget, revenue_adj and budget_adj.
- I removed some of the outliers from the runtime, which were having runtime greater than 300 min as those were making the data more skew.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])
0