Investigate TBDb Movie Data

Table of Contents

Introduction

TMDB is a movie metadata database, it contains the details of 5000 hollywood Moves. The data set I have picked from Kaggle and it is cleaned from the original dataset.

I would like to pose the following questions and Investigate them.

  • Does the runtime of the movie is a deciding factor on the popularity it receives?
  • Which gener of movies have the highest revenue collections on average?
  • Are action movies more popular than romance movies?
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

% matplotlib inline

Data Wrangling

In this section of we will collect the data from https://www.kaggle.com/tmdb/tmdb-movie-metadata and load it and make it use able and according to the need to answer our question

General Properties

# Load your data and print out a few lines. Perform operations to inspect data
# types and look for instances of missing or possibly errant data.
df = pd.read_csv('tmdb-movies.csv')
df.head(3)
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08

3 rows × 21 columns

# check the distribution of various features
df.describe()
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09

Data Cleaning

For cleaning the data I have done the following steps.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
# remove the columns which are not needed
df.drop(['imdb_id', 'homepage', 'keywords', 'overview', 'director','tagline', 'cast'], axis=1, inplace=True)
df.head(1)
id popularity budget revenue original_title runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 32.985763 150000000 1513528810 Jurassic World 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
# find the duplicates
df.duplicated().sum()
1

Lucikly, we have only one duplicate. We will simply drop it

# drop duplicates
df.drop_duplicates(inplace=True)

#check if any duplicate
df.duplicated().sum()
0
# check for nan values
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 14 columns):
id                      10865 non-null int64
popularity              10865 non-null float64
budget                  10865 non-null int64
revenue                 10865 non-null int64
original_title          10865 non-null object
runtime                 10865 non-null int64
genres                  10842 non-null object
production_companies    9835 non-null object
release_date            10865 non-null object
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: float64(4), int64(6), object(4)
memory usage: 1.2+ MB
  • Here, we can observe that there are unfilled/NaN values in production_companies, genres
  • Let’s fill these values with ‘Unknown’
  • Only a few (13) movies will have ‘Unknown’ generes
# fill values to nan cells
df.production_companies.fillna('Unknown', inplace=True)
df.genres.fillna('Unknown', inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 14 columns):
id                      10865 non-null int64
popularity              10865 non-null float64
budget                  10865 non-null int64
revenue                 10865 non-null int64
original_title          10865 non-null object
runtime                 10865 non-null int64
genres                  10865 non-null object
production_companies    10865 non-null object
release_date            10865 non-null object
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: float64(4), int64(6), object(4)
memory usage: 1.2+ MB

Now, all the values filled up and irrevelant colums removed. Now lets fix the data types

df.release_date.isna().sum()
0
# Extract year from the release_date using datetime and convert it to int

df['release_date'] = pd.to_datetime(df['release_date']).dt.year.astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 14 columns):
id                      10865 non-null int64
popularity              10865 non-null float64
budget                  10865 non-null int64
revenue                 10865 non-null int64
original_title          10865 non-null object
runtime                 10865 non-null int64
genres                  10865 non-null object
production_companies    10865 non-null object
release_date            10865 non-null int64
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: float64(4), int64(7), object(3)
memory usage: 1.2+ MB
# check results
df.head(2)
id popularity budget revenue original_title runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 32.985763 150000000 1513528810 Jurassic World 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 28.419936 150000000 378436354 Mad Max: Fury Road 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 2015 6185 7.1 2015 1.379999e+08 3.481613e+08
# lets see if everything is correct
df.describe()
id popularity budget revenue runtime release_date vote_count vote_average release_year budget_adj revenue_adj
count 10865.000000 10865.000000 1.086500e+04 1.086500e+04 10865.000000 10865.000000 10865.000000 10865.000000 10865.000000 1.086500e+04 1.086500e+04
mean 66066.374413 0.646446 1.462429e+07 3.982690e+07 102.071790 2004.653659 217.399632 5.975012 2001.321859 1.754989e+07 5.136900e+07
std 92134.091971 1.000231 3.091428e+07 1.170083e+08 31.382701 15.555301 575.644627 0.935138 12.813260 3.430753e+07 1.446383e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 1970.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.000000 0.207575 0.000000e+00 0.000000e+00 90.000000 1997.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20662.000000 0.383831 0.000000e+00 0.000000e+00 99.000000 2007.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75612.000000 0.713857 1.500000e+07 2.400000e+07 111.000000 2012.000000 146.000000 6.600000 2011.000000 2.085325e+07 3.370173e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 2069.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09

We can see that max of release date is 2069. Which can be possible, lets see which movie is this. This part is not related to the analysis but I am curious and want to check

df.query('release_date == 2069').id.count()
31

Finding Limitations

# minimum budget
df.budget.min()
0
df.query('budget == 0').id.count()
5696
# minimum revenue
df.revenue.min()
0
df.query('revenue == 0').id.count()
6016
# minimum runtime
df.runtime.min()
0
df.query('runtime == 0').id.count()
31

As we can clearly see that almost half of the dataset has 0 revenue, or budget. It is one of the limitation of our dataset

Some will also say that all the cast were working for free in these movies and revenue generated was donated, just for fun.

So, I will replace these zero values in the column budget, adj_budget, revenue, adj_revenue with the Mean of the respective column.

I would choosen Median because some popular movies will have way more revenue, budget then normal. Therefore, mean is not a good criteria. Since, median will be also 0. so lets take mean

# replacing 0's with median
df.budget.replace(0,df.budget.mean(), inplace=True)

df.budget_adj.replace(0,df.budget_adj.mean(), inplace=True)

df.revenue.replace(0,df.revenue.mean(), inplace=True)

df.revenue_adj.replace(0,df.revenue_adj.mean(), inplace=True)

Removing Outliers

Let’s remove movies with runtime higher than 300 minutes. They are few and only few watches them so, why to keep them and let our distribution suffer. And I am considering them as outliers here

df.query('runtime > 300').id.count()
22
df.drop(df.query('runtime > 300').index, inplace=True)

# Check if any movie > 320 runtime
df.query('runtime > 300').id.count()
0
# save our data
df.to_csv('tmdb-movies_cleaned.csv', index=False)

Exploratory Data Analysis

The the exploratory part I will be exploring the dataset in search of finding the answers to the below questions.

Does the runtime of the movie is a deciding factor on the popularity it receives?

Lets check the plots for our numerical data
And first we will plot histogram for movie runtime

df = pd.read_csv('tmdb-movies_cleaned.csv')
# plot histogram for runtime
plt.title('Runtime Distribution')
plt.xlabel("Runtime")
plt.ylabel('Movies count')
df.runtime.hist(figsize=(10,10));

png

Now lets plot a scatter plot for popularity and runtime, from where we’ll be able to see it the runtime affetct the popularity

print(df.runtime.mean())
print(df.runtime.median())
101.293830121
99.0
df.plot.scatter(x='runtime', y='popularity', title='Runtime vs Vote_average', color='b')
<matplotlib.axes._subplots.AxesSubplot at 0x7f9eb3d33d68>

png

It is almost like a normal distribution.
We can observe that the most of the popular movies are between 100 to 150 minutes, with a peak on 125. Popularity decreases as the movie time decreases increases around the peak. While the mean of the Movie runtime is 101 minutes. So, Yes runtime of the movie is a deciding factor in the popularity of the movie.


Which gener of movies have the highest revenue collections on average?

df.revenue.describe()
count    1.084300e+04
mean     6.191971e+07
std      1.111289e+08
min      2.000000e+00
25%      3.982690e+07
50%      3.982690e+07
75%      3.982690e+07
max      2.781506e+09
Name: revenue, dtype: float64
# lets check the behaviour of revinue year wise
plt.title('Revenue v/s Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
df.groupby('release_year').revenue.mean().plot(color='g')
<matplotlib.axes._subplots.AxesSubplot at 0x7f9eb3c872e8>

png

We can clearly see that the revenue is increasing each year except for once between 2000 and 2010. That could be because of the 2008 Recessions but it is out of contex and we should not talk about it here.
So, the revenue is increasing Lets see which genre of movie is talking all the money.

def getgenresList():
    # Separate the genre based on the pipe (|).
    genresList = df['genres'].str.split('|')
    # Create list of genres
    genres = []
    for i in genresList:
        genres.extend(i)
    return np.asarray(genres)

There are repetition in our data so lets remove it first

genres = getgenresList()
genres = np.unique(genres)
genres.shape
(21,)
genres[:6]
array(['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary'], 
      dtype='<U15')
# lets plot them
revenue = {}
for i in genres:
    revenue[i] = df[df.genres.str.contains(i)].revenue.sum()
keys = list(revenue.keys())
len(keys)
21
plt.figure(figsize=(20,20))
plt.title('Genre vs Revenue')
plt.xlabel('Movie Genre')
plt.ylabel('Revenue')
plt.bar(list(revenue.keys()), revenue.values())
plt.show()

png

From the above graph we can observe that the ‘Drama’ genre generates the maximum revenue among all the genres. And second highest is comedy genre followed by Action genre.

genres_cons = {}
genres_cons['Action'] = df.genres.str.contains('Action').sum()
genres_cons['Romance'] = df.genres.str.contains('Romance').sum()
# finding no. of moves of each type
plt.title('No. of movies Action v/s Romance')
plt.bar(list(genres_cons.keys()), genres_cons.values(), alpha=0.8, color=['r', 'b'])
plt.ylabel('Number of Movies')
plt.legend(loc='upper right')
plt.show()

png

# finding mean popularity of each type
action_romance = {}
action_romance['Action'] = df[df.genres.str.contains('Action')].popularity.mean()
action_romance['Romance'] = df[df.genres.str.contains('Romance')].popularity.mean()
action_romance
{'Action': 0.92650863979848852, 'Romance': 0.59262124795321625}
plt.bar(list(action_romance.keys()), action_romance.values(), alpha=0.8, label='Popularity of Action, Romance movies')
plt.ylabel('Popularity')
plt.title('Action v/s Romance')
plt.legend(loc='upper right')
plt.show()

png

Here, we can observe that the Action movies are much more popular than the Romance movies. as the Action movies have an avg popularity of 0.9265 whereas the Romance movies have an avg popularity of 0.5926

Conclusions

After analysing the TBDb dataset I can conclude the following:

Limitations

After analysing the TBDb dataset I can conclude the following:

from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])
0