Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters. There are some odd characters in the ‘cast’ column. Don’t worry about cleaning them. You can leave them as is. The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
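As a minimal sketch of how such pipe-separated columns can be handled, the snippet below splits a column on `|` and gives every value its own row. The toy titles and genres are made up for illustration, and `explode` assumes pandas 0.25 or newer.

```python
import pandas as pd

# Toy rows in the same pipe-separated layout as the 'genres' column
# (the titles and values are invented for this example).
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split each string on '|' into a list, then give every genre its own row.
toy_long = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")

# Count how many movies carry each genre.
genre_counts = toy_long["genres"].value_counts()
print(toy_long)
print(genre_counts)
```

The same pattern should apply to the 'cast' column, since it uses the same separator.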
The Movie Database (TMDb) is a community built movie and TV database. Every piece of data has been added by our amazing community dating back to 2008. TMDb's strong international focus and breadth of data is largely unmatched and something we're incredibly proud of. Put simply, we live and breathe community and that's precisely what makes us different.
In this presentation, various questions about this dataset will be answered for curious minds. For example: which actors appear in the most movies? How did movie genres change over the years? Do revenue, budget and movie popularity correlate with each other? So let's start exploring the dataset.
#loading necessary libraries
import pandas as pd
import numpy as np
import operator
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Load your data
data = pd.read_csv('tmdb-movies.csv')
The TMDb dataset consists of 10866 rows and 21 columns.
print(data.shape)
Let's find out the names of the columns in this dataset.
print(list(data.columns.values))
Here are the first 7 rows of the TMDb dataset. Let's look at the columns. The id column holds a unique value for each row, and each row represents one movie. Other columns describe financial values such as budget and revenue. The remaining columns include information such as the genres of the movie, the production companies, the release_date and the users' votes.
data.head(7)
In this part some basic data cleaning is performed. The cleaned dataframe is used in all the cells that follow.
########################################
###removing rows with nan values in the cast or genres column
###also removing rows where revenue_adj or budget_adj is equal to zero
####################################
data = data[data["cast"].notnull()]
data = data[data["genres"].notnull()]
data = data[data.budget_adj != 0]
data = data[data.revenue_adj != 0]
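To make the effect of these filters easier to check, here is a small sketch that applies the same four conditions to a made-up dataframe as a single boolean mask. The toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented rows: only the first one passes all four conditions.
toy = pd.DataFrame({
    "cast": ["Actor A|Actor B", np.nan, "Actor C", "Actor D"],
    "genres": ["Drama", "Comedy", np.nan, "Action"],
    "budget_adj": [1.0e6, 2.0e6, 3.0e6, 0.0],
    "revenue_adj": [5.0e6, 1.0e6, 2.0e6, 4.0e6],
})

# Same conditions as the cleaning cell, combined into one mask.
keep = (
    toy["cast"].notnull()
    & toy["genres"].notnull()
    & (toy["budget_adj"] != 0)
    & (toy["revenue_adj"] != 0)
)
cleaned = toy[keep]
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Printing the row counts before and after is a cheap way to see how much data the cleaning step discards.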
This section presents basic descriptive statistics for each numeric column of the dataset.
########################################
###basic descriptive statistics
####################################
data.describe()
Just by looking at the dataset, and especially at the cast of each movie, one question that arises is the number of appearances per actor. In other words: which actors appear most often in the movies of this dataset? The following code snippet finds these actors.
######
#Creating a dict with the number of movies each actor appears in
######
actor_dict = {}
actors = data["cast"].str.split("|")
for actorList in actors:
    #rows with a missing cast were removed above, so every entry is a list of names
    for actor in actorList:
        actor = actor.strip() #trim the whitespaces
        if actor not in actor_dict:
            actor_dict[actor] = 1
        else:
            actor_dict[actor] += 1
sorted_actor_dict = sorted(actor_dict.items(), key = operator.itemgetter(1), reverse = True)
#keep the 20 actors with the most appearances
x_axis = list()
y_axis = list()
for item in sorted_actor_dict[0:20]:
    x_axis.append(item[0])
    y_axis.append(item[1])
sns.set(rc={'figure.figsize':(12,10)}, font_scale=1.4)
#x and y are passed as keywords; recent seaborn versions reject positional data arguments
ax = sns.barplot(x = x_axis, y = y_axis, palette="Set3")
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='actor names', ylabel='number of appearances', title = 'Top 20 actors based on the number of appearances in movies')
plt.show()
It is clear from the figure above that the 5 actors with the largest number of appearances are Robert De Niro, Samuel L. Jackson, Bruce Willis, Nicolas Cage and Michael Caine.
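The same counting idea can be sketched more compactly with `collections.Counter` from the standard library. This is an alternative formulation rather than the code used above, and the actor names are placeholders.

```python
from collections import Counter

import pandas as pd

# Placeholder cast strings in the dataset's pipe-separated layout.
cast = pd.Series(["Actor A|Actor B", "Actor B|Actor C", "Actor B"])

# Count every name once per movie it appears in.
actor_counts = Counter(
    name.strip()
    for cast_list in cast.str.split("|")
    for name in cast_list
)

# most_common(n) returns the n (name, count) pairs with the highest counts.
top_actors = actor_counts.most_common(2)
print(top_actors)
```

`most_common` removes the need for the explicit sort and for the two helper lists that feed the bar plot.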
Let's explore the movies' genres over the years covered by the TMDb dataset. This part contains several questions. First, which genre was the most popular in each year? Second, how many movies per genre were produced in each year? Lastly, it would be great to plot the number of movies per genre over all these years. The code snippets below investigate the dataset and produce the answers to these questions. The first snippet prepares a dataframe containing the number of movies per genre for each year.
year_set = set()
genre_set = set()
genres_and_year = data[["genres", "release_year"]]
#########################
#create a set of unique years of movies
#########################
for year in genres_and_year["release_year"]:
    year_set.add(year)
#############################################################
#create a set of unique genres by parsing all the rows
#############################################################
for elem in genres_and_year["genres"].values:
    for genre in elem.split("|"):
        genre_set.add(genre)
##########################################################################
#create a dataframe which contains the number of movies per genre per year
##########################################################################
#sorted labels keep the years in chronological order; filling with 0 gives integer columns
gerne_count_per_year_df = pd.DataFrame(0, index = sorted(year_set), columns = sorted(genre_set))
for year in year_set:
    genre_dict = {}
    genres_in_year = genres_and_year[genres_and_year.release_year == year]
    genres_in_year = genres_in_year["genres"].values
    for elem in genres_in_year:
        genres_row = elem.split("|")
        for genre in genres_row:
            if genre not in genre_dict:
                genre_dict[genre] = 1
            else:
                genre_dict[genre] = genre_dict[genre] + 1
    #add this year's counts to the matching row of the dataframe
    aux_df = pd.DataFrame(genre_dict, index = [year])
    gerne_count_per_year_df.loc[year, aux_df.columns] = gerne_count_per_year_df.loc[year, aux_df.columns] + aux_df.loc[year]
########################################################
###most popular genre of movies from year to year
########################################################
#idxmax gives the genre with the highest count per year, max gives that count
most_popular_genre_by_year = pd.DataFrame([gerne_count_per_year_df.idxmax(axis = 1).values,
                                           gerne_count_per_year_df.max(axis = 1).values],
                                          columns = gerne_count_per_year_df.index,
                                          index = ["genre", 'counts'])
After executing the code above, let's see which movie genre was the most popular in each year and how many movies belong to it. In the following table, each year shows its most popular movie genre and the number of movies in that genre.
most_popular_genre_by_year
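For comparison, a year-by-genre count table like the one built above can also be sketched with `explode` and `pd.crosstab`. This is an alternative under the assumption of pandas 0.25 or newer, shown on made-up rows rather than on the real dataset.

```python
import pandas as pd

# Invented movies: two from 2000 and two from 2001.
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001],
    "genres": ["Drama|Comedy", "Drama", "Action", "Action|Drama"],
})

# One row per (movie, genre) pair.
long = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")

# Rows are years, columns are genres, cells are movie counts.
genre_counts = pd.crosstab(long["release_year"], long["genres"])

top_genre = genre_counts.idxmax(axis=1)  # most frequent genre per year
top_count = genre_counts.max(axis=1)     # number of movies in that genre
print(genre_counts)
```

`crosstab` fills missing year and genre combinations with 0, so no separate initialisation step is needed.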
The next 2 figures show the fluctuations of the movie genres from year to year. Two different plots are used, a stacked bar plot and a stacked area plot, to visualize how the genre counts change over time.
sns.set(rc={'figure.figsize':(12,12)}, font_scale=1.3)
sns.set_palette("Set1", 20, .65)
ax = gerne_count_per_year_df.plot.bar(stacked=True);
ax.set(xlabel='years', ylabel='movies count', title = 'Stacked barplot showing the trend of different movie genres')
plt.show()
ax = gerne_count_per_year_df.plot.area(stacked=True);
ax.set(xlabel='years', ylabel='movies count', title = 'Stacked area plot showing the trend of different movie genres')
plt.show()
In general, the number of movies, and consequently the counts per genre, increase from 1960 to 2015. As we can see, the majority of the movie genres show an increasing trend. Drama is the most frequent and prevalent genre throughout these years. Other categories such as Thriller, Comedy and Action show a similar pattern.
Examining the movie genres further, one may wonder about the total number of movies per genre. The next code snippet and figure show the number of movies produced from 1960 to 2015 for each genre.
#total number of movies per genre over all years, largest first
temp = gerne_count_per_year_df.sum()
temp = temp.sort_values(ascending= False)
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.4)
ax = sns.barplot(x = temp.index, y = temp.values, palette="Set3")
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie genres', ylabel='movies counts', title = 'Number of Movies according to movie genres')
plt.show()
As we can see, Drama is more frequent than any other genre. According to this dataset, the 3 dominant movie genres over these years (1960 - 2015) are Drama, Comedy and Thriller.
Moving on to other features of the TMDb dataset, it would be useful to find out which movies had the highest budget, revenue, popularity and average vote. So let's find the top 10 movies for each of these attributes.
The following code snippet produces the barplot representing the top 10 movies based on their adjusted revenue.
###
#Top Movies based on different features
###
#select the title together with each feature of interest
movies_and_revenue = data[["original_title", "revenue_adj"]]
movies_and_budget = data[['original_title','budget_adj']]
movies_and_popularity = data[['original_title','popularity']]
movies_and_votes= data[['original_title','vote_average']]
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.3)
#sort once and keep the 10 movies with the highest adjusted revenue
top_revenue = movies_and_revenue.sort_values(by = "revenue_adj", ascending=False).head(10)
ax = sns.barplot(x = top_revenue.original_title, y = top_revenue.revenue_adj)
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie titles', ylabel='revenue adjusted', title = 'Top 10 movies based on their adjusted revenue')
plt.show()
According to the figure above, the top 5 movies in the dataset based on their adjusted revenue are the following: Avatar, Star Wars, Titanic, The Exorcist and Jaws.
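A top-N selection like this one can also be written with `DataFrame.nlargest`, which avoids sorting the whole frame by hand. The sketch below uses invented titles and revenues, not values from the dataset.

```python
import pandas as pd

# Invented titles and revenues for illustration.
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "revenue_adj": [3.0e8, 9.0e8, 1.0e8, 5.0e8],
})

# Keep the 2 rows with the largest adjusted revenue, largest first.
top2 = toy.nlargest(2, "revenue_adj")
print(top2)
```

The same call works for the other features by changing the column name.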
The following code snippet produces the barplot representing the top 10 movies based on their adjusted budget.
#####
#Top 10 movies with the highest adjusted budget
#####
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.4)
top_budget = movies_and_budget.sort_values(by="budget_adj", ascending=False).head(10)
ax = sns.barplot(x = top_budget.original_title, y = top_budget.budget_adj)
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie titles', ylabel='budget adjusted', title = 'Top 10 movies based on their adjusted budget')
plt.show()
According to the figure above, the top 5 movies in the dataset based on their adjusted budget are the following: The Warrior's Way, Pirates of the Caribbean: On Strange Tides, Pirates of the Caribbean: At World's End, Superman Returns and Titanic.
The following code snippet produces the barplot representing the top 10 movies based on their popularity.
#####
#Top 10 movies with the highest popularity
#####
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.4)
top_popularity = movies_and_popularity.sort_values(by="popularity", ascending=False).head(10)
ax = sns.barplot(x = top_popularity.original_title, y = top_popularity.popularity)
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie titles', ylabel='popularity', title = 'Top 10 movies based on their popularity')
plt.show()
According to the figure above, the top 5 movies in the dataset based on their popularity are the following: Jurassic World, Mad Max: Fury Road, Interstellar, Guardians of the Galaxy and Insurgent.
The following code snippet produces the barplot representing the top 10 movies based on their average vote.
#####
#Top 10 movies with the highest average vote
#####
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.4)
top_votes = movies_and_votes.sort_values(by="vote_average", ascending=False).head(10)
ax = sns.barplot(x = top_votes.original_title, y = top_votes.vote_average)
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie titles', ylabel='average vote', title = 'Top 10 movies based on their average vote')
plt.show()
According to the figure above, the top 5 movies in the dataset based on their average vote are the following: The Shawshank Redemption, Stop Making Sense, The Godfather, Whiplash and Pulp Fiction.
Despite the beautiful plots, the figures above may be surprising. One might have expected the top movies, and especially the top 5, to be related across the four attributes (adjusted revenue, adjusted budget, popularity and average vote), or even assumed that the top 5 movies would be the same from feature to feature. On the contrary, the top 5 lists above are almost entirely different: only Titanic appears in more than one of them.
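One way to check such an overlap in code, instead of comparing the figures by eye, is to intersect the sets of top titles. The sketch below is a hypothetical helper on made-up data; `top_titles` is not part of the notebook.

```python
import pandas as pd

# Invented movies with two of the features discussed above.
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "revenue_adj": [9.0e8, 5.0e8, 1.0e8, 3.0e8],
    "vote_average": [6.1, 7.9, 8.3, 5.0],
})


def top_titles(df, column, n):
    """Return the set of titles of the n rows with the largest `column`."""
    return set(df.nlargest(n, column)["original_title"])


top_revenue = top_titles(toy, "revenue_adj", 2)
top_votes = top_titles(toy, "vote_average", 2)

# Titles that appear in both top lists.
shared = top_revenue & top_votes
print(shared)
```

An empty or small intersection would confirm that the rankings differ from feature to feature.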
Let's move on to another topic: the movies' average votes. Let's look at their distribution. The following code draws a boxplot and a distribution plot of the ratings from 1960 to 2015, which are centred at about 6. A second plot, further below, shows the distribution of the ratings by year.
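#####
#movie ratings' distribution all over the years
#####
sns.set(rc={'figure.figsize':(15,15)}, font_scale=1.3)
temp_df = data[["vote_average"]]
sns.set_style("whitegrid")
#draw the boxplot and the histogram on separate axes that share the x-axis
fig, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (0.2, 0.8)})
sns.boxplot(x = temp_df.vote_average, ax = ax_box)
#histplot replaces the deprecated distplot (requires seaborn 0.11 or newer)
sns.histplot(temp_df.vote_average, kde=True, ax = ax_hist)
ax_box.set(xlabel='', title = 'average votes distribution')
ax_hist.set(xlabel='average votes')
plt.show()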
The previous plot shows that the ratings over all these years (1960 - 2015) are centred at about 6. What about the ratings in a specific year? The following code snippet creates a plot showing the distribution of the ratings per year.
#####
#movie ratings' distributions per year
#####
sns.set(rc={'figure.figsize':(15,15)}, font_scale=1.3)
temp_df = data[["release_year", "vote_average"]]
sns.set_style("whitegrid")
ax = sns.violinplot(x = temp_df.vote_average, y = temp_df.release_year, orient ="h")
ax.set(xlabel='movie ratings distributions', ylabel='years', title = 'movie ratings distributions per year')
plt.show()
The previous figure illustrates that in most years the ratings are centred around 6 to 6.5. There are some exceptions, however, such as 1974, where the ratings are centred around 7. It seems that great movies with a high impact on the audience were produced during that time.
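To put numbers next to the violin plot, the per-year centre of the ratings can be computed with a `groupby`. The following is a small sketch on invented ratings, not the values from the dataset.

```python
import pandas as pd

# Invented ratings for two release years.
toy = pd.DataFrame({
    "release_year": [1974, 1974, 2015, 2015, 2015],
    "vote_average": [7.0, 8.0, 5.0, 6.0, 7.0],
})

# Mean, median and number of movies for every release year.
per_year = toy.groupby("release_year")["vote_average"].agg(["mean", "median", "count"])
print(per_year)
```

Including the count is a design choice: a year with only a few movies can show a high mean that says little about the period as a whole.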
This section deals with correlations. It was inspired by Question 3, where we looked at the top 5 movies based on several characteristics (adjusted revenue, adjusted budget, popularity and average vote). We expected the top 5 movies to be the same regardless of the feature, but this was not the case. To investigate further, scatterplots and correlations between the adjusted revenue, the adjusted budget, the movies' popularity and the average vote were produced. The code below produces a scatterplot for each pair of these 4 variables.
#####
#correlation plots
#####
#select the 4 variables of interest
aux_df = data[['revenue_adj', 'budget_adj', 'popularity', 'vote_average']]
sns.set(rc={'figure.figsize':(15,15)}, font_scale=1.3, style="ticks")
f1 = sns.jointplot(x = "budget_adj", y = "revenue_adj", kind = "scatter", data = aux_df)
f1.fig.suptitle('scatterplot and correlation for budget_adj and revenue_adj')
f2 = sns.jointplot(x = "budget_adj", y = "popularity", kind = "scatter", data = aux_df)
f2.fig.suptitle('scatterplot and correlation for budget_adj and popularity')
f3 = sns.jointplot(x = "budget_adj", y = "vote_average", kind = "scatter", data = aux_df)
f3.fig.suptitle('scatterplot and correlation for budget_adj and vote_average')
f4 = sns.jointplot(x = "revenue_adj", y = "popularity", kind = "scatter", data = aux_df)
f4.fig.suptitle('scatterplot and correlation for revenue_adj and popularity')
f5 = sns.jointplot(x = "revenue_adj", y = "vote_average", kind = "scatter", data = aux_df)
f5.fig.suptitle('scatterplot and correlation for revenue_adj and vote_average')
f6 = sns.jointplot(x = "popularity", y = "vote_average", kind = "scatter", data = aux_df)
f6.fig.suptitle('scatterplot and correlation for popularity and vote_average')
According to the Pearson coefficient, there is a positive correlation between the adjusted revenue, the adjusted budget and the popularity. Moreover, there is a weak positive correlation between the average vote and the other 3 variables (adjusted revenue, adjusted budget and popularity).
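The Pearson coefficients can also be read directly from a correlation matrix. The sketch below uses `DataFrame.corr` on made-up numbers chosen so that the result is easy to verify by hand; it does not reproduce the coefficients of the real dataset.

```python
import pandas as pd

# Invented values: revenue is exactly twice the budget, so their Pearson
# coefficient is 1, while the votes are built to be uncorrelated with budget.
toy = pd.DataFrame({
    "budget_adj": [1.0, 2.0, 3.0, 4.0],
    "revenue_adj": [2.0, 4.0, 6.0, 8.0],
    "vote_average": [7.0, 5.0, 8.0, 6.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr(method="pearson")
print(corr.round(2))
```

Applied to the four columns used in the scatterplots, the same call should give all six pairwise coefficients in one table.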
If we want to see all these relations in a single plot, seaborn's pairplot provides this functionality.
f1 = sns.pairplot(aux_df, kind="reg", diag_kind="kde", diag_kws=dict(fill=True))
f1.fig.suptitle('scatterplots for budget_adj, revenue_adj, popularity and vote_average\n')
f1.fig.tight_layout(rect=[0, 0.03, 1, 0.95])
This dataset is very rich in information. Its main limitation is the presence of null and zero values in some features. These values hinder the analysis, so the rows that contain them have to be removed. For example, null values were an obstacle when I was analyzing the most frequently cast actors. Furthermore, zero values produce misleading results in the correlation plots and in the computation of the Pearson correlation. Hence data cleaning is a necessary step before investigating the dataset. There are many famous actors, like Robert De Niro, who were cast in many films over these years. There are 20 unique movie genres, and Drama is the most frequent one, with an increasing trend over the years. Finally, there is a positive correlation between some of the features of the TMDb dataset, namely the adjusted revenue, the adjusted budget and the popularity.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])