Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters. There are some odd characters in the ‘cast’ column. Don’t worry about cleaning them. You can leave them as is. The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
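The pipe-separated layout described above can be unpacked into one value per row. The snippet below is a minimal sketch on made-up rows (the titles and values are placeholders, not taken from the dataset), assuming a pandas version that provides `explode`.

```python
import pandas as pd

# Toy rows that mimic the pipe-separated layout (titles and values are made up)
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split each string into a list, then give every list element its own row
genres_long = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")
print(genres_long)
```

Each movie now appears once per genre, which makes counting and grouping by genre straightforward.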
The Movie Database (TMDb) is a community built movie and TV database. Every piece of data has been added by our amazing community dating back to 2008. TMDb's strong international focus and breadth of data is largely unmatched and something we're incredibly proud of. Put simply, we live and breathe community and that's precisely what makes us different.
In this presentation, the dataset is used to answer several questions for curious minds. For example, who are the most famous actors? How have movie genres changed over the years? Do revenue, budget and movie popularity correlate with each other? Let's start exploring the dataset.
#loading necessary libraries
import pandas as pd
import numpy as np
import operator
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Load your data
data = pd.read_csv('tmdb-movies.csv')
The TMDb dataset consists of 10866 rows and 21 columns.
print(data.shape)
Let's find out the names of the columns in this dataset.
print(list(data.columns.values))
Here are the first 7 rows of the TMDb dataset. Let's look at each column. The id column holds a unique value for each row, and each row represents one movie. Other columns describe financial values such as budget and revenue. The remaining columns include information such as the movie's genres, the production companies, the release_date and the crowd's votes.
data.head(7)
In this part, some basic data cleaning is performed. The cleaned columns are used in the cells that follow.
########################################
### remove rows with NaN in the cast or genres column,
### keeping only movies that have a listed cast and genre;
### also remove rows where budget_adj or revenue_adj is equal to zero
########################################
data = data[data["cast"].notnull()]
data = data[data["genres"].notnull()]
data = data[data.budget_adj != 0]
data = data[data.revenue_adj != 0]
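The same cleaning steps can be tried on a small example before running them on the full dataset. This is a minimal sketch on a made-up frame (all values are placeholders); it uses `dropna` with `subset` as an alternative way to express the NaN filters above.

```python
import numpy as np
import pandas as pd

# Made-up rows: one clean, one without cast, one without genres, one with zero budget
toy = pd.DataFrame({
    "cast": ["Actor A|Actor B", np.nan, "Actor C", "Actor D"],
    "genres": ["Drama", "Comedy", np.nan, "Action"],
    "budget_adj": [1.0e6, 2.0e6, 3.0e6, 0.0],
    "revenue_adj": [5.0e6, 1.0e6, 2.0e6, 4.0e6],
})

# Drop rows with a missing cast or genres value in one call
cleaned = toy.dropna(subset=["cast", "genres"])
# Keep only rows where both adjusted money columns are non-zero
cleaned = cleaned[(cleaned["budget_adj"] != 0) & (cleaned["revenue_adj"] != 0)]
print(f"{len(toy)} rows before cleaning, {len(cleaned)} rows after")
```

Printing the row count before and after makes it easy to see how much data each filter removes.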
In this section, basic descriptive statistics are presented for each numeric column of the dataset.
########################################
###basic descriptive statistics
####################################
data.describe()
Looking at the dataset, and especially at the cast listed for each movie, one question that arises is how many times each actor appears. In other words, which actors appear most often in the movies of this dataset? The following code snippet finds these actors.
######
# Build a dict that maps each actor to the number of movies they appear in
######
actor_dict = {}
actors = data["cast"]
actors = actors.str.split("|")
actors = np.array(actors)
for actorList in actors:
    # rows with a missing cast were dropped during cleaning, so each entry is a list of names
    for actor in actorList:
        actor = actor.lstrip()  # trim leading whitespace
        if actor not in actor_dict:
            actor_dict[actor] = 1
        else:
            actor_dict[actor] += 1
sorted_actor_dict = sorted(actor_dict.items(), key = operator.itemgetter(1), reverse = True)
#sorted_actor_dict[0:10]
x_axis = list()
y_axis = list()
for item in sorted_actor_dict[0:20]:
    x_axis.append(item[0])
    y_axis.append(item[1])
sns.set(rc={'figure.figsize':(12,10)}, font_scale=1.4)
# x and y are passed as keywords; recent seaborn releases no longer accept them positionally
ax = sns.barplot(x=x_axis, y=y_axis, palette="Set3")
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='actor names', ylabel='number of appearances', title = 'Top 20 actors based on the number of the appearances in movies')
plt.show()
The figure above shows that the five actors with the most appearances are Robert De Niro, Samuel L. Jackson, Bruce Willis, Nicolas Cage and Michael Caine.
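The counting loop above can also be written with `collections.Counter` from the standard library. The following is a minimal sketch on a made-up cast column (the actor names are placeholders), not a replacement for the notebook's own code.

```python
from collections import Counter

import pandas as pd

# Made-up cast strings in the same pipe-separated layout as the dataset
cast = pd.Series(["Actor A|Actor B", "Actor A| Actor C", "Actor B|Actor A"])

# Count every name once per movie it appears in, trimming stray whitespace
actor_counts = Counter(
    name.strip()
    for names in cast.str.split("|")
    for name in names
)

# most_common returns (name, count) pairs sorted by descending count
top_actors = actor_counts.most_common(2)
print(top_actors)
```

`most_common(20)` would give the pairs needed for the x and y axes of the bar plot.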
Let's explore the movies' genres over the years covered by the TMDb dataset. This part addresses several questions. First, which genre was the most popular in each year? Second, how many movies per genre were produced each year? Lastly, it would be useful to plot the number of movies per genre over all these years. The code snippets below investigate the dataset and produce answers to these questions. The first snippet prepares a dataframe containing the number of movies per genre for each year.
year_set = set()
genre_set = set()
genres_and_year = data[["genres", "release_year"]]
#########################
#create a set of unique years of movies
#########################
production_year = genres_and_year["release_year"]
production_year = production_year.drop_duplicates()
for year in production_year:
    if year not in year_set:
        year_set.add(year)
#print(year_set)
#############################################################
#create a set of unique genres by parsing all the years
#############################################################
for year in year_set:
    genre_dict = {}
    genres_in_year = genres_and_year[genres_and_year.release_year == year]
    genres_in_year = genres_in_year["genres"].values
    for elem in genres_in_year:
        genres_row = elem.split("|")
        for genre in genres_row:
            if genre not in genre_set:
                genre_set.add(genre)
    #print("year:", year, "\n", sorted(genre_dict.items(), key = operator.itemgetter(1), reverse = True))
##########################################################################
#create a dataframe which contains the sum of movies' genre per year
##########################################################################
# start from an integer table of zeros; sorting keeps the years in chronological order for the plots
gerne_count_per_year_df = pd.DataFrame(0, index = sorted(year_set), columns = sorted(genre_set))
for year in year_set:
    genre_dict = {}
    genres_in_year = genres_and_year[genres_and_year.release_year == year]
    genres_in_year = genres_in_year["genres"].values
    for elem in genres_in_year:
        genres_row = elem.split("|")
        for genre in genres_row:
            if genre not in genre_dict:
                genre_dict[genre] = 1
            else:
                genre_dict[genre] = genre_dict[genre] + 1
    # add this year's genre counts to the matching row of the table
    aux_df = pd.DataFrame(genre_dict, index = [year])
    gerne_count_per_year_df.loc[year, aux_df.columns] = gerne_count_per_year_df.loc[year, aux_df.columns] + aux_df.loc[year]
########################################################
###most popular genre of movies from year to year
########################################################
#print(gerne_count_per_year_df.apply( max, axis=1 ))
#print(gerne_count_per_year_df.idxmax(axis = 1))
most_popular_genre_by_year = pd.DataFrame([gerne_count_per_year_df.idxmax(axis = 1).values,
                                           gerne_count_per_year_df.apply( max, axis=1 ).values],
                                          columns = gerne_count_per_year_df.index,
                                          index = ["genre", 'counts'])
After running the code above, let's see which movie genre was the most popular in each year and how many movies belong to it. In the following table, each year shows its most popular movie genre and the number of movies in that genre.
most_popular_genre_by_year
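The year-by-genre count table and the most popular genre per year can also be derived with `explode` and `pd.crosstab`. This is a minimal sketch on made-up rows (years and genres are placeholders), offered as an alternative formulation rather than the method used in this notebook.

```python
import pandas as pd

# Made-up rows in the same layout as the genres and release_year columns
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001],
    "genres": ["Drama|Comedy", "Drama", "Action|Comedy", "Comedy"],
})

# One row per (movie, genre) pair
exploded = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")

# Rows are years, columns are genres, cells are movie counts
counts = pd.crosstab(exploded["release_year"], exploded["genres"])

# Most frequent genre in each year and how many movies it has
top_genre = counts.idxmax(axis=1)
top_count = counts.max(axis=1)
print(pd.DataFrame({"genre": top_genre, "counts": top_count}))
```

Because `crosstab` returns an integer table with a sorted index, the result can be passed directly to `plot.bar(stacked=True)` or `plot.area(stacked=True)`.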
The next two figures show the fluctuations of movie genres from year to year. Two different plots, a stacked bar plot and a stacked area plot, visualize how the genres change and trend over time.
sns.set(rc={'figure.figsize':(12,12)}, font_scale=1.3)
sns.set_palette("Set1", 20, .65)
ax = gerne_count_per_year_df.plot.bar(stacked=True);
ax.set(xlabel='years', ylabel='movies count', title = 'Stacked barplot showing the trend of different movie genres')
plt.show()
ax = gerne_count_per_year_df.plot.area(stacked=True);
ax.set(xlabel='years', ylabel='movies count', title = 'Stacked area plot showing the trend of different movie genres')
plt.show()
In general, the number of movies, and consequently the count for each movie genre, increases from 1960 to 2015. As we can see, the majority of the movie genres show an increasing trend. Drama appears to be the most frequent and prevalent genre throughout these years. Other categories such as Thriller, Comedy and Action show a similar pattern.
Examining the movie genres further, one may wonder how many movies were produced in each genre overall. The next code snippet and figure show the number of movies produced from 1960 to 2015 for each movie genre.
temp = gerne_count_per_year_df.apply(sum)
temp = temp.sort_values(ascending= False)
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.4)
# x and y are passed as keywords; recent seaborn releases no longer accept them positionally
ax = sns.barplot(x=temp.index, y=temp, palette="Set3")
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie genres', ylabel='movies counts', title = 'Number of Movies according to movie genres')
plt.show()
As we can see, Drama is more frequent than any other movie genre. According to this dataset, the top 3 movie genres over these years (1960 - 2015) are Drama, Comedy and Thriller.
Let's move on to other features of the TMDb dataset. It would be useful to find out which movies had the highest revenue, budget, popularity and average vote. So let's find the top 10 movies for each of these attributes.
The following code snippet produces a barplot of the top 10 movies based on their adjusted revenue.
###
#Top Movies based on different features
###
revenue_dict = {}
#fetching different columns with 2 different ways of code
movies_and_revenue = data[["original_title", "revenue_adj"]]
movies_and_budget = data[['original_title','budget_adj']]
movies_and_popularity = data[['original_title','popularity']]
movies_and_votes= data[['original_title','vote_average']]
#print(movies_and_revenue.sort_values(by="revenue_adj", ascending=False).head(10))
#print("\n")
#print(movies_and_budget.sort_values(by = "budget_adj", ascending = False).head(10))
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.3)
# sort once and reuse the result; x and y are passed as keywords for recent seaborn releases
top_revenue = movies_and_revenue.sort_values(by = "revenue_adj", ascending=False).head(10)
ax = sns.barplot(x=top_revenue.original_title, y=top_revenue.revenue_adj)
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie titles', ylabel='revenue adjusted', title = 'Top 10 movies based on their adjusted revenue')
plt.show()
According to the figure above, the top 5 movies in the dataset by adjusted revenue are Avatar, Star Wars, Titanic, The Exorcist and Jaws.
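Selecting the top rows by one column is a pattern that repeats for every attribute in this part. The sketch below wraps it in a small helper built on `DataFrame.nlargest`; the helper name `top_n` and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd

def top_n(frame, column, n=10, label="original_title"):
    """Return the n rows with the largest values in `column` (hypothetical helper)."""
    return frame.nlargest(n, column)[[label, column]]

# Made-up titles and revenues, only to show the call
toy = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "revenue_adj": [3.0e8, 9.0e8, 5.0e8],
})

top2 = top_n(toy, "revenue_adj", n=2)
print(top2)
```

The same call with `"budget_adj"`, `"popularity"` or `"vote_average"` as the column would return the rows for the other rankings.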
The following code snippet produces a barplot of the top 10 movies based on their adjusted budget.
#####
#Top 10 movies with the highest adjusted budget
#####
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.4)
# sort once and reuse the result; x and y are passed as keywords for recent seaborn releases
top_budget = movies_and_budget.sort_values(by="budget_adj", ascending=False).head(10)
ax = sns.barplot(x=top_budget.original_title, y=top_budget.budget_adj)
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie titles', ylabel='budget adjusted', title = 'Top 10 movies based on their adjusted budget')
plt.show()
According to the figure above, the top 5 movies in the dataset by adjusted budget are The Warrior's Way, Pirates of the Caribbean: On Strange Tides, Pirates of the Caribbean: At World's End, Superman Returns and Titanic.
The following code snippet produces a barplot of the top 10 movies based on their popularity.
#####
#Top 10 movie with the highest popularity
#####
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.4)
# sort once and reuse the result; x and y are passed as keywords for recent seaborn releases
top_popularity = movies_and_popularity.sort_values(by="popularity", ascending=False).head(10)
ax = sns.barplot(x=top_popularity.original_title, y=top_popularity.popularity)
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie titles', ylabel='popularity', title = 'Top 10 movies based on their popularity')
plt.show()
According to the figure above, the top 5 movies in the dataset by popularity are Jurassic World, Mad Max: Fury Road, Interstellar, Guardians of the Galaxy and Insurgent.
The following code snippet produces a barplot of the top 10 movies based on their average vote.
#####
#Top 10 movies with the highest average vote
#####
sns.set(rc={'figure.figsize':(12,9)}, font_scale=1.4)
# sort once and reuse the result; x and y are passed as keywords for recent seaborn releases
top_votes = movies_and_votes.sort_values(by="vote_average", ascending=False).head(10)
ax = sns.barplot(x=top_votes.original_title, y=top_votes.vote_average)
#rotate x-axis' text
for item in ax.get_xticklabels():
    item.set_rotation(85)
ax.set(xlabel='movie titles', ylabel='average vote', title = 'Top 10 movies based on their average vote')
plt.show()