The Enron scandal was a financial scandal that eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas, and the de facto dissolution of Arthur Andersen, which was one of the five largest audit and accountancy partnerships in the world. In addition to being the largest bankruptcy reorganization in American history at that time, Enron was cited as the biggest audit failure.
Enron was formed in 1985 by Kenneth Lay after merging Houston Natural Gas and InterNorth. Several years later, when Jeffrey Skilling was hired, he developed a staff of executives that – by the use of accounting loopholes, special purpose entities, and poor financial reporting – were able to hide billions of dollars in debt from failed deals and projects. Chief Financial Officer Andrew Fastow and other executives not only misled Enron's Board of Directors and Audit Committee on high-risk accounting practices, but also pressured Arthur Andersen to ignore the issues.
Enron shareholders filed a $40 billion lawsuit after the company's stock price, which achieved a high of US$90.75 per share in mid-2000, plummeted to less than $1 by the end of November 2001. The U.S. Securities and Exchange Commission (SEC) began an investigation, and Houston rival Dynegy offered to purchase the company at a very low price. The deal failed, and on December 2, 2001, Enron filed for bankruptcy under Chapter 11 of the United States Bankruptcy Code. Enron's $63.4 billion in assets made it the largest corporate bankruptcy in U.S. history until WorldCom's bankruptcy the next year.
Many executives at Enron were indicted on a variety of charges and some were later sentenced to prison. Enron's auditor, Arthur Andersen, was found guilty in a United States District Court of illegally destroying documents relevant to the SEC investigation, which voided its license to audit public companies and effectively closed the business. By the time the ruling was overturned by the U.S. Supreme Court, the company had lost the majority of its customers and had ceased operating. Enron employees and shareholders received limited returns in lawsuits, despite losing billions in pensions and stock value. As a consequence of the scandal, new regulations and legislation were enacted to improve the accuracy of financial reporting for public companies. One piece of legislation, the Sarbanes–Oxley Act, increased penalties for destroying, altering, or fabricating records in federal investigations or for attempting to defraud shareholders. The act also increased the accountability of auditing firms, requiring them to remain unbiased and independent of their clients (source).
The goal of this project is to use the Enron dataset to train a machine learning algorithm to detect the possibility of fraud (i.e., to identify persons of interest). Since we know the persons of interest (POIs) in our dataset, we can use supervised learning algorithms to construct our POI identifier. This will be done by picking the features within our dataset that best separate our POIs from our non-POIs.
We will start our analysis by answering some questions about our data. Then, we will explore our features further by visualizing any correlations and outliers. Next, we will transform/scale our features and select those that will be most useful in our POI identifier, engineering new features and adding them to the dataset if they prove useful for our analysis. We will identify at least two algorithms that may be best suited for our particular set of data and test them, tuning our parameters until optimal performance is reached. In our final analysis, the fitted algorithm will be validated using our training/testing data. Using performance metrics to evaluate our results, any problems will be addressed and modifications made. In our final thoughts, the performance of our final algorithm will be discussed.
In this section we will introduce the Enron dataset, load the necessary libraries, perform an introductory exploration, and preprocess the data by cleaning noisy entries and transforming feature types.
"""Loading the necessary libraries and the enron dataset"""
import pickle
import sklearn
import pandas as pd
import numpy as np
import pprint
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
# load the Enron dataset from the pickled dictionary
with open("final_project_dataset.pkl", "rb") as f:
    enron_dict = pickle.load(f)
for key in enron_dict.keys():
    print(key)
pprint.pprint(enron_dict["METTS MARK"])
print('Number of People in Dataset: ', len(enron_dict))
# converting the dictionary to a pandas dataframe
df = pd.DataFrame.from_dict(enron_dict, orient='index')
df["name"] = df.index
df.index = range(len(df))
Based on the PDF file enron61702insiderpay.pdf, the entries TOTAL and THE TRAVEL AGENCY IN THE PARK do not contain any vital information and are noise in our dataframe, hence they have to be removed.
# removing unnecessary entries
df = df[df["name"] != "TOTAL"]
df = df[df["name"] != "THE TRAVEL AGENCY IN THE PARK"]
# inspecting the data types for all the features
dtype_df = df.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df.groupby("Column Type").aggregate('count').reset_index()
There is a problem: all the numeric features are stored as object dtype and must be converted to numeric types.
#printing all the types for all the features in the dataframe
print(df.dtypes)
# convert all columns except name, email_address, and poi to numeric types
non_numeric_cols = ['name', 'email_address', 'poi']
numeric_cols = df.columns[~df.columns.isin(non_numeric_cols)]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')
df.head()
# inspecting the data types for all the features
dtype_df = df.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df.groupby("Column Type").aggregate('count').reset_index()
The bool datatype is the POI feature, the two object datatypes are the email_address and name features, and the remaining 19 features are the financial and email features.
df.describe()
At first glance, many features have ranges that differ widely from one another. This indicates that later we may have to scale them, for example with a MinMaxScaler from sklearn, for the machine learning algorithms to fit the cases even better.
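As a quick illustration of what such scaling would do, here is a minimal sketch with made-up numbers (not project results): MinMaxScaler maps each column onto the [0, 1] range.
# a minimal sketch (illustrative numbers, not project data) of how
# MinMaxScaler rescales a feature to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler
toy = pd.DataFrame({'salary': [200000.0, 350000.0, 1000000.0]})
print(MinMaxScaler().fit_transform(toy))  # [[0.], [0.1875], [1.]]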
na_df = df.isna().sum()
na_df.sort_values()
There are numerous NaNs in both the email and financial fields. According to the official PDF documentation for the financial (payment and stock) data, values of NaN represent 0, not unknown quantities. However, for the email data, NaNs are unknown information. Therefore, I will replace any financial data that is NaN with 0, but will fill in the NaNs for the email data with the median of the column grouped by person of interest. In other words, if a person has a NaN value for 'to_messages' and they are a person of interest, I will fill in that value with the median value of 'to_messages' for persons of interest. If I chose to drop all NaNs, that would reduce the size of what is already a small dataset. As the quality of a machine learning model is proportional to the amount of data fed into it, I am hesitant to remove any information that could possibly be of use.
from sklearn.impute import SimpleImputer
# Fill in the NaN payment and stock values with zero
financial_cols = ['salary', 'bonus', 'long_term_incentive', 'deferred_income',
                  'deferral_payments', 'other', 'expenses', 'director_fees',
                  'total_payments', 'exercised_stock_options', 'restricted_stock',
                  'restricted_stock_deferred', 'total_stock_value', 'loan_advances']
df[financial_cols] = df[financial_cols].fillna(0)
# Fill in the NaN email data with the median of each column, grouped by poi/non-poi
imp = SimpleImputer(missing_values=np.nan, strategy='median')
email_cols = ['from_poi_to_this_person', 'from_messages', 'to_messages',
              'from_this_person_to_poi', 'shared_receipt_with_poi']
df_poi = df[df['poi'] == True].copy()
df_nonpoi = df[df['poi'] == False].copy()
# the imputer computes the median of each column independently
df_poi[email_cols] = imp.fit_transform(df_poi[email_cols])
df_nonpoi[email_cols] = imp.fit_transform(df_nonpoi[email_cols])
df = pd.concat([df_poi, df_nonpoi]).sort_index()
df.head()
In general, the dataset consists of the following variables:
The financial features
Payments
'salary', Reflects items such as base salary, executive cash allowances, and benefits payments
'bonus', Reflects annual cash incentives paid based upon company performance. Also may include other retention payments.
'long_term_incentive', Reflects long-term incentive cash payments from various long-term incentive programs designed to tie executive compensation to long-term success as measured against key performance drivers and business objectives over a multi-year period, generally 3 to 5 years.
'deferred_income', Reflects voluntary executive deferrals of salary, annual cash incentives, and long-term cash incentives as well as cash fees deferred by non-employee directors under a deferred compensation arrangement. May also reflect deferrals under a stock option or phantom stock unit in lieu of cash arrangement.
'deferral_payments', Reflects distributions from a deferred compensation arrangement due to termination of employment or due to in-service withdrawals as per plan provisions.
'loan_advances', Reflects total amount of loan advances, excluding repayments, provided by the Debtor in return for a promise of repayment. In certain instances, the terms of the promissory notes allow for the option to repay with stock of the company.
'other', Reflects items such as payments for severance, consulting services, relocation costs, tax advances and allowances for employees on international assignment (i.e. housing allowances, cost of living allowances, payments under Enron’s Tax Equalization Program, etc.). May also include payments provided with respect to employment agreements, as well as imputed income amounts for such things as use of corporate aircraft.
'expenses', Reflects reimbursements of business expenses. May include fees paid for consulting services.
'director_fees', Reflects cash payments and/or value of stock grants made in lieu of cash payments to non-employee directors.
'total_payments'
Stock Value In 1998, 1999 and 2000, Debtor and non-debtor affiliates were charged for options granted. The Black-Scholes method was used to determine the amount to be charged. Any amounts charged to Debtor and non-debtor affiliates associated with the options exercised related to these three years have not been subtracted from the share value amounts shown.
'exercised_stock_options', Reflects amounts from exercised stock options which equal the market value in excess of the exercise price on the date the options were exercised either through cashless (same-day sale), stock swap or cash exercises. The reflected gain may differ from that realized by the insider due to fluctuations in the market price and the timing of any subsequent sale of the securities.
'restricted_stock', Reflects the gross fair market value of shares and accrued dividends (and/or phantom units and dividend equivalents) on the date of release due to lapse of vesting periods, regardless of whether deferred.
'restricted_stock_deferred', Reflects value of restricted stock voluntarily deferred prior to release under a deferred compensation arrangement
'total_stock_value'
The email features
'from_poi_to_this_person'
'from_messages'
'from_this_person_to_poi'
'shared_receipt_with_poi'
'to_messages'
Non-informative features
'email_address'
'name'
Feature of Interest / Target Variable
'poi'
In this section, I will investigate the features one by one, inspecting each one's histogram and outliers and viewing its descriptive statistics (mean, mode, median, variance, standard deviation, skewness, kurtosis, etc.).
#number of POI in dataset
print('There are 18 POI in our Dataset as you can see by our "True" count')
df.poi.value_counts()
f, ax = plt.subplots()
#sns.set(style="white", context="talk")
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set(font_scale=1.5)
sns.distplot(df.poi, kde=False, color='green', ax=ax).set_title("POIs Distribution")
ax.set(ylabel="count", xlabel="POI")
It is clear that we have to deal with class imbalance: our dataset contains very few cases where POI is true.
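To quantify the imbalance, a one-liner prints the class proportions:
# the share of non-POIs vs POIs in the dataset
print(df.poi.value_counts(normalize=True))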
The following function is the main univariate plotting function
'''
A function to plot univariate statistics and the distribution of a feature
'''
def univariate_plot(feature, feature_str, num_of_bins=30, custom_vline_length=40):
    f, (ax1, ax2, ax3) = plt.subplots(3, 1, sharex=True)
    sns.set(style="white", context="talk")
    sns.set(style="ticks")
    sns.set(rc={'figure.figsize': (11.7, 8.27)})
    sns.set(font_scale=1.3)
    sns.distplot(df[feature], bins=num_of_bins, color="green", kde=False, ax=ax1)
    ax1.set(ylabel="count", xlabel=feature, title="'" + feature_str + "' Univariate Plots")
    # vertical lines marking the central tendency and quartile statistics
    ax1.vlines(x=np.median(df[feature]), ymin=0, ymax=custom_vline_length, color="orange")
    ax1.vlines(x=float(stats.mode(df[feature], nan_policy='omit')[0]), ymin=0, ymax=custom_vline_length, color="purple")
    ax1.vlines(x=np.round(np.mean(df[feature]), 3), ymin=0, ymax=custom_vline_length, color="red")
    ax1.vlines(x=np.percentile(df[feature], 25), ymin=0, ymax=custom_vline_length, linestyles="dotted")
    ax1.vlines(x=np.percentile(df[feature], 75), ymin=0, ymax=custom_vline_length, linestyles="dotted")
    sns.boxplot(x=feature, data=df.dropna(subset=[feature]), ax=ax2)
    sns.swarmplot(x=feature, data=df.dropna(subset=[feature]), color=".35", ax=ax3)
    print("min:", np.min(df[feature]))
    print("25th percentile:", np.percentile(df[feature], 25))
    print("median (orange color):", np.median(df.dropna(subset=[feature])[feature]))
    print("mode (purple color):", float(stats.mode(df[feature], nan_policy="omit")[0]))
    print("mean (red color):", np.round(np.mean(df[feature]), 3))
    print("std deviation:", np.round(np.std(df[feature]), 3))
    print("75th percentile:", np.percentile(df[feature], 75))
    print("IQR range:", np.percentile(df[feature], 75) - np.percentile(df[feature], 25))
    print("max:", np.max(df[feature]))
    print("skewness:", stats.skew(df[feature]))
    print("kurtosis:", stats.kurtosis(df[feature]))
'''
A function to plot univariate statistics and the distribution of a feature which is log transformed
'''
def log_transformed_univariate_plot(feature, feature_str):
    f, (ax1, ax2, ax3) = plt.subplots(3, 1, sharex=True)
    sns.set(style="white", context="talk")
    sns.set(style="ticks")
    sns.set(rc={'figure.figsize': (11.7, 8.27)})
    sns.set(font_scale=1.3)
    sns.distplot(df.dropna(subset=[feature])[feature], color="green", kde=False, ax=ax1).set_title(feature_str + " distribution")
    ax1.set(ylabel="count")
    sns.boxplot(x=feature, data=df.dropna(subset=[feature]), ax=ax2).set_title(feature_str + " Boxplot")
    sns.swarmplot(x=feature, data=df.dropna(subset=[feature]), color=".35", ax=ax3)
    # display all three subplots on a logarithmic x-axis
    ax1.set(xscale="log")
    ax2.set(xscale="log")
    ax3.set(xscale="log")
univariate_plot(feature = "salary", feature_str = "Salary", num_of_bins = 50, custom_vline_length = 65)
The "salary" feature distribution is very thin, it has a leptokurtic distribution and 3 outliers.
univariate_plot("bonus", "Bonus", 30, 80)
The "Bonus" feature has a right-skewed distribution (skewness>3) and is very leptokurtic (kurtosis=14). Moreover it exhibits many outliers based on the boxplot.
univariate_plot("long_term_incentive", "Long Term Incentive", 30, 90)
The "Long Term Incentive" feature has a right-skewed distribution (skewness>3) and is very leptokurtic (kurtosis=19). Moreover it exhibits many outliers based on the boxplot.
univariate_plot("deferred_income", "Deferred Income", 30, 120)
The "Long Term Incentive" feature has a left-skewed distribution (skewness<0) and is very leptokurtic (kurtosis=17). Moreover it exhibits many outliers based on the boxplot.
univariate_plot("deferral_payments", "Deferral Payments", 30, 120)
The "Deferral Payments" feature has a right-skewed distribution (skewness>3) and is very leptokurtic (kurtosis=33). Moreover it presents many outliers based on the boxplot.
univariate_plot("loan_advances", "Loan Advances", 10, 150)
The "Loan Advances" feature is a very small distribution, however due to its outliers, it has a right-skewed distribution (skewness>3) and is very leptokurtic (kurtosis=138).
univariate_plot("other", "Other", 30, 120)
The "Other" financial feature has a right-skewed distribution (skewness=6) and is very leptokurtic (kurtosis=52). Moreover it presents many outliers based on the boxplot.
univariate_plot("expenses", "Expenses", 30, 80)
The "Expenses" financial feature has a right-skewed distribution (skewness=1.7) and is leptokurtic (kurtosis=3.7). Moreover it presents many outliers based on the boxplot. Furthermore, it has 4 outliers.
univariate_plot("director_fees", "Director Fees", 30, 150)
The "Director Fees" financial feature has a right-skewed distribution (skewness=2.9) and is very leptokurtic (kurtosis=7.15). In addition it presents many outliers based on the boxplot.
univariate_plot("total_payments", "Total Payments", 30, 120)
The "Total Payments" financial feature is a compact / stacked distribution, it has a right-skewed distribution (skewness=10) and is very leptokurtic (kurtosis=118). In addition it presents some outliers based on the boxplot.
univariate_plot("exercised_stock_options", "Exercised Stock Options", 30, 80)
The "Exercised Stock Options" financial feature has a right-skewed distribution (skewness=4.5) and is very leptokurtic (kurtosis=23.5). In addition it presents many outliers based on the boxplot.
univariate_plot("restricted_stock", "Restricted Stock", 20, 55)
The "Restricted Stock" financial feature has a right-skewed distribution (skewness=4.79) and is very leptokurtic (kurtosis=23.5). In addition it presents many outliers based on the boxplot.
univariate_plot("restricted_stock_deferred", "Restricted Stock Deferred", 20, 150)
The "Restricted Stock Deferred" financial feature has a right-skewed distribution (skewness=11.75) and is very leptokurtic (kurtosis=134). In addition it presents 2 outliers based on the boxplot.
univariate_plot("total_stock_value", "Total Stock Value", 10, 70)
The "Total Stock Value" financial feature has a right-skewed distribution (skewness=11.75) and is very leptokurtic (kurtosis=134). In addition it presents 2 outliers based on the boxplot.
univariate_plot("from_poi_to_this_person", "From Poi To This Person", 35, 75)
The "From Poi To This Person" email feature has a right-skewed distribution (skewness=3.49) and is very leptokurtic (kurtosis=16.23). Moreover it presents many outliers based on the boxplot.
univariate_plot("from_messages", "From Messages", 25, 150)
The "From Messages" email feature has a right-skewed distribution (skewness=7.23) and is very leptokurtic (kurtosis=61.86). Moreover it presents many outliers based on the boxplot.
univariate_plot("from_this_person_to_poi", "From this person to poi", 35, 115)
The "From this person to poi" email feature has a right-skewed distribution (skewness=5.089) and is very leptokurtic (kurtosis=27). Moreover it presents many outliers based on the boxplot.
univariate_plot("to_messages", "To messages", 40, 70)
The "To messages" email feature has a right-skewed distribution (skewness=3.8) and is very leptokurtic (kurtosis=18.26). Moreover it presents many outliers based on the boxplot.
univariate_plot("shared_receipt_with_poi", "Shared Receipt With Poi", 40, 70)
The "Shared Receipt With Poi" email feature has a right-skewed distribution (skewness=2.07) and is leptokurtic (kurtosis=4.08). Moreover it exhibits many outliers based on the boxplot.
Overall, nearly all the feature distributions are right-skewed and leptokurtic; the only exception is "deferred_income", which is left-skewed.
In this chapter we will investigate the relationship between pairs of features.
Let's investigate the correlation between features in order to find patterns in their relationships. I use the Spearman rather than the Pearson correlation coefficient because I want to evaluate the monotonic relationship between two continuous or ordinal variables. In a monotonic relationship, the variables tend to change together, but not necessarily at a constant rate; the Spearman correlation coefficient is based on the ranked values of each variable rather than the raw data. The Pearson correlation, in contrast, evaluates the linear relationship between two continuous variables, where a change in one variable is associated with a proportional change in the other. Since we can build machine learning models that work with non-linear relationships in the data, I will use the Spearman correlation coefficient to visualize the monotonic relationship between two features.
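To make the difference concrete, here is a tiny illustration with toy numbers (not project data): y = x**3 is perfectly monotonic but not linear, so Spearman is exactly 1.0 while Pearson falls short of it.
# toy demonstration of Spearman vs Pearson on a monotonic, non-linear relation
x = np.arange(1, 11)
y = x ** 3
print("pearson: ", stats.pearsonr(x, y)[0])   # < 1, because the relation is not linear
print("spearman:", stats.spearmanr(x, y)[0])  # exactly 1, because the relation is monotonic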
''' list of features that will be used for correlation:
Salary
Bonus
Long Term Incentive
Deferred Income
Deferral Payments
Loan Advances
Other
Expenses
Director Fees
Total Payments
Exercised Stock Options
Restricted Stock
Restricted Stock Deferred
Total Stock Value
from_poi_to_this_person
from_messages
from_this_person_to_poi
shared_receipt_with_poi
to_messages
'''
sns.set(rc={'figure.figsize':(20.7,18.27)})
sns.set(font_scale=1.5)
sns.heatmap(df[["poi", "salary", "bonus", "long_term_incentive", "deferred_income", "deferral_payments", "loan_advances", "other", "expenses", "director_fees",
"total_payments", "exercised_stock_options", "restricted_stock", "restricted_stock_deferred", "total_stock_value",
"from_poi_to_this_person", "from_messages", "from_this_person_to_poi", "shared_receipt_with_poi", "to_messages"]].corr(method="spearman"),
cmap="RdYlBu", annot=True, fmt=".2f").set_title("Pearson Correlation Heatmap")
plt.show()
#sns.set(font_scale=1)
The above correlation heatmap was designed to identify whether there are any strong monotonic relationships between the features of the dataset. Where the values are high, we can probably choose to keep just one of those variables in the model-building process. It seems that some features have a Spearman correlation of more than 0.7 with each other. The above plots helped us identify the important individual variables that are correlated with the target. However, we generally build many non-linear models via sklearn, so let us build some non-linear models and get variable importance from them.
### Head and tail of the top positive and negative correlations between features
corrs = df[["poi", "salary", "bonus", "long_term_incentive", "deferred_income", "deferral_payments", "loan_advances", "other", "expenses", "director_fees",
"total_payments", "exercised_stock_options", "restricted_stock", "restricted_stock_deferred", "total_stock_value",
"from_poi_to_this_person", "from_messages", "from_this_person_to_poi", "shared_receipt_with_poi", "to_messages"]].corr(method="spearman").unstack().sort_values().drop_duplicates()
print(corrs.tail(11))
print(corrs.head(10))
'''
The main bivariate function to plot either the boxplot or regplot relationship between 2 features
'''
def bivariate_plot(feature1, feature2, type_of_plot, feature1_str, feature2_str):
    sns.set(rc={'figure.figsize': (11.7, 9.27)})
    sns.set(font_scale=1.4)
    f, ax = plt.subplots()
    if type_of_plot == "regplot":
        sns.regplot(x=feature1, y=feature2, data=df, ax=ax).set_title(feature1_str + " vs " + feature2_str + " Scatterplot")
    elif type_of_plot == "boxplot":
        sns.boxplot(y=feature1, x=feature2, data=df, orient="h", ax=ax).set_title(feature1_str + " vs " + feature2_str + " Boxplot")
        sns.swarmplot(y=feature1, x=feature2, data=df, color=".35", orient="h", ax=ax)
    else:
        print("please specify the type_of_plot parameter as either regplot or boxplot")
    f.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=None)
bivariate_plot("poi", "salary", "boxplot", "POI", "Salary")
Based on the plot above, non-POI individuals have lower salaries than POIs.
bivariate_plot("poi", "bonus", "boxplot", "POI", "Bonus")
As with salaries, employees who are POIs have higher bonuses than those who are not.
bivariate_plot("poi", "total_stock_value", "boxplot", "POI", "Total Stock Value")
Same as before, those who are POIs have a larger Total Stock Value than those who are not.
bivariate_plot("poi", "exercised_stock_options", "boxplot", "POI", "Exercised Stock Options")
Again, those who are POIs have larger Exercised Stock Options than those who are not.
bivariate_plot("poi", "shared_receipt_with_poi", "boxplot", "POI", "Shared Receipt with Poi")
Same as before, those who are POIs (poi=True) have higher Shared Receipt with POI values than those who are not.
bivariate_plot("poi", "deferred_income", "boxplot", "POI", "Deferred Income")
The Deferred Income boxplots exhibit different distributions with respect to POI status: those who are POIs have a wider distribution than those who are not, although both distributions show many outliers. We can say that there is a difference in Deferred Income between POIs and non-POIs.
bivariate_plot("exercised_stock_options", "total_stock_value", "regplot", "Exercised Stock Options", "Total Stock Value")
Both features show a positive correlation between them.
bivariate_plot("bonus", "salary", "regplot", "Bonus", "Salary")
Same as before, both features show a positive correlation between them.
bivariate_plot("to_messages", "shared_receipt_with_poi", "regplot", "To messages", "Shared Receipt with POI")
Same as before, both features show a positive correlation between them.
bivariate_plot("shared_receipt_with_poi", "from_poi_to_this_person", "regplot", "Shared receipt with POI", "From POI to this Person")
Likewise, both features show a positive correlation between them.
bivariate_plot("bonus", "total_payments", "regplot", "Bonus", "Total Payments")
Again, both features show a positive correlation between them.
bivariate_plot("total_payments", "salary", "regplot", "Total Payments", "Loan Advances")
Same as before, both features show a positive correlation between them.
bivariate_plot("from_messages", "to_messages", "regplot", "From Messages", "To messages")
Likewise, both features show a positive correlation between them.
bivariate_plot("other", "total_payments", "regplot", "Other", "Total Payments")
Same as before, both features show a positive correlation between them.
bivariate_plot("to_messages", "from_poi_to_this_person", "regplot", "To Messages", "From Poi To This Person")
Again, both features show a positive correlation between them.
bivariate_plot("director_fees", "total_stock_value", "regplot", "Director Fees", "Total Stock Value")
Let's turn to the negative correlations: both features show a negative correlation between them.
bivariate_plot("director_fees", "restricted_stock", "regplot", "Director Fees", "Restricted Stock")
Again, both features show a negative correlation between them.
bivariate_plot("director_fees", "salary", "regplot", "Director Fees", "Salary")
Again, both features show a negative correlation between them.
bivariate_plot("director_fees", "bonus", "regplot", "Director Fees", "Bonus")
Same as before, both features show a negative correlation between them.
bivariate_plot("other", "director_fees", "regplot", "Other", "Director Fees")
Likewise, both features show a negative correlation between them.
bivariate_plot("long_term_incentive", "director_fees", "regplot", "Long Term Incentive", "Director Fees")
Again, both features show a negative correlation between them.
bivariate_plot("exercised_stock_options", "director_fees", "regplot", "Exercised Stock Options", "Director Fees")
Same as before, both features show a negative correlation between them.
bivariate_plot("total_payments", "director_fees", "regplot", "Total Payments", "Director Fees")
Once more, both features show a negative correlation between them; although it is not apparent in the plot above, based on the Spearman correlation there is a negative relation between them.
bivariate_plot("bonus", "deferred_income", "regplot", "Bonus", "Deferred Income")
Once more, both features show a negative correlation between them.
In this chapter the top 10 positive and negative correlations will be redrawn, but now with an extra feature added: POI, our feature of interest / target variable, in order to find out whether there are any special patterns. This is a great opportunity to depict not only the relations between two features but also how the POI target variable reacts to those relations.
'''
A function to plot the scatterplot relationship between 2 features, colored by the POI target variable
'''
def multivariate_plot(feature1, feature2, feature1_str, feature2_str):
    sns.set(rc={'figure.figsize': (11.7, 9.27)})
    sns.set(font_scale=1.4)
    f, ax = plt.subplots()
    sns.scatterplot(x=feature1, y=feature2, hue="poi", size="poi", sizes=(200, 200), data=df, ax=ax).set_title("Multivariate Plot " + feature1_str + " and " + feature2_str + " in respect of POI")
multivariate_plot("exercised_stock_options", "total_stock_value", "Exercised Stock Options", "Total Stock Value")
Here it seems that as both Exercised Stock Options and Total Stock Value increase, the largest points belong to POI individuals.
multivariate_plot("bonus", "salary", "Bonus", "Salary")
Here there seems to be a pattern: as both Bonus and Salary increase, the largest points belong to POI individuals.
multivariate_plot("to_messages", "shared_receipt_with_poi", "To messages", "Shared Receipt with POI")
I cannot say with great confidence that there is a pattern here, but it seems that as To Messages and Shared Receipt with POI increase, the largest points belong to POI individuals.
multivariate_plot("shared_receipt_with_poi", "from_poi_to_this_person", "Shared receipt with POI", "From POI to this Person")
Once more, I cannot say with great confidence that there is a pattern here, but it seems that as Shared Receipt With POI and "From POI to this Person" increase, the largest points belong to POI individuals.
multivariate_plot("bonus", "total_payments", "Bonus", "Total Payments")
Once again, I cannot safely say that there is a pattern here, but it seems that as Bonus and Total Payments increase, the largest points belong to POI individuals.
multivariate_plot("total_payments", "salary", "Total Payments", "Loan Advances")
Once again, I cannot safely say that there is a pattern here; generally speaking, it seems that with constant Total Payments, as the salary increases there is a shift from non-POIs to POIs.
multivariate_plot("from_messages", "to_messages", "From Messages", "To messages")
Once again, I cannot safely say that there is a pattern here; generally speaking, it seems that with constant or minor increases in the "From Messages" feature, as the "To Messages" feature increases there is a shift from non-POIs to POIs.
multivariate_plot("other", "total_payments", "Other", "Total Payments")
Here we see a pattern: as the "Other" and "Total Payments" features increase, there is a shift from non-POIs to POIs.
multivariate_plot("to_messages", "from_poi_to_this_person", "To Messages", "From Poi To This Person")
Here I am not confident enough to conclude whether there is a pattern or not, though it appears that as the "To Messages" and "From POI to this Person" features increase, there is a shift from non-POIs to POIs.
multivariate_plot("director_fees", "total_stock_value", "Director Fees", "Total Stock Value")
Here I am not confident enough to conclude whether there is a pattern or not. I may say that as the "Director Fees" feature remains the same and the "Total Stock Value" feature increases, there is a shift from non-POIs to POIs.
multivariate_plot("director_fees", "restricted_stock", "Director Fees", "Restricted Stock")
Here I am not confident enough to conclude whether there is a pattern or not. I may say that as the "Director Fees" feature remains the same and the "Restricted Stock" feature increases, there is a shift from non-POIs to POIs.
multivariate_plot("director_fees", "salary", "Director Fees", "Salary")
Here I am not confident enough to conclude whether there is a pattern or not. I may say that as the "Director Fees" feature remains the same and the "Salary" feature increases, there is a shift from non-POIs to POIs.
multivariate_plot("director_fees", "bonus", "Director Fees", "Bonus")
Here I am not confident enough to conclude whether there is a pattern or not. I may say that as the "Director Fees" feature remains the same and the "Bonus" feature increases, there is a shift from non-POIs to POIs.
multivariate_plot("other", "director_fees", "Other", "Director Fees")
Here I am not confident enough to conclude whether there is a pattern or not. I may say that as the "Director Fees" feature remains the same and the "Other" feature increases, there is a shift from non-POIs to POIs.
multivariate_plot("long_term_incentive", "director_fees", "Long Term Incentive", "Director Fees")
Here I am not confident enough to conclude whether there is a pattern or not. I may say that as the "Director Fees" feature remains the same and the "Long Term Incentive" feature increases, there is a shift from non-POIs to POIs.
multivariate_plot("exercised_stock_options", "director_fees", "Exercised Stock Options", "Director Fees")
Based on the plot above, I am not sure enough to conclude whether there is a pattern or not. I can say that as the "Director Fees" feature remains the same and close to 0, and the "Exercised Stock Options" feature increases, there is a shift from non-POI to POI values.
multivariate_plot("total_payments", "director_fees", "Total Payments", "Director Fees")
Based on the plot above, I am not confident enough to conclude whether there is a pattern or not. I may say that as the "Director Fees" feature remains the same and the "Total Payments" feature increases, there is a shift from non-POIs to POIs.
multivariate_plot("bonus", "deferred_income", "Bonus", "Deferred Income")
Here we have an interesting pattern: as the "Bonus" feature increases and the "Deferred Income" decreases, there is a change in the POI feature from non-POIs to POIs.
I will start looking for outlier data points in all the features. I have to be careful in terms of removing the outliers because the dataset is rather small for machine learning in the first place. Moreover, the outliers might actually be important as they could represent patterns in the data that would aid in the identification of a POI.
The definition of a mild outlier is a value either below the first quartile minus 1.5 times the interquartile range (IQR), or above the third quartile plus 1.5 times the IQR:
low outlier < Q1 − 1.5 × IQR
high outlier > Q3 + 1.5 × IQR
My approach will be to count the number of outlying features for each individual. My personal threshold is individuals who have outliers in more than 5 features in the dataset. I will then investigate the persons with the highest number of outliers to determine if they need to be removed.
df_without_outliers = df[["salary", "bonus", "long_term_incentive", "deferred_income", "deferral_payments", "loan_advances", "other", "expenses", "director_fees",
"total_payments", "exercised_stock_options", "restricted_stock", "restricted_stock_deferred", "total_stock_value",
"from_poi_to_this_person", "from_messages", "from_this_person_to_poi", "shared_receipt_with_poi", "to_messages"]]
IQR = df_without_outliers.quantile(q=0.75) - df_without_outliers.quantile(q=0.25)
first_quartile = df_without_outliers.quantile(q=0.25)
third_quartile = df_without_outliers.quantile(q=0.75)
outliers_features = df_without_outliers[(df_without_outliers>(third_quartile + 1.5*IQR) ) | (df_without_outliers<(first_quartile - 1.5*IQR) )]
outliers_features['name'] = df['name']
outliers_features['poi'] = df['poi']
# find the names of people who are outliers in more than 5 features; the threshold
# below is 7 because, besides the 5 outlying features, each row also counts the two
# always-present helper columns (poi and name). A case with outliers in more than 5
# features is generally an outlier and must be investigated.
biggest_outliers = outliers_features[outliers_features.notna().sum(axis=1) > 7]
outliers_counter = biggest_outliers.count(axis=1).sort_values(ascending = False)
biggest_outliers.loc[outliers_counter.index].name
These individuals are the ones with outliers in more than 5 features of the dataset.
# generic function to inspect an outlier:
def inspect_the_outlier(outlier_str):
    # the non-NaN columns are this person's outlying features,
    # minus 2 for the 'name' and 'poi' helper columns
    print("number of features that", outlier_str, "is an outlier:",
          len(biggest_outliers.loc[biggest_outliers[biggest_outliers.name == outlier_str].notna().index].dropna(axis=1).columns) - 2)
    print()
    print(biggest_outliers.loc[biggest_outliers[biggest_outliers.name == outlier_str].notna().index].dropna(axis=1).columns)
    print()
    print(outlier_str, "outlier feature values:")
    print(biggest_outliers.loc[biggest_outliers[biggest_outliers.name == outlier_str].notna().index].dropna(axis=1))
At this point, I have to perform some research before deleting outliers. Given the small number of persons of interest in the dataset, I decided not to remove any individuals who are POIs, regardless of their number of outliers; an outlier for a POI could be a sign of fraudulent activity. I will manually examine several of the top outlying individuals to see if I can find any insights and to determine whether they should be removed.
Kenneth Lay (wiki source), the CEO of Enron from 1986–2001, took part in many of the illegal business activities and hence is one of the most vital persons of interest.
inspect_the_outlier("LAY KENNETH L")
According to this site, Mark Frevert, chairman and CEO of Enron Wholesale Services, was named vice chairman of Enron. He is not a person of interest: based on his business bio (source), I believe he did not commit fraudulent transactions. Since he is not a POI and he is an outlier, I will remove him from the dataset.
inspect_the_outlier("FREVERT MARK A")
inspect_the_outlier("SKILLING JEFFREY K")
Timothy Belden (wiki source), born 1967, is the former head of trading at Enron Energy Services. He is considered the mastermind of Enron's scheme to drive up California's energy prices, having developed many of the trading strategies that resulted in the California electricity crisis. Belden pleaded guilty to one count of conspiracy to commit wire fraud as part of a plea bargain, along with his cooperation with authorities to help convict many top Enron executives. He is a vital person of interest and thus will remain in the dataset.
inspect_the_outlier("BELDEN TIMOTHY N")
John Lavorato was a top executive in the energy-trading branch of Enron and received large bonuses to keep him from leaving Enron. As he was not a person of interest, and the large bonus he received was simply an incentive to stay at Enron Corp. (based on this article), I think it is appropriate to remove him from the dataset.
inspect_the_outlier("LAVORATO JOHN J")
inspect_the_outlier("HAEDICKE MARK E")
Richard Buy, Enron's chief risk officer, invoked the Fifth Amendment and refused to answer questions in a court of law. I believe he is not a POI; since he is an outlier, I will remove him from the dataset.
inspect_the_outlier("BUY RICHARD B")
John Baxter was a former Enron vice chairman who died of an apparent self-inflicted gunshot wound before he was able to testify against other Enron executives. I will remove him from the dataset as he is not a person of interest.
inspect_the_outlier("BAXTER JOHN C")
Lawrence Whalley was investigated thoroughly (source) but not identified as a person of interest, and therefore will be removed from the dataset.
inspect_the_outlier("WHALLEY LAWRENCE G")
Steven Kean, former chief of staff at Enron Corporation, was one of the decent employees: based on this article, after Skilling's resignation he wanted to turn the course of the company back toward its core business rather than financial dealings. Hence I believe that, as an outlier who is not a POI, he must be removed from the dataset.
inspect_the_outlier("KEAN STEVEN J")
Louise Kitchen was a young British trader spearheading Enron's entry into Europe's energy markets. She wasn't a top executive and hadn't even turned 30, but Kitchen cooked up the plan for the company's online trading operation. Although she is not a POI and she is an outlier, due to her interesting record during her career at Enron she must remain in the dataset. source
inspect_the_outlier("KITCHEN LOUISE")
A former Enron executive, David W. Delainey, was sentenced to two and a half years in prison after prosecutors urged leniency, saying he helped unravel crimes at the energy company (source). So he will stay in the dataset regardless of being an outlier.
inspect_the_outlier("DELAINEY DAVID W")
Phillip K. Allen, who worked on the gas desk on the West Coast, received huge bonuses from Enron Corp. Based on his profile, little evidence has been found to show that he is a POI; being an outlier, he must be removed from the dataset.
inspect_the_outlier("ALLEN PHILLIP K")
# Remove the outlier individuals
outlier_names = ["FREVERT MARK A", "LAVORATO JOHN J", "BUY RICHARD B",
                 "BAXTER JOHN C", "HAEDICKE MARK E", "KEAN STEVEN J",
                 "WHALLEY LAWRENCE G"]
df_without_outliers = df[~df.name.isin(outlier_names)]
# Find the number of POIs and non-POIs in the data after dropping the outliers
df_without_outliers['poi'].value_counts()
“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features) with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyperplane that best differentiates the two classes (source).
Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees source.
In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong independence assumptions between the features source
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set source.
AdaBoost, short for Adaptive Boosting, is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many other types of learning algorithms to improve performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems it can be less susceptible to the overfitting problem than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner source.
K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the early 1970s (source).
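As a minimal sketch of that idea (toy points, not project data), three neighbors vote on each new case by distance:
# tiny KNN illustration: two well-separated clusters, k=3 neighbor vote
from sklearn.neighbors import KNeighborsClassifier
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1]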
As the experiments proceed, I will narrow down the number of machine learning algorithms that are suitable for the dataset.
Precision is the number of correct positive classifications divided by the total number of positive labels assigned; in other words, it is the fraction of persons of interest predicted by the algorithm that are truly persons of interest. Mathematically, precision is defined as:
precision = true positives / (true positives + false positives)
Recall is the number of correct positive classifications divided by the number of positive instances that should have been identified; it is the fraction of the total number of persons of interest in the data that the classifier identifies. Mathematically, recall is defined as:
recall = true positives / (true positives + false negatives)
A combined measure of precision and recall is the F1 score: it is the harmonic mean of precision and recall. Mathematically, the F1 score is defined as:
F1 score = 2 × (precision × recall) / (precision + recall)
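As a sanity check tying these formulas to the sklearn functions used later in this notebook (toy labels, purely illustrative):
# 2 true positives, 1 false positive, 1 false negative
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.667
print(recall_score(y_true, y_pred))     # 2 / (2 + 1) = 0.667
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.667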
For this project, the objective was a precision and a recall both greater than 0.3. However, I believe it is possible to do much better than that with the right feature selection and algorithm tuning. For the majority of my tuning and optimization using GridSearchCV, I will use the F1 score because it takes into account both the precision and recall.
So, summarizing, the validation metrics we will track are accuracy, precision, recall, and the F1 score.
The only data preparation I will do for initial testing of the algorithms is to scale the data so that it has zero mean and unit variance. This process, often called standardization, is accomplished using the scale function from the sklearn preprocessing module. Scaling of some form (whether MinMax scaling or standardization) is usually necessary because the features in the dataset have different units. Scaling creates non-dimensional features so that those with larger units do not have an undue influence on the classifier, as would be the case if the classifier uses some sort of distance measurement (such as Euclidean distance) as a similarity metric.
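A quick check (toy numbers, not project data) that scale behaves as described, producing zero-mean, unit-variance columns:
from sklearn.preprocessing import scale
toy = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
scaled = scale(toy)
print(scaled.mean(axis=0))  # approximately [0. 0.]
print(scaled.std(axis=0))   # [1. 1.]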
I have to validate my classifiers using train/test splits. However, our dataset is very small and we face the problem of class imbalance; training and testing the classifier on the same data would only lead to overfitting, which is why validation is very important. By using StratifiedShuffleSplit from sklearn to split our data into training and testing sets, we make sure that the classes are allocated in the same ratio in the training/testing sets and that each datapoint in each class is randomly selected. Because of our small dataset, setting the number of iterations to 1000 will give us more reliable results in the end, as we will have trained and tested on almost all of our datapoints; the only downside is the run time. Running the StratifiedShuffleSplit 1000 times will produce 1000 confusion matrices. In order to draw tangible conclusions from them, I will sum the matrices and then divide each row by the total number of non-POIs and POIs from all 1000 splits, respectively.
Furthermore, due to the 1000 stratified shuffle splits, we will average the evaluation metrics (accuracy, precision, recall, and F1 score) over all splits. Hence we will produce the average accuracy, precision, recall, and F1 score and, as mentioned in the previous paragraph, the normalized average confusion matrix after 1000 stratified shuffle splits.
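The row-wise normalization can be sketched as follows (toy counts, purely illustrative): after summing the per-split confusion matrices, each row is divided by the total number of true non-POIs or POIs seen across all splits, so each row sums to 1.
# toy sketch of normalizing a summed confusion matrix row by row
cm_sum = np.array([[30.0, 2.0], [3.0, 5.0]])  # pretend sum over all splits
non_poi_total, poi_total = 32.0, 8.0          # true class totals over all splits
cm_sum[0] /= non_poi_total
cm_sum[1] /= poi_total
print(cm_sum)  # each row now sums to 1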
I would like to explain my intuition and the steps of my experiments. Previously I mentioned some machine learning terms such as hyperparameter tuning, feature engineering, and feature selection without any explanation; let me expand a bit on these concepts.
Hyperparameters are the machine learning model parameters that cannot be learned directly from the training process and need to be predefined. Initially, every machine learning method in sklearn is set up with default hyperparameters. These hyperparameters define higher-level concepts about the model such as complexity, capacity to learn, rate of convergence, penalty, etc. Optimal hyperparameters lead to better efficiency, faster convergence, and better results overall. In addition, we need to select an optimization strategy that will find the best set of hyperparameters; several strategies are used extensively today, such as Grid Search, Random Search, and Bayesian Search (source).
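As a minimal grid-search sketch (the parameter grid and the toy data here are illustrative, not the configuration tuned later in this notebook):
# exhaustive search over a small SVC hyperparameter grid, scored by F1
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
X_toy, y_toy = make_classification(n_samples=100, random_state=42)
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(random_state=42), param_grid, scoring='f1', cv=5)
grid.fit(X_toy, y_toy)
print(grid.best_params_)  # the combination with the best mean cross-validated F1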
Feature engineering is a process considered essential in machine learning: using domain knowledge of the data to create features that make machine learning algorithms perform even better. On a very basic level, feature engineering is the creation of features relevant to a specific machine learning algorithm and domain. Feature functions can be thought of as composites of variables that help quantify the relationships between inputs, variables, or values specific to a given domain (source1, source2).
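For example, here is a hedged sketch of one possible engineered feature on this dataset: the fraction of a person's sent messages that went to a POI. The name 'fraction_to_poi' is my own choice for illustration, not a feature of the original dataset:
# ratio feature: share of sent messages addressed to a POI (illustrative sketch)
fraction_to_poi = (df['from_this_person_to_poi'] /
                   df['from_messages'].replace(0, np.nan)).fillna(0)
print(fraction_to_poi.describe())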
Feature selection is useful on a variety of fronts: it is the best weapon against the Curse of Dimensionality; it can reduce overall training times; and it is a powerful defense against overfitting, increasing generalizability source.
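A minimal feature-selection sketch follows (SelectKBest with an ANOVA F-test is one of several possible approaches; toy data is used so the snippet stands alone):
# keep only the k features with the highest ANOVA F-scores against the labels
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
X_toy, y_toy = make_classification(n_samples=100, n_features=10, random_state=42)
X_reduced = SelectKBest(f_classif, k=5).fit_transform(X_toy, y_toy)
print(X_reduced.shape)  # (100, 5): only the 5 highest-scoring features remain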
Last but not least is the random seed, which is required by almost all ML models and by the stratified shuffle split; I will use the value 42 as the random seed for all my experiments.
from sklearn.preprocessing import scale
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
import time
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
df_without_outliers_names_and_email_addresses = df_without_outliers.drop(['name', "email_address", "poi"], axis=1)
# Scale the dataset and send it back to a dictionary
scaled_df = df_without_outliers_names_and_email_addresses
scaled_df = scale(scaled_df)
scaled_df = pd.DataFrame(scaled_df)
scaled_df.columns = df_without_outliers_names_and_email_addresses.columns
#scaled_df["poi"] = df_without_outliers["poi"]
scaled_df.insert(loc=0, column='poi', value=df_without_outliers["poi"].values)
X = np.array(scaled_df.drop(["poi"], axis=1))
y = np.array(scaled_df["poi"])
## SVM evaluation
start_time = time.time()
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state = 42)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    non_poi_count = np.count_nonzero(y_test == False) * 1000.0
    poi_count = np.count_nonzero(y_test == True) * 1000.0
    # train_test_split is not used here, since it leads to imbalanced splits
    clf = SVC(random_state=42)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, pred))
    precision_scores.append(precision_score(y_test, pred))
    recall_scores.append(recall_score(y_test, pred))
    f1_scores.append(f1_score(y_test, pred))
    confusion_matrices += confusion_matrix(y_test, pred)
print("For SVM the accuracy score is", np.mean(accuracy_scores))
print("For SVM the precision score is", np.mean(precision_scores))
print("For SVM the recall score is", np.mean(recall_scores))
print("For SVM the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
print("\n elapsed time", time.time() - start_time, "seconds")
The SVM classifier with default parameters and the initial default features is not a good fit for modeling this dataset to predict POIs. Its precision is very low, almost zero: although it can accurately predict the non-POIs, it performs poorly at predicting POIs.
## DecisionTreeClassifier validation
start_time = time.time()
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    non_poi_count = np.count_nonzero(y_test == False) * 1000.0
    poi_count = np.count_nonzero(y_test == True) * 1000.0
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, pred))
    precision_scores.append(precision_score(y_test, pred))
    recall_scores.append(recall_score(y_test, pred))
    f1_scores.append(f1_score(y_test, pred))
    confusion_matrices += confusion_matrix(y_test, pred)
print("For DecisionTreeClassifier the accuracy score is", np.mean(accuracy_scores))
print("For DecisionTreeClassifier the precision score is", np.mean(precision_scores))
print("For DecisionTreeClassifier the recall score is", np.mean(recall_scores))
print("For DecisionTreeClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
print("\n elapsed time", time.time() - start_time, "seconds")
The DecisionTreeClassifier produces good results with default parameters and the initial features: both the precision and recall are above 0.4, and the accuracy is excellent.
#### GaussianNB
start_time = time.time()
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = GaussianNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For GaussianNB the accuracy score is", np.mean(accuracy_scores))
print("For GaussianNB the precision score is", np.mean(precision_scores))
print("For GaussianNB the recall score is", np.mean(recall_scores))
print("For GaussianNB the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
print("\n elapsed time", time.time() - start_time, "seconds")
Gaussian Naive Bayes shows some interesting results: its average accuracy after 1000 stratified shuffle splits is close to 0.6, and unfortunately its average precision is below 0.3. I believe Gaussian Naive Bayes performs poorly here because of its naive independence assumption and because our data are continuous. I will revisit this model once I introduce engineered features.
#### Random Forest
start_time = time.time()
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For RandomForestClassifier the accuracy score is", np.mean(accuracy_scores))
print("For RandomForestClassifier the precision score is", np.mean(precision_scores))
print("For RandomForestClassifier the recall score is", np.mean(recall_scores))
print("For RandomForestClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
print("\n elapsed time", time.time() - start_time, "seconds")
RandomForestClassifier is a meta-classifier, an ensemble of multiple decision trees. Here it achieves great average accuracy and average precision, but the average recall does not meet our standard, falling just below 0.3.
#### Adaboost
start_time = time.time()
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = AdaBoostClassifier(random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For AdaBoostClassifier the accuracy score is", np.mean(accuracy_scores))
print("For AdaBoostClassifier the precision score is", np.mean(precision_scores))
print("For AdaBoostClassifier the recall score is", np.mean(recall_scores))
print("For AdaBoostClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
print("\n elapsed time", time.time() - start_time, "seconds")
AdaBoost is another ensemble technique: it combines many weak learners, typically shallow decision trees, in a stage-wise fashion. At each stage the training samples are reweighted so that the next weak learner concentrates on the examples the previous ones misclassified. Here, the average measures I care about, the average precision and recall, are both above my expectation of 0.3. I have high hopes for AdaBoost and will definitely use it in future experiments!
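To make the stage-wise behaviour concrete, here is a small sketch (assuming X_train, X_test, y_train, y_test from any single split above) that tracks the F1 score after each boosting stage via staged_predict:
# sketch only: watch AdaBoost improve stage by stage on one train/test split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
boost = AdaBoostClassifier(n_estimators=50, random_state=42)
boost.fit(X_train, y_train)
# staged_predict yields predictions after 1, 2, ..., n_estimators stages
stage_f1 = [f1_score(y_test, pred) for pred in boost.staged_predict(X_test)]
print("F1 after first stage:", stage_f1[0])
print("Best F1 over stages:", max(stage_f1), "at stage", stage_f1.index(max(stage_f1)) + 1)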
#### kNN
start_time = time.time()
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For KNeighborsClassifier the accuracy score is", np.mean(accuracy_scores))
print("For KNeighborsClassifier the precision score is", np.mean(precision_scores))
print("For KNeighborsClassifier the recall score is", np.mean(recall_scores))
print("For KNeighborsClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
print("\n elapsed time", time.time() - start_time, "seconds")
The k-nearest neighbors classifier with default parameters and the initial features shows poor results: based on the normalized confusion matrix it misclassifies almost all the POI cases as non-POI, and its average recall and precision are far below the 0.3 threshold.
Let's gather our results from the average metrics for the 6 ML models built so far, to get a clear picture of their performance.
Evaluation Metric | SVM | Decision Tree Classifier | Gaussian Naive Bayes | Random Forest | Adaboost | k-NN |
---|---|---|---|---|---|---|
Accuracy | 0.8558 | 0.837 | 0.606 | 0.871 | 0.873 | 0.840 |
Precision | 0.0317 | 0.434 | 0.265 | 0.604 | 0.588 | 0.123 |
Recall | 0.0055 | 0.400 | 0.657 | 0.280 | 0.447 | 0.037 |
F1 measure | 0.0093 | 0.396 | 0.355 | 0.358 | 0.484 | 0.054 |
Based on their average F1, accuracy, precision and recall, the best-performing classifiers are clearly the Decision Tree Classifier and AdaBoost. I have high expectations for them and will definitely use them in future experiments. Moreover, I will keep the remaining models in my experiments to see whether their performance metrics improve.
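As a side note, such a summary table can be assembled programmatically. A minimal sketch, with the means from this section hard-coded into a hypothetical results dict:
# sketch only: the `results` dict is hypothetical; the means are the ones reported above
import pandas as pd
results = {
    "SVM":                  {"Accuracy": 0.8558, "Precision": 0.0317, "Recall": 0.0055, "F1": 0.0093},
    "Decision Tree":        {"Accuracy": 0.837,  "Precision": 0.434,  "Recall": 0.400,  "F1": 0.396},
    "Gaussian Naive Bayes": {"Accuracy": 0.606,  "Precision": 0.265,  "Recall": 0.657,  "F1": 0.355},
    "Random Forest":        {"Accuracy": 0.871,  "Precision": 0.604,  "Recall": 0.280,  "F1": 0.358},
    "AdaBoost":             {"Accuracy": 0.873,  "Precision": 0.588,  "Recall": 0.447,  "F1": 0.484},
    "k-NN":                 {"Accuracy": 0.840,  "Precision": 0.123,  "Recall": 0.037,  "F1": 0.054},
}
print(pd.DataFrame(results).T)  # models as rows, metrics as columns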
Except for GaussianNB, which has essentially no hyperparameters worth tuning, all the other machine learning models can be tuned to improve their performance. For tuning I will use GridSearchCV with 10 stratified shuffle splits, and for validating each tuned model the usual 1000 stratified shuffle splits.
from sklearn.model_selection import GridSearchCV
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
###### tune and validate model
## SVM tuning
param_grid = {'kernel':['rbf','linear', 'sigmoid'],
'gamma':[1e-4,1e-3,1e-2,1e-1,1,1e+1,1e+2,1e+3,1e+4],
'C':[1e-4,1e-3,1e-2,1e-1,1,1e+1,1e+2,1e+3,1e+4]}
grid = GridSearchCV(SVC(random_state=42), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## SVM validation
print("For SVM, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = SVC(C=10.0, gamma=0.001, kernel="rbf", random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For SVM accuracy score is", np.mean(accuracy_scores))
print("For SVM the precision score is", np.mean(precision_scores))
print("For SVM the recall score is", np.mean(recall_scores))
print("For SVM the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
After tuning, the SVM improved its precision past the 0.3 threshold, but its recall remains below that limit. Moreover, based on the normalized confusion matrix, it still misclassifies nearly all the POI cases as non-POI.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
## DecisionTreeClassifier tuning
param_grid = {'criterion':['gini','entropy'],
'max_depth':[2,3,4,5,6,7,8,9,10,15,20,25,30],
'min_samples_split':[2,3,4,5],
'min_samples_leaf':[1,2,3,4],
'max_features' : [None, 'sqrt', 'log2', 'auto']}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## DecisionTreeClassifier validation
print("For DecisionTreeClassifier, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, max_features='sqrt', min_samples_leaf=4, min_samples_split=2, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For DecisionTreeClassifier accuracy score is", np.mean(accuracy_scores))
print("For DecisionTreeClassifier the precision score is", np.mean(precision_scores))
print("For DecisionTreeClassifier the recall score is", np.mean(recall_scores))
print("For DecisionTreeClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
After tuning, the Decision Tree Classifier improved its precision further, reaching 0.5. The average recall dropped from 0.4 with default parameters to roughly 0.37, but it still passes my personal threshold of 0.3.
from sklearn.model_selection import GridSearchCV
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
###### validate, tune and evaluate ML models
## RandomForestClassifier tuning
param_grid = {'n_estimators':[10,20,30,40],
'max_depth':[2,3,4,5,6],
'min_samples_split':[2,3,4,5,6],
'min_samples_leaf':[1,2,3,4,5,6]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## RandomForestClassifier best params
print("For RandomForestClassifier, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = RandomForestClassifier(n_estimators=30, max_depth=5, min_samples_leaf=1, min_samples_split=2, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For RandomForestClassifier accuracy score is", np.mean(accuracy_scores))
print("For RandomForestClassifier the precision score is", np.mean(precision_scores))
print("For RandomForestClassifier the recall score is", np.mean(recall_scores))
print("For RandomForestClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
After tuning the Random Forest Classifier, the precision increased by 0.02 but the recall dropped by 0.02, and it still misclassifies 70% of the POI cases as non-POI. Overall, tuning has not produced tangible gains for the random forest.
from sklearn.model_selection import GridSearchCV
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
###### validate, tune and evaluate ML models
## AdaBoostClassifier tuning
param_grid = {'n_estimators':[20,30,40,50,60,70,80],
'learning_rate':[1.0, 0.5, 0.1, 0.05, 0.01]}
grid = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## AdaBoostClassifier validation
print("For AdaBoostClassifier, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
# AdaBoostClassifier validation with tuned parameters
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = AdaBoostClassifier(learning_rate=0.1, n_estimators=20, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For AdaBoostClassifier accuracy score is", np.mean(accuracy_scores))
print("For AdaBoostClassifier the precision score is", np.mean(precision_scores))
print("For AdaBoostClassifier the recall score is", np.mean(recall_scores))
print("For AdaBoostClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
Tuning the AdaBoost classifier brought both a positive and a negative result: the precision rose from 0.588 to 0.644, but the recall decreased from 0.447 to 0.339. Still, AdaBoost produces satisfactory results, since both metrics pass the 0.3 threshold.
from sklearn.model_selection import GridSearchCV
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
###### validate, tune and evaluate ML models
## KNeighborsClassifier tuning
param_grid = {'n_neighbors':[2,3,4,5,6,7,8,9,10,11,12,13,14],
'weights':['uniform', 'distance'],
'algorithm':['ball_tree', 'kd_tree', 'brute', 'auto']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## KNeighborsClassifier validation
print("For KNeighborsClassifier, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = KNeighborsClassifier(algorithm='ball_tree', n_neighbors=6, weights='uniform')
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For kNN accuracy score is", np.mean(accuracy_scores))
print("For kNN the precision score is", np.mean(precision_scores))
print("For kNN the recall score is", np.mean(recall_scores))
print("For kNN the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
The tuned kNN model with the initial features continues to produce poor results: over the 1000 iterations of stratified sampling it misclassifies 99% of the POIs as non-POIs.
Evaluation Metric / ML Model | SVM | Decision Tree Classifier | Random Forest | Adaboost | k-NN |
---|---|---|---|---|---|
Accuracy | 0.864 | 0.863 | 0.874 | 0.882 | 0.853 |
Precision | 0.368 | 0.500 | 0.622 | 0.642 | 0.036 |
Recall | 0.087 | 0.367 | 0.238 | 0.399 | 0.007 |
F1 measure | 0.137 | 0.386 | 0.377 | 0.464 | 0.012 |
Still, the two top models that continue to give great results are the Decision Tree Classifier and AdaBoost. Despite some of the improvements tuning produced, the rest of the models still cannot pass my personal threshold of 0.3 for both the average precision and the recall.
So far the DecisionTreeClassifier and AdaBoost show better performance than the others. We can improve on the existing performance by introducing new features. This idea stems from the fact that our dataset is very small, fewer than 200 cases, not to mention the outliers removed earlier and the class imbalance we face. By adding more features we can increase the expressiveness of our dataframe. Let's add more features with feature engineering.
So the next step is to create new features from the existing information that could possibly improve performance.
After thinking about the background of the Enron case and the information contained in the dataset, I decided on three new features to create from the email metadata.
The thinking behind these choices is that the absolute number of emails from or to a POI may matter less than that number relative to the total emails an individual sends or receives. I believe that individuals who interact more with a person of interest (as indicated by emails) are themselves more likely to be a POI.
I will also create new features from the financial data. I believe that people receiving large bonuses may be more likely to be POIs, because the bonuses could be the result of fraudulent activity: it would be easier to pass off illegal funds as a bonus than as a salary raise. The two new financial features will be the following:
# feature engineering from email data
df_without_outliers_names_and_email_addresses_and_feature_engineered = df_without_outliers_names_and_email_addresses.copy()
df_without_outliers_names_and_email_addresses_and_feature_engineered["poi_ratio"] = (df['from_poi_to_this_person'] + df['from_this_person_to_poi']) / (df['to_messages'] + df['from_messages'])
df_without_outliers_names_and_email_addresses_and_feature_engineered["fraction_to_poi"] = (df['from_this_person_to_poi']) / (df['from_messages'])
df_without_outliers_names_and_email_addresses_and_feature_engineered["fraction_from_poi"] = (df['from_poi_to_this_person']) / (df['to_messages'])
# feature engineering from financial data
df_without_outliers_names_and_email_addresses_and_feature_engineered['bonus_to_salary'] = df['bonus'] / df['salary']
df_without_outliers_names_and_email_addresses_and_feature_engineered['bonus_to_total'] = df['bonus'] / df['total_payments']
Let's view the histograms of all the newly created features.
# fill na with zero
df_without_outliers_names_and_email_addresses_and_feature_engineered.bonus_to_salary = df_without_outliers_names_and_email_addresses_and_feature_engineered.bonus_to_salary.fillna(0)
df_without_outliers_names_and_email_addresses_and_feature_engineered.bonus_to_salary.plot.hist()
# fill na with zero
df_without_outliers_names_and_email_addresses_and_feature_engineered.bonus_to_total = df_without_outliers_names_and_email_addresses_and_feature_engineered.bonus_to_total.fillna(0)
df_without_outliers_names_and_email_addresses_and_feature_engineered.bonus_to_total.plot.hist()
df_without_outliers_names_and_email_addresses_and_feature_engineered.poi_ratio.plot.hist()
df_without_outliers_names_and_email_addresses_and_feature_engineered.fraction_from_poi.plot.hist()
df_without_outliers_names_and_email_addresses_and_feature_engineered.fraction_to_poi.plot.hist()
All the distributions of the newly created features are right-skewed, so the features should be rescaled before modeling; the scaling cell below standardizes them with sklearn's scale.
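A MinMaxScaler-based alternative, as a minimal sketch (it assumes the NaNs in the engineered columns have been filled, as in the cells above):
# sketch only: min-max scaling to [0, 1] instead of standardization
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
df_eng = df_without_outliers_names_and_email_addresses_and_feature_engineered
minmax_df = pd.DataFrame(MinMaxScaler().fit_transform(df_eng),  # maps each column to [0, 1]
                         columns=df_eng.columns)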
# Scale the dataset and send it back to a dictionary
scaled_df = df_without_outliers_names_and_email_addresses_and_feature_engineered
scaled_df = scale(scaled_df)
scaled_df = pd.DataFrame(scaled_df)
scaled_df.columns = df_without_outliers_names_and_email_addresses_and_feature_engineered.columns
#scaled_df["poi"] = df_without_outliers["poi"]
scaled_df.insert(loc=0, column='poi', value=df_without_outliers["poi"].values)
X = np.array(scaled_df.drop(["poi"], axis=1))
y = np.array(scaled_df["poi"])
## SVM evaluation
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) <-- not used: a single split like this can leave the classes badly imbalanced
clf = SVC(random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For SVM the accuracy score is", np.mean(accuracy_scores))
print("For SVM the precision score is", np.mean(precision_scores))
print("For SVM the recall score is", np.mean(recall_scores))
print("For SVM the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
Even with the introduction of the newly engineered features, the SVM performs poorly.
## DecisionTreeClassifier evaluation
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) <-- not used: a single split like this can leave the classes badly imbalanced
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For DecisionTreeClassifier the accuracy score is", np.mean(accuracy_scores))
print("For DecisionTreeClassifier the precision score is", np.mean(precision_scores))
print("For DecisionTreeClassifier the recall score is", np.mean(recall_scores))
print("For DecisionTreeClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
The Decision Tree classifier continues to perform very well: the average accuracy, precision, recall and F1 metrics all pass my personal threshold of 0.3.
## GaussianNB evaluation
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) <-- not used: a single split like this can leave the classes badly imbalanced
clf = GaussianNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For GaussianNB the accuracy score is", np.mean(accuracy_scores))
print("For GaussianNB the precision score is", np.mean(precision_scores))
print("For GaussianNB the recall score is", np.mean(recall_scores))
print("For GaussianNB the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
Unfortunately, Gaussian Naive Bayes still has a low average precision score, below my personal threshold of 0.3.
## RandomForestClassifier evaluation
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) <-- not used: a single split like this can leave the classes badly imbalanced
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For RandomForestClassifier the accuracy score is", np.mean(accuracy_scores))
print("For RandomForestClassifier the precision score is", np.mean(precision_scores))
print("For RandomForestClassifier the recall score is", np.mean(recall_scores))
print("For RandomForestClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
The Random Forest Classifier has great average accuracy and precision, but its recall is still below 0.3.
## AdaBoostClassifier evaluation
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) <-- not used: a single split like this can leave the classes badly imbalanced
clf = AdaBoostClassifier(random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For AdaBoostClassifier the accuracy score is", np.mean(accuracy_scores))
print("For AdaBoostClassifier the precision score is", np.mean(precision_scores))
print("For AdaBoostClassifier the recall score is", np.mean(recall_scores))
print("For AdaBoostClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
Once again, the AdaBoost classifier continues to perform well: every average validation metric passes my personal threshold of 0.3.
## KNN evaluation
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) <-- not used: a single split like this can leave the classes badly imbalanced
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For KNeighborsClassifier the accuracy score is", np.mean(accuracy_scores))
print("For KNeighborsClassifier the precision score is", np.mean(precision_scores))
print("For KNeighborsClassifier the recall score is", np.mean(recall_scores))
print("For KNeighborsClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
kNN with the initial and engineered features continues to perform poorly: it misclassifies 88% of the POIs as non-POIs based on the normalized confusion matrix.
Evaluation Metric / ML Model | SVM | Decision Tree Classifier | Gaussian Naive Bayes | Random Forest | Adaboost | k-NN |
---|---|---|---|---|---|---|
Accuracy | 0.855 | 0.866 | 0.634 | 0.880 | 0.886 | 0.858 |
Precision | 0.001 | 0.550 | 0.275 | 0.675 | 0.652 | 0.421 |
Recall | 0.0005 | 0.529 | 0.607 | 0.283 | 0.525 | 0.115 |
F1 measure | 0.00075 | 0.516 | 0.353 | 0.374 | 0.552 | 0.174 |
There is little more to say about the SVM, Gaussian Naive Bayes, Random Forest and kNN models: despite the introduction of the new engineered features they still perform poorly. Meanwhile, both the Decision Tree Classifier and the AdaBoost Classifier have improved, and their average recall in particular has increased.
from sklearn.model_selection import GridSearchCV
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
###### tune and validate model
## SVM tuning
param_grid = {'kernel':['rbf','linear', 'sigmoid'],
'gamma':[1e-4,1e-3,1e-2,1e-1,1,1e+1,1e+2,1e+3,1e+4],
'C':[1e-4,1e-3,1e-2,1e-1,1,1e+1,1e+2,1e+3,1e+4]}
grid = GridSearchCV(SVC(random_state=42), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## SVM validation
print("For SVM, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = SVC(C=0.1, gamma=0.0001, kernel="linear", random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For SVM accuracy score is", np.mean(accuracy_scores))
print("For SVM the precision score is", np.mean(precision_scores))
print("For SVM the recall score is", np.mean(recall_scores))
print("For SVM the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
With the initial and engineered features, the tuned SVM model is more impressive: its precision has risen considerably, but its recall is still low, below my personal target threshold of 0.3.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
## DecisionTreeClassifier tuning
param_grid = {'criterion':['gini','entropy'],
'max_depth':[2,3,4,5,6,7,8,9,10,15,20,25,30],
'min_samples_split':[2,3,4,5],
'min_samples_leaf':[1,2,3,4],
'max_features' : [None, 'sqrt', 'log2', 'auto']}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## DecisionTreeClassifier validation
print("For DecisionTreeClassifier, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, max_features=None, min_samples_leaf=1, min_samples_split=2, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For DecisionTreeClassifier accuracy score is", np.mean(accuracy_scores))
print("For DecisionTreeClassifier the precision score is", np.mean(precision_scores))
print("For DecisionTreeClassifier the recall score is", np.mean(recall_scores))
print("For DecisionTreeClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
The Decision Tree Classifier continues to provide exceptional validation performance, but there is a trade-off: after tuning, the average precision increased over the untuned value while the average recall decreased correspondingly.
from sklearn.model_selection import GridSearchCV
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
###### validate, tune and evaluate ML models
## RandomForestClassifier tuning
param_grid = {'n_estimators':[10,20,30,40],
'max_depth':[2,3,4,5,6],
'min_samples_split':[2,3,4,5,6],
'min_samples_leaf':[1,2,3,4,5,6]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## RandomForestClassifier best params
print("For RandomForestClassifier, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = RandomForestClassifier(n_estimators=20, max_depth=5, min_samples_leaf=1, min_samples_split=2, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For RandomForestClassifier accuracy score is", np.mean(accuracy_scores))
print("For RandomForestClassifier the precision score is", np.mean(precision_scores))
print("For RandomForestClassifier the recall score is", np.mean(recall_scores))
print("For RandomForestClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
With the addition of the new features, the Random Forest has started to improve its ability to identify POIs: both its precision and recall increased and passed my personal threshold of 0.3.
from sklearn.model_selection import GridSearchCV
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
###### validate, tune and evaluate ML models
## AdaBoostClassifier tuning
param_grid = {'n_estimators':[20,30,40,50,60,70,80],
'learning_rate':[1.0, 0.5, 0.1, 0.05, 0.01]}
grid = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## AdaBoostClassifier validation
print("For AdaBoostClassifier, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
# AdaBoostClassifier validation with tuned parameters
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = AdaBoostClassifier(learning_rate=0.1, n_estimators=50, random_state=42)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For AdaBoostClassifier accuracy score is", np.mean(accuracy_scores))
print("For AdaBoostClassifier the precision score is", np.mean(precision_scores))
print("For AdaBoostClassifier the recall score is", np.mean(recall_scores))
print("For AdaBoostClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
Once again, AdaBoost does not surprise me. It is a very robust algorithm, and tuning increased its average precision even further.
from sklearn.model_selection import GridSearchCV
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
###### validate, tune and evaluate ML models
## KNeighborsClassifier tuning
param_grid = {'n_neighbors':[2,3,4,5,6,7,8,9,10,11,12,13,14],
'weights':['uniform', 'distance'],
'algorithm':['ball_tree', 'kd_tree', 'brute', 'auto']}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=sss)
grid.fit(X, y)
## KNeighborsClassifier validation
print("For KNeighborsClassifier, The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
clf = KNeighborsClassifier(algorithm='ball_tree', n_neighbors=2, weights='distance')
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For kNN accuracy score is", np.mean(accuracy_scores))
print("For kNN the precision score is", np.mean(precision_scores))
print("For kNN the recall score is", np.mean(recall_scores))
print("For kNN the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
kNN is the other big surprise: like the Random Forest Classifier, it benefited from the introduction of the new engineered features, and combined with hyperparameter tuning its average validation metrics now pass my personal threshold of 0.3.
Evaluation Metric / ML Model | SVM | Decision Tree Classifier | Random Forest | Adaboost | k-NN |
---|---|---|---|---|---|
Accuracy | 0.863 | 0.886 | 0.886 | 0.898 | 0.869 |
Precision | 0.551 | 0.616 | 0.720 | 0.705 | 0.590 |
Recall | 0.210 | 0.562 | 0.318 | 0.537 | 0.442 |
F1 measure | 0.286 | 0.557 | 0.416 | 0.582 | 0.482 |
Here we now have some interesting results: as usual, the Decision Tree Classifier and the AdaBoost classifier keep providing excellent performance with tuned parameters and the combined initial and engineered features, and now kNN and the Random Forest also show valuable results after tuning.
From now on I will rule out some models. I will stop working with SVM and Naive Bayes because they do not perform well regardless of the effort I put in. I also remain wary of the Random Forest Classifier and kNN: both only just cleared my personal threshold of 0.3 in some of the average validation metrics, so I will rule them out too. That leaves the Decision Tree classifier and AdaBoost, which have exceeded my expectations from the very first test.
Next I will continue with feature selection based on feature importances. By reducing the number of features I can reduce the training time of the models and may improve their performance.
There are many methods for feature selection in sklearn. One is to look at a classifier's feature importances and exclude the features whose importance falls below a chosen threshold. Another is to use SelectKBest to keep the k best features, ranked by how much of the variance between labels each feature explains.
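As a minimal sketch of the first, importance-threshold approach, sklearn offers SelectFromModel (the 0.05 threshold here is purely illustrative, not a value used elsewhere in this analysis):
# sketch only: keep features whose tree importance exceeds an illustrative threshold
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier
selector = SelectFromModel(DecisionTreeClassifier(random_state=42), threshold=0.05)
X_reduced = selector.fit_transform(X, y)  # drops the low-importance columns
print("kept", X_reduced.shape[1], "of", X.shape[1], "features")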
I will start by looking at the feature importances for both the DecisionTree and the AdaBoost classifiers.
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
pred = tree_clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For DecisionTreeClassifier accuracy score is", np.mean(accuracy_scores))
print("For DecisionTreeClassifier the precision score is", np.mean(precision_scores))
print("For DecisionTreeClassifier the recall score is", np.mean(recall_scores))
print("For DecisionTreeClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2,2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X, y):
#print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
non_poi_count = np.count_nonzero(y_test == False) * 1000.0
poi_count = np.count_nonzero(y_test == True) * 1000.0
boost_clf = AdaBoostClassifier(random_state=42)
boost_clf.fit(X_train, y_train)
pred = boost_clf.predict(X_test)
accuracy_scores.append(accuracy_score(y_test, pred))
precision_scores.append(precision_score(y_test, pred))
recall_scores.append(recall_score(y_test, pred))
f1_scores.append(f1_score(y_test, pred))
confusion_matrices += confusion_matrix(y_test, pred)
print("For AdaBoostClassifier accuracy score is", np.mean(accuracy_scores))
print("For AdaBoostClassifier the precision score is", np.mean(precision_scores))
print("For AdaBoostClassifier the recall score is", np.mean(recall_scores))
print("For AdaBoostClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
# feature Selection based on Feature Importances from DecisionTree
tree_feature_importances = (tree_clf.feature_importances_)
tree_features = zip(tree_feature_importances, df_without_outliers_names_and_email_addresses_and_feature_engineered.columns)  # X was built from the engineered dataframe, so pair importances with those columns
tree_features = sorted(tree_features, key= lambda x:x[0], reverse=True)
# Display the feature names and importance values
print('DecisionTree Feature Importances:\n')
for i in range(9):
print('{} : {:.4f}'.format(tree_features[i][1], tree_features[i][0]))
# feature Selection based on Feature Importances
boost_feature_importances = (boost_clf.feature_importances_)
boost_features = zip(boost_feature_importances, df_without_outliers_names_and_email_addresses_and_feature_engineered.columns)  # same engineered columns as for the tree above
boost_features = sorted(boost_features, key= lambda x:x[0], reverse=True)
# Display the feature names and importance values
print('Adaboost Feature Importances:\n')
for i in range(9):
print('{} : {:.4f}'.format(boost_features[i][1], boost_features[i][0]))
The feature importances from the Decision Tree Classifier and the AdaBoost Classifier do not agree with each other. To continue with feature selection I will therefore use sklearn's SelectKBest, together with GridSearchCV to find the optimal number of features for each classifier.
SelectKBest defaults to scoring features with the ANOVA F-value, a measure of the variation between sample means. It describes how much of the variance between labels is explained by a particular feature: a higher value means that the feature differs more between persons of interest and non-persons of interest. The following table summarizes the ANOVA F-values returned by SelectKBest.
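For reference, with POI and non-POI as the two groups, the one-way ANOVA F-statistic for a single feature is the ratio of the between-group to the within-group mean squares:

$$F = \frac{\sum_{g} n_g (\bar{x}_g - \bar{x})^2 / (G - 1)}{\sum_{g} \sum_{i \in g} (x_i - \bar{x}_g)^2 / (N - G)}$$

where $G$ is the number of groups (here 2), $n_g$ and $\bar{x}_g$ are the size and mean of group $g$, $\bar{x}$ is the overall mean, and $N$ is the total number of samples.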
from sklearn.feature_selection import SelectKBest
selector=SelectKBest(k=len(df_without_outliers_names_and_email_addresses_and_feature_engineered.columns))
selector.fit(X,y)
k_best_features_scores = pd.DataFrame([df_without_outliers_names_and_email_addresses_and_feature_engineered.columns, selector.scores_]).transpose()
k_best_features_scores.columns = ["feature", "score"]
k_best_features_scores = k_best_features_scores.sort_values("score", ascending=False).reset_index(drop=True)
k_best_features_scores
Based on the table above, I will now use sklearn's GridSearchCV to find the top k features per model (Decision Tree Classifier / AdaBoost). For each model, the selected number of top-k features is the one that maximizes the F1 score in GridSearchCV.
from sklearn.pipeline import Pipeline
n_features = np.arange(1, len(df_without_outliers_names_and_email_addresses_and_feature_engineered.columns))
# Create a pipeline of feature selection and classifier
pipe = Pipeline([
('select_features', SelectKBest()),
('classify', DecisionTreeClassifier(random_state=42))
])
param_grid = [
{
'select_features__k': n_features
}
]
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
# Use GridSearchCV to automate the process of finding the optimal number of features
tree_clf= GridSearchCV(pipe, param_grid=param_grid, scoring='f1', cv = sss)
tree_clf.fit(X, y)
tree_clf.best_params_
# Create a pipeline with SelectKBest fixed at the best k found above (k=10) and the classifier
pipe = Pipeline([
('select_features', SelectKBest(k=10)),
('classify', DecisionTreeClassifier(random_state=42))
])
param_grid = {
'classify__criterion' : ['gini', 'entropy'],
'classify__min_samples_split' : [2,3,4,5,6,7,8,9,10,12,14,16,18,20],
'classify__max_depth' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,20],
'classify__max_features' : ['sqrt', 'log2', 'auto']
}
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
# Use GridSearchCV to automate the process of finding the optimal number of features
tree_clf= GridSearchCV(pipe, param_grid=param_grid, scoring='f1', cv = sss)
tree_clf.fit(X, y)
tree_clf.best_params_
# Keep the top-10 features from the SelectKBest ranking
df_tree = df_without_outliers_names_and_email_addresses_and_feature_engineered[k_best_features_scores.feature[0:10]]
# Scale the dataset and send it back to a DataFrame
scaled_df_tree = pd.DataFrame(scale(df_tree))
scaled_df_tree.columns = df_tree.columns
scaled_df_tree.insert(loc=0, column='poi', value=df_without_outliers["poi"].values)
X_tree = np.array(scaled_df_tree.drop(["poi"], axis=1))
y_tree = np.array(scaled_df_tree["poi"])
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2, 2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X_tree, y_tree):
    X_train, X_test = X_tree[train_index], X_tree[test_index]
    y_train, y_test = y_tree[train_index], y_tree[test_index]
    # Stratified splits keep the class counts constant across iterations, so
    # multiplying by 1000 gives the denominators for the confusion matrix
    # accumulated over all 1000 splits
    non_poi_count = np.count_nonzero(y_test == False) * 1000.0
    poi_count = np.count_nonzero(y_test == True) * 1000.0
    tree_clf = DecisionTreeClassifier(random_state=42)
    tree_clf.fit(X_train, y_train)
    pred = tree_clf.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, pred))
    precision_scores.append(precision_score(y_test, pred))
    recall_scores.append(recall_score(y_test, pred))
    f1_scores.append(f1_score(y_test, pred))
    confusion_matrices += confusion_matrix(y_test, pred)
print("For DecisionTreeClassifier the accuracy score is", np.mean(accuracy_scores))
print("For DecisionTreeClassifier the precision score is", np.mean(precision_scores))
print("For DecisionTreeClassifier the recall score is", np.mean(recall_scores))
print("For DecisionTreeClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
# Repeat the 1000-split evaluation with the hyperparameters tuned above
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2, 2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X_tree, y_tree):
    X_train, X_test = X_tree[train_index], X_tree[test_index]
    y_train, y_test = y_tree[train_index], y_tree[test_index]
    non_poi_count = np.count_nonzero(y_test == False) * 1000.0
    poi_count = np.count_nonzero(y_test == True) * 1000.0
    tree_clf = DecisionTreeClassifier(criterion='entropy', max_depth=7, max_features='sqrt', min_samples_split=2, random_state=42)
    tree_clf.fit(X_train, y_train)
    pred = tree_clf.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, pred))
    precision_scores.append(precision_score(y_test, pred))
    recall_scores.append(recall_score(y_test, pred))
    f1_scores.append(f1_score(y_test, pred))
    confusion_matrices += confusion_matrix(y_test, pred)
print("For DecisionTreeClassifier the accuracy score is", np.mean(accuracy_scores))
print("For DecisionTreeClassifier the precision score is", np.mean(precision_scores))
print("For DecisionTreeClassifier the recall score is", np.mean(recall_scores))
print("For DecisionTreeClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
n_features = np.arange(1, len(df_without_outliers_names_and_email_addresses_and_feature_engineered.columns))
# Create a pipeline of feature selection and classifier
pipe = Pipeline([
    ('select_features', SelectKBest()),
    ('classify', AdaBoostClassifier(random_state=42))
])
param_grid = {'select_features__k': n_features}
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
# Use GridSearchCV to automate the process of finding the optimal number of features
boost_clf = GridSearchCV(pipe, param_grid=param_grid, scoring='f1', cv=sss)
boost_clf.fit(X, y)
boost_clf.best_params_
# k=12 was returned by the feature-count search above, so it is fixed in the
# pipeline while GridSearchCV tunes AdaBoost's hyperparameters
pipe = Pipeline([
    ('select_features', SelectKBest(k=12)),
    ('classify', AdaBoostClassifier(random_state=42))
])
param_grid = {
    'classify__n_estimators': [20, 30, 40, 50, 60, 70, 80, 90],
    'classify__learning_rate': [0.01, 0.05, 0.1, 0.5]
}
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
# Use GridSearchCV to tune the classifier's hyperparameters
boost_clf = GridSearchCV(pipe, param_grid=param_grid, scoring='f1', cv=sss)
boost_clf.fit(X, y)
boost_clf.best_params_
# Keep the top-12 features from the SelectKBest ranking
df_boost = df_without_outliers_names_and_email_addresses_and_feature_engineered[k_best_features_scores.feature[0:12]]
# Scale the dataset and send it back to a DataFrame
scaled_df_boost = pd.DataFrame(scale(df_boost))
scaled_df_boost.columns = df_boost.columns
scaled_df_boost.insert(loc=0, column='poi', value=df_without_outliers["poi"].values)
X_boost = np.array(scaled_df_boost.drop(["poi"], axis=1))
y_boost = np.array(scaled_df_boost["poi"])
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2, 2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X_boost, y_boost):
    X_train, X_test = X_boost[train_index], X_boost[test_index]
    y_train, y_test = y_boost[train_index], y_boost[test_index]
    non_poi_count = np.count_nonzero(y_test == False) * 1000.0
    poi_count = np.count_nonzero(y_test == True) * 1000.0
    boost_clf = AdaBoostClassifier(random_state=42)
    boost_clf.fit(X_train, y_train)
    pred = boost_clf.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, pred))
    precision_scores.append(precision_score(y_test, pred))
    recall_scores.append(recall_score(y_test, pred))
    f1_scores.append(f1_score(y_test, pred))
    confusion_matrices += confusion_matrix(y_test, pred)
print("For AdaBoostClassifier the accuracy score is", np.mean(accuracy_scores))
print("For AdaBoostClassifier the precision score is", np.mean(precision_scores))
print("For AdaBoostClassifier the recall score is", np.mean(recall_scores))
print("For AdaBoostClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
# Repeat the 1000-split evaluation with the hyperparameters tuned above
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
confusion_matrices = np.zeros((2, 2), dtype="float")
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.3, random_state=42)
for train_index, test_index in sss.split(X_boost, y_boost):
    X_train, X_test = X_boost[train_index], X_boost[test_index]
    y_train, y_test = y_boost[train_index], y_boost[test_index]
    non_poi_count = np.count_nonzero(y_test == False) * 1000.0
    poi_count = np.count_nonzero(y_test == True) * 1000.0
    boost_clf = AdaBoostClassifier(learning_rate=0.1, n_estimators=40, random_state=42)
    boost_clf.fit(X_train, y_train)
    pred = boost_clf.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, pred))
    precision_scores.append(precision_score(y_test, pred))
    recall_scores.append(recall_score(y_test, pred))
    f1_scores.append(f1_score(y_test, pred))
    confusion_matrices += confusion_matrix(y_test, pred)
print("For AdaBoostClassifier the accuracy score is", np.mean(accuracy_scores))
print("For AdaBoostClassifier the precision score is", np.mean(precision_scores))
print("For AdaBoostClassifier the recall score is", np.mean(recall_scores))
print("For AdaBoostClassifier the f1 score is", np.mean(f1_scores))
print("Normalized confusion matrix after 1000 iterations of Stratified Shuffle Split:")
confusion_matrices[0] = confusion_matrices[0] / non_poi_count
confusion_matrices[1] = confusion_matrices[1] / poi_count
print(confusion_matrices)
print("\nLast iteration Stratified Shuffle Split confusion matrix:")
skplt.metrics.plot_confusion_matrix(y_test, pred)
Top-K SelectKBest features, default parameters:

Evaluation Metric / ML Model | Decision Tree Classifier | Adaboost |
---|---|---|
Accuracy | 0.870 | 0.874 |
Precision | 0.575 | 0.602 |
Recall | 0.526 | 0.472 |
F1 measure | 0.526 | 0.502 |
Top-K SelectKBest features, tuned hyperparameters:

Evaluation Metric / ML Model | Decision Tree Classifier | Adaboost |
---|---|---|
Accuracy | 0.866 | 0.887 |
Precision | 0.566 | 0.664 |
Recall | 0.486 | 0.492 |
F1 measure | 0.497 | 0.535 |
With feature selection and default parameters, comparing the two ML models on the F1 metric (which combines precision and recall), the Decision Tree Classifier performs better than AdaBoost; with hyperparameter tuning, AdaBoost performs better. Nevertheless, the following section, which compares the validation metrics from all the experiments, shows that feature selection reduced the overall average validation metrics for both the Decision Tree and AdaBoost classifiers.
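For reference, the F1 measure used throughout is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Note that the tables report the F1 score averaged over the 1000 splits, so it cannot be recomputed exactly from the averaged precision and recall shown alongside it.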
Initial features, default parameters:

Evaluation Metric / ML Model | SVM | Decision Tree Classifier | Gaussian Naive Bayes | Random Forest | Adaboost | k-NN |
---|---|---|---|---|---|---|
Accuracy | 0.8558 | 0.837 | 0.606 | 0.871 | 0.873 | 0.840 |
Precision | 0.0317 | 0.434 | 0.265 | 0.604 | 0.588 | 0.123 |
Recall | 0.0055 | 0.400 | 0.65 | 0.657 | 0.447 | 0.037 |
F1 measure | 0.0093 | 0.396 | 0.355 | 0.358 | 0.484 | 0.054 |
Initial features, tuned hyperparameters:

Evaluation Metric / ML Model | SVM | Decision Tree Classifier | Random Forest | Adaboost | k-NN |
---|---|---|---|---|---|
Accuracy | 0.864 | 0.863 | 0.874 | 0.882 | 0.853 |
Precision | 0.368 | 0.500 | 0.622 | 0.642 | 0.036 |
Recall | 0.087 | 0.367 | 0.238 | 0.399 | 0.007 |
F1 measure | 0.137 | 0.386 | 0.377 | 0.464 | 0.012 |
Initial plus engineered features, default parameters:

Evaluation Metric / ML Model | SVM | Decision Tree Classifier | Gaussian Naive Bayes | Random Forest | Adaboost | k-NN |
---|---|---|---|---|---|---|
Accuracy | 0.854 | 0.869 | 0.622 | 0.882 | 0.889 | 0.852 |
Precision | 0.011 | 0.563 | 0.277 | 0.682 | 0.660 | 0.335 |
Recall | 0.002 | 0.543 | 0.647 | 0.304 | 0.537 | 0.093 |
F1 measure | 0.003 | 0.529 | 0.397 | 0.377 | 0.564 | 0.14 |
Initial plus engineered features, tuned hyperparameters:

Evaluation Metric / ML Model | SVM | Decision Tree Classifier | Random Forest | Adaboost | k-NN |
---|---|---|---|---|---|
Accuracy | 0.867 | 0.888 | 0.889 | 0.904 | 0.871 |
Precision | 0.605 | 0.625 | 0.751 | 0.741 | 0.634 |
Recall | 0.232 | 0.565 | 0.325 | 0.527 | 0.262 |
F1 measure | 0.315 | 0.561 | 0.428 | 0.587 | 0.352 |
Top-K SelectKBest features, default parameters:

Evaluation Metric / ML Model | Decision Tree Classifier | Adaboost |
---|---|---|
Accuracy | 0.870 | 0.874 |
Precision | 0.575 | 0.602 |
Recall | 0.526 | 0.472 |
F1 measure | 0.526 | 0.502 |
Top-K SelectKBest features, tuned hyperparameters:

Evaluation Metric / ML Model | Decision Tree Classifier | Adaboost |
---|---|---|
Accuracy | 0.866 | 0.887 |
Precision | 0.566 | 0.664 |
Recall | 0.486 | 0.492 |
F1 measure | 0.497 | 0.535 |
Comparing all the tables to reach a conclusion about which ML model best fits this dataset, given that we are dealing with class imbalance, I will go with AdaBoost, trained on the initial features plus the newly engineered ones and with hyperparameter tuning. My second choice would be the Decision Tree classifier.
Based on my experiments, our best models produced solid results, with average accuracy of 85%–90%, average precision of 65%–75%, and average recall of 45%–50%. I believe our ML models have done well considering the small amount of data and the class imbalance. Since the dataset is very small, more effort must be put into feature engineering to improve the validation metrics. Furthermore, in the future I should evaluate over a range of random seeds: I used only the single seed value 42, which risks overfitting to that particular sequence of splits.
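As a minimal sketch of that idea (assuming X_boost and y_boost as built above; the seed list is only illustrative), one could average the tuned AdaBoost's F1 over several seeds and inspect the spread:

# Sketch: evaluate across several random seeds instead of relying on seed 42 only
# (assumes X_boost / y_boost from above; the seed list is illustrative)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import f1_score
import numpy as np

seed_mean_f1 = []
for seed in [0, 7, 13, 42, 99]:
    sss = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=seed)
    fold_f1 = []
    for train_index, test_index in sss.split(X_boost, y_boost):
        clf = AdaBoostClassifier(learning_rate=0.1, n_estimators=40, random_state=seed)
        clf.fit(X_boost[train_index], y_boost[train_index])
        fold_f1.append(f1_score(y_boost[test_index], clf.predict(X_boost[test_index])))
    seed_mean_f1.append(np.mean(fold_f1))
print("Mean F1 across seeds: {:.3f} (std {:.3f})".format(np.mean(seed_mean_f1), np.std(seed_mean_f1)))

A small standard deviation across seeds would suggest the reported metrics are not an artifact of the particular seed.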
I should also mention the textual data from the emails. Since introducing engineered features increased the overall average performance, I suspect that adding textual information, using techniques such as bag of words, stemming, lemmatization, and tf-idf, may increase the ML models' validation metrics even further.
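As a hypothetical sketch of that direction (email_texts is an assumed, not-yet-built list holding one concatenated email body per person, aligned with the rows of X_boost), tf-idf features could be generated and stacked next to the financial features:

# Hypothetical sketch: vectorize email text with tf-idf and append it to the
# existing features (email_texts is an assumed list of strings, one per person)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=200, sublinear_tf=True)
X_text = vectorizer.fit_transform(email_texts).toarray()
X_combined = np.hstack([X_boost, X_text])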
As for my final selection of features and ML model: I will go with the AdaBoost classifier, which proved a robust algorithm that fit my dataset well from the beginning. For features, I will select all the initial features (after the preprocessing of their missing values) together with the newly engineered features.
Because the Udacity Data Analyst Nanodegree requires working with and submitting poi_id.py and tester.py, I will now focus on these two files.
# Preparing the my_classifier.pkl, my_dataset.pkl and my_feature_list.pkl
features_list = ['poi', 'bonus', 'total_stock_value', 'exercised_stock_options',
                 'shared_receipt_with_poi']
my_dataset = pickle.load(open("final_project_dataset.pkl", "rb"))
clf = DecisionTreeClassifier(random_state=42)

CLF_PICKLE_FILENAME = "my_classifier.pkl"
DATASET_PICKLE_FILENAME = "my_dataset.pkl"
FEATURE_LIST_FILENAME = "my_feature_list.pkl"

def dump_classifier_and_data(clf, dataset, feature_list):
    with open(CLF_PICKLE_FILENAME, 'wb') as clf_outfile:
        pickle.dump(clf, clf_outfile)
    with open(DATASET_PICKLE_FILENAME, 'wb') as dataset_outfile:
        pickle.dump(dataset, dataset_outfile)
    with open(FEATURE_LIST_FILENAME, 'wb') as featurelist_outfile:
        pickle.dump(feature_list, featurelist_outfile)

dump_classifier_and_data(clf, my_dataset, features_list)
%run tester.py
I will not continue experimenting with the various ML models here, as that was done above. Experimentation is a continuous procedure of finding the best features and the classifiers that fit this dataset well.
%run poi_id.py
%run tester.py
As mentioned above, these executions are just for the project's submission; the models' performance, along with the feature engineering, feature selection, and hyperparameter tuning, was discussed above.