White Wines Exploratory Analysis by Praxitelis-Nikolaos Kouroupetroglou

This project is an exploratory data analysis in R of the white wine dataset. It explores the relationships between the chemical features and the “Quality” rating.

The format includes Univariate, Bivariate, and Multivariate analyses with a final summary and reflection at the end. The original dataset can be found here:

This dataset contains information about Portuguese white variants of Vinho Verde wine. It includes 4898 observations of 12 features. 11 of the features are chemical variables (independent variables), and the other feature is wine quality (dependent variable), a subjective measure that is the median of the opinions of three wine experts. Specifically, the features are:

fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
citric acid: found in small quantities, citric acid can add freshness and flavor to wines
residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of wine is close to that of water depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
quality: discrete score between 0 (worst) and 10 (best).

Their corresponding units of measurement are the following.

Input variables (based on physicochemical tests):
- fixed acidity (tartaric acid - g / dm^3)
- volatile acidity (acetic acid - g / dm^3)
- citric acid (g / dm^3)
- residual sugar (g / dm^3)
- chlorides (sodium chloride - g / dm^3)
- free sulfur dioxide (mg / dm^3)
- total sulfur dioxide (mg / dm^3)
- density (g / cm^3)
- pH
- sulphates (potassium sulphate - g / dm^3)
- alcohol (% by volume)

Output variable (based on sensory data):
- quality (score between 0 and 10)

(These descriptions have been taken from the dataset’s main site)

White Wine dataset summary statistics

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

There are 4,898 observations, and each observation has 12 variables of interest. The variable X is excluded from the analysis because it is simply a row counter from 1 to 4,898. There are 11 chemical properties (e.g. fixed acidity, volatile acidity) and 1 measure of quality. The main feature in the dataset is ‘quality’, since it is the ultimate measure of each wine and the variable that one would like to predict.

It is important to note that the minimum white wine quality is 3 and the maximum is 9; no wine is rated below 3 or at 10. The median quality is 6 and the mean is 5.878.

All of the other features have a minimum value greater than 0 except for citric acid. Most pH values fall between 3 and 3.3.

Residual sugar may have an interesting distribution because its third quartile (9.9) and especially its maximum (65.8) are high relative to the mean (6.391) and median (5.2).

Univariate Plots Section

volatile.acidity Histogram, Boxplot and Summary univariate statistics

## 
## min value: 0.08 
## 25th quantile: 0.21 
## median (orange color): 0.26 
## mean (green color): 0.2782411 
## mode (pink color): 0.28 
## 75th quantile: 0.32 
## max value: 1.1 
## IQR value: 0.11 
## skewness value: 1.576014 
## kurtosis value: 5.081904

Volatile acidity is right skewed (skewness 1.58), with many outliers above about 0.45 g/dm^3. The median is 0.26 g/dm^3 and the minimum is 0.08 g/dm^3. Its kurtosis of 5.08 indicates a leptokurtic distribution, that is, one with heavier tails than a normal distribution.
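The statistics printed for each feature can be reproduced with a few lines of base R. The sketch below is only an illustration: the helper name `describe` and the toy vector are hypothetical, and it assumes the moment-based (divide by n) estimators with kurtosis reported as excess kurtosis (normal distribution = 0), which is consistent with the negative kurtosis values that appear for some features. The report's own code may use a different estimator.

```r
# Univariate summary with moment-based skewness and excess kurtosis.
# Assumption: population moments (divide by n); excess kurtosis (normal = 0).
describe <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s2 <- mean((x - m)^2)                      # second central moment
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  counts <- table(x)                         # frequencies, used for the mode
  list(
    min = min(x), q25 = q[1], median = q[2], mean = m,
    mode = as.numeric(names(counts)[which.max(counts)]),
    q75 = q[3], max = max(x), iqr = q[3] - q[1],
    skewness = mean((x - m)^3) / s2^1.5,
    kurtosis = mean((x - m)^4) / s2^2 - 3
  )
}

# Hypothetical toy values, not taken from the wine data.
toy <- c(0.20, 0.22, 0.24, 0.24, 0.26, 0.28, 0.30, 0.45, 0.90)
stats <- describe(toy)
print(round(unlist(stats), 3))
```

A positive skewness means the right tail is longer; a positive excess kurtosis means heavier tails than a normal distribution.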

citric.acid Histogram, Boxplot and Summary univariate statistics

## 
## min value: 0 
## 25th quantile: 0.27 
## median (orange color): 0.32 
## mean (green color): 0.3341915 
## mode (pink color): 0.3 
## 75th quantile: 0.39 
## IQR value: 0.12 
## max value: 1.66 
## skewness value: 1.281135 
## kurtosis value: 6.163631

Citric acid is roughly normal in shape with a right skew (skewness 1.28). The histogram shows spikes near 0.5 and 0.75 g/dm^3. There are several outliers both below and above the mean. Its kurtosis of 6.16 indicates a leptokurtic distribution.

residual.sugar Histogram, Boxplot and Summary univariate statistics

## 
## min value: 0.6 
## 25th quantile: 1.7 
## median (orange color): 5.2 
## mean (green color): 6.391415 
## mode (pink color): 1.2 
## 75th quantile: 9.9 
## IQR value: 8.2 
## max value: 65.8 
## skewness value: 1.076434 
## kurtosis value: 3.462415

Residual sugar is right skewed (skewness 1.08), with much of the data concentrated near the first quartile of 1.7 g/dm^3. There are a few outliers above 20 g/dm^3. Its kurtosis of 3.46 indicates a leptokurtic distribution. Because of the pronounced right skew, I log-transform this feature.

residual.sugar with log transform Histogram, Boxplot and Summary univariate statistics

## 
## min value: -0.5108256 
## 25th quantile: 0.5306283 
## median (orange color): 1.648659 
## mean (green color): 1.480928 
## mode (pink color): 0.1823216 
## 75th quantile: 2.292535 
## IQR value: 1.761907 
## max value: 4.18662 
## skewness value: -0.1610582 
## kurtosis value: -1.352864

The above histogram shows that a log transform of residual sugar results in an almost bimodal distribution, with peaks around 0.5 and 2 on the natural-log scale (roughly 1.6 and 7.4 g/dm^3).
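As a quick check on the transformed summary, the natural logarithm of the residual sugar values reported earlier reproduces the log-scale statistics printed above. The sketch below uses only numbers from this report; the variable names are illustrative.

```r
# Natural-log transform of the residual sugar summary values reported above.
residual_sugar <- c(min = 0.6, q25 = 1.7, median = 5.2, mode = 1.2,
                    q75 = 9.9, max = 65.8)
log_sugar <- log(residual_sugar)   # log() is the natural logarithm in R
print(round(log_sugar, 4))

# Back-transform positions on the log axis to g/dm^3.
peaks_log <- c(0.5, 2)
peaks_original <- exp(peaks_log)
print(round(peaks_original, 2))
```

Because the logarithm is monotone, the minimum, median and maximum map directly between the two scales, which makes the log-scale histogram easier to read in original units.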

chlorides Histogram, Boxplot and Summary univariate statistics

## 
## min value: 0.009 
## 25th quantile: 0.036 
## median (orange color): 0.043 
## mean (green color): 0.04577236 
## mode (pink color): 0.044 
## 75th quantile: 0.05 
## IQR value: 0.014 
## max value: 0.346 
## skewness value: 5.020254 
## kurtosis value: 37.50849

Chlorides is strongly right skewed (skewness 5.02), with many outliers above the third quartile of 0.05 g/dm^3, up to a maximum of 0.346 g/dm^3. Its kurtosis of 37.5 indicates a very leptokurtic distribution. Because of this heavy right tail, I log-transform the feature.

chlorides with log transform Histogram, Boxplot and Summary univariate statistics

## 
## min value: -4.710531 
## 25th quantile: -3.324236 
## median (orange color): -3.146555 
## mean (green color): -3.149011 
## mode (pink color): -3.123566 
## 75th quantile: -2.995732 
## IQR value: 0.3285041 
## max value: -1.061317 
## skewness value: 1.133439 
## kurtosis value: 5.289989

The log transformation brings the distribution much closer to normal: skewness falls from 5.02 to 1.13 and kurtosis from 37.5 to 5.29. However, the distribution is still right skewed and the outliers remain.

free.sulfur.dioxide Histogram, Boxplot and Summary univariate statistics

## 
## min value: 2 
## 25th quantile: 23 
## median (orange color): 34 
## mean (green color): 35.30808 
## mode (pink color): 29 
## 75th quantile: 46 
## IQR value: 23 
## max value: 289 
## skewness value: 1.405883 
## kurtosis value: 11.44751

Free sulfur dioxide is close to normal but right skewed (skewness 1.41), with a few outliers above about 75 mg/dm^3. The median is 34 mg/dm^3 and the maximum is 289 mg/dm^3. Its kurtosis of 11.45 indicates a leptokurtic distribution.
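The outlier thresholds mentioned in these descriptions can be compared with the conventional boxplot fences. The sketch below computes them from the quartiles reported above; it assumes the boxplots use the usual 1.5 * IQR whisker rule, which this report does not state explicitly.

```r
# Boxplot fences (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR) from the reported quartiles.
# Assumption: the conventional 1.5 * IQR rule is what the boxplots use.
quartiles <- data.frame(
  feature = c("volatile.acidity", "citric.acid", "residual.sugar",
              "chlorides", "free.sulfur.dioxide"),
  q25 = c(0.21, 0.27, 1.7, 0.036, 23),
  q75 = c(0.32, 0.39, 9.9, 0.050, 46)
)
quartiles$iqr <- quartiles$q75 - quartiles$q25
quartiles$lower_fence <- quartiles$q25 - 1.5 * quartiles$iqr
quartiles$upper_fence <- quartiles$q75 + 1.5 * quartiles$iqr
print(quartiles)
```

Points beyond a fence are drawn individually as outliers, so the upper fence gives a rough lower bound for where the high outliers begin.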

total.sulfur.dioxide Histogram, Boxplot and Summary univariate statistics

## 
## min value: 9 
## 25th quantile: 108 
## median (orange color): 134 
## mean (green color): 138.3607 
## mode (pink color): 111 
## 75th quantile: 167 
## IQR value: 59 
## max value: 440 
## skewness value: 0.3904706 
## kurtosis value: 0.5685873

The distribution of total sulfur dioxide is almost normal with a slight right skew (skewness 0.39), and the feature covers a large range. The mean is 138.4 mg/dm^3, while the maximum is 440 mg/dm^3. Its kurtosis of 0.57 is also close to that of a normal distribution.

density Histogram, Boxplot and Summary univariate statistics

## 
## min value: 0.98711 
## 25th quantile: 0.9917225 
## median (orange color): 0.99374 
## mean (green color): 0.9940274 
## mode (pink color): 0.992 
## 75th quantile: 0.9961 
## IQR value: 0.0043775 
## max value: 1.03898 
## skewness value: 0.9771742 
## kurtosis value: 9.777368

Density is close to normal with a right skew (skewness 0.977), and it is the feature with the fewest outliers. The minimum is 0.9871 g/cm^3 and the maximum is 1.0390 g/cm^3.

pH Histogram, Boxplot and Summary univariate statistics

## 
## min value: 2.72 
## 25th quantile: 3.09 
## median (orange color): 3.18 
## mean (green color): 3.188267 
## mode (pink color): 3.14 
## 75th quantile: 3.28 
## IQR value: 0.19 
## max value: 3.82 
## skewness value: 0.4575022 
## kurtosis value: 0.5275677

The pH feature is almost normally distributed with a mean of 3.19, and a few outliers below and above the mean.

sulphates Histogram, Boxplot and Summary univariate statistics

## 
## min value: 0.22 
## 25th quantile: 0.41 
## median (orange color): 0.47 
## mean (green color): 0.4898469 
## mode (pink color): 0.5 
## 75th quantile: 0.55 
## IQR value: 0.14 
## max value: 1.08 
## skewness value: 0.4870435 
## kurtosis value: -0.6998768

Sulphates is slightly right skewed, with a few outliers above the mean of 0.4898, starting at about 0.8 g/dm^3. The minimum value is 0.22 and the max value is 1.08 g/dm^3.

alcohol Histogram, Boxplot and Summary univariate statistics

## 
## min value: 8 
## 25th quantile: 9.5 
## median (orange color): 10.4 
## mean (green color): 10.51 
## mode (pink color): 9.4 
## 75th quantile: 11.4 
## IQR value: 1.9 
## max value: 14.2 
## skewness value: 0.4870435 
## kurtosis value: -0.6998768

The alcohol content of the white wines has a fairly flat distribution, ranging from 8 to 14.2% by volume. It is worth noting that alcohol has no outliers at all.

White wine Quality Histogram

Finally, let's look at the white wines’ quality distribution. The histogram shows that most of the wines in the dataset have a quality between 5 and 7, and that the most common score is 6.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 4898 observations of 12 features. All of the features are numeric; quality is a discrete integer score on a scale from 0 to 10, with observed values from 3 to 9.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality, as that is the feature that can be predicted from the others.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I believe that residual sugar, volatile acidity, and alcohol will help support the exploration of my feature of interest, as the first two features have skewed distributions and alcohol has a relatively flat (large spread) distribution.

Did you create any new variables from existing variables in the dataset?

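No new features were combined from existing ones; the only derived variables are the log-transformed versions of residual sugar and chlorides.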

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed residual sugar and chlorides, since both features were highly right skewed.

Bivariate Plots Section

In this section I work with bivariate plots and the correlations between the dataset’s features. The following diagram shows the positive and negative correlations between the white wine dataset’s features.

Correlation between Features

The correlation plot highlights the following correlations between features:
1. alcohol - quality, strong positive correlation, (0.44 pearson)
2. total.sulfur.dioxide - quality, small negative correlation, (-0.17 pearson)
3. density - quality, negative correlation, (-0.31 pearson)
4. chlorides.log - quality, negative correlation, (-0.27 pearson)
5. fixed.acidity - quality, negative correlation, (-0.11 pearson)
6. volatile.acidity - quality, negative correlation, (-0.19 pearson)
7. alcohol - density, very strong negative correlation, (-0.78 pearson)
8. pH - fixed.acidity, strong negative correlation, (-0.43 pearson)
9. density - residual.sugar, very strong positive correlation (0.84 pearson)
10. alcohol - residual.sugar.log, strong negative correlation (-0.39 pearson)
11. density - residual.sugar.log, very strong positive correlation (0.76 pearson)
12. alcohol - total.sulfur.dioxide, strong negative correlation (-0.45 pearson)
13. density - total.sulfur.dioxide, strong positive correlation (0.53 pearson)
14. free.sulfur.dioxide - total.sulfur.dioxide, strong positive correlation, (0.62 pearson)
15. total.sulfur.dioxide - residual.sugar.log, strong positive correlation, (0.53 pearson)
16. alcohol - chlorides.log, strong negative correlation, (-0.5 pearson)

Let's explore these combinations one by one by plotting them. Since the goal is to understand white wine quality, I start with the relations between quality and the other variables.
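Pearson coefficients like the ones listed here can be obtained with base R's `cor()`. The sketch below runs on a small simulated data frame, since the original data file is not bundled with this text; the column names mirror the wine features, but the values and the simulation parameters are purely illustrative.

```r
# Pearson correlation matrix on simulated data (illustrative values only).
set.seed(42)
n <- 500
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.0015)
quality <- 2.6 + 0.31 * alcohol + rnorm(n, sd = 0.8)
toy <- data.frame(alcohol, density, quality)

corr <- cor(toy, method = "pearson")
print(round(corr, 2))

# Test whether a single coefficient differs from zero.
ct <- cor.test(toy$alcohol, toy$quality, method = "pearson")
print(ct$estimate)
```

On the real data, the same call applied to the wine data frame would yield the full matrix that a correlation plot visualizes.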

quality - alcohol bivariate plot

The above violin plot does not lead to a firm conclusion across all quality levels. From quality 6 to 9, however, it shows that better wines generally have higher alcohol levels.

quality - density bivariate plot

The relationship between density and quality seems fairly clear: higher quality wines (rating of 7 or higher) appear to have lower density than lower quality wines (rating of 5 or lower), based on the large differences in median density between the quality extremes.

quality - chlorides bivariate plot

It appears that, in general, higher quality wines have lower chloride levels, since the median chlorides value drops as white wine quality increases.
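The medians that these per-quality plots display can also be tabulated directly with `tapply()`. The sketch below uses a small hand-made data frame with hypothetical chlorides values (not taken from the wine data) to show the mechanics.

```r
# Median of a feature within each quality level.
# The chlorides values below are hypothetical, chosen only for illustration.
toy <- data.frame(
  quality   = rep(c(5, 6, 7), each = 5),
  chlorides = c(0.040, 0.047, 0.050, 0.052, 0.110,   # quality 5
                0.035, 0.041, 0.043, 0.046, 0.090,   # quality 6
                0.028, 0.033, 0.036, 0.039, 0.044)   # quality 7
)
medians <- tapply(toy$chlorides, toy$quality, median)
print(medians)
```

Using the median rather than the mean keeps the comparison robust to the high outliers seen in the univariate plots.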

quality - total.sulfur.dioxide bivariate plot

There seems to be a pattern between white wine quality and total.sulfur.dioxide: higher quality wines tend to have lower amounts of total.sulfur.dioxide. In particular, from quality level 6 to 9 the median total.sulfur.dioxide decreases steadily.

quality - fixed.acidity bivariate plot

There seems to be a pattern between white wine quality and the fixed.acidity distributions: higher quality wines tend to have lower amounts of fixed.acidity. In particular, from quality level 3 to 8 the median fixed.acidity decreases steadily. The exception is quality 9, whose fixed.acidity distribution does not follow this pattern.

quality - volatile.acidity bivariate plot

alcohol - density scatterplot

The alcohol - density scatterplot exhibits a very strong negative correlation (-0.78 pearson), which is apparent in the graph above.

pH - fixed.acidity scatterplot

The above scatterplot shows a negative correlation between the wines’ pH and fixed.acidity. As pH increases, fixed acidity decreases, which makes sense: a lower pH means a more acidic solution, so wines with more acid tend to have a lower pH.

density - residual.sugar scatterplot

The density - residual.sugar scatterplot exhibits a very strong positive correlation (0.84 pearson).

alcohol - residual.sugar scatterplot

In addition, the alcohol - residual.sugar scatterplot presents a strong negative correlation (-0.45 pearson): as the level of alcohol in the wines increases, the amount of residual sugar tends to decrease.

alcohol - total.sulfur.dioxide scatterplot

The scatterplot of alcohol (% by volume) against total.sulfur.dioxide shows a strong negative correlation (-0.45 pearson).

density - total.sulfur.dioxide scatterplot

Furthermore, white wine density and total.sulfur.dioxide appear to be positively correlated (Pearson coefficient of 0.53).

free.sulfur.dioxide - total.sulfur.dioxide scatterplot

The free.sulfur.dioxide - total.sulfur.dioxide scatterplot depicts a strong positive correlation (0.62 pearson).

total.sulfur.dioxide - residual.sugar log transformed scatterplot

The scatterplot of log-transformed residual.sugar against total.sulfur.dioxide shows a strong positive correlation (0.53 pearson).

alcohol - chlorides log transformed scatterplot

Furthermore, there is another strong negative correlation, between alcohol (% by volume) and log-transformed chlorides (-0.5 pearson).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the correlation table, a wide variety of feature pairs correlate either positively or negatively. The following list shows correlations between the dataset’s features:

alcohol - density, very strong negative correlation, (-0.78 pearson)
pH - fixed.acidity, strong negative correlation, (-0.43 pearson)
density - residual.sugar, very strong positive correlation (0.84 pearson)
alcohol - residual.sugar.log, strong negative correlation (-0.39 pearson)
density - residual.sugar.log, very strong positive correlation (0.76 pearson)
alcohol - total.sulfur.dioxide, strong negative correlation (-0.45 pearson)
density - total.sulfur.dioxide, strong positive correlation (0.53 pearson)
free.sulfur.dioxide - total.sulfur.dioxide, strong positive correlation, (0.62 pearson)
total.sulfur.dioxide - residual.sugar.log, strong positive correlation, (0.53 pearson)
alcohol - chlorides.log, strong negative correlation, (-0.5 pearson)

For the feature of interest, white wine quality, several features exhibit some correlation, and the violin plots support this. The following list presents the features that correlate with white wine quality:

alcohol - quality, strong positive correlation, (0.44 pearson)
total.sulfur.dioxide - quality, small negative correlation, (-0.17 pearson)
density - quality, negative correlation, (-0.31 pearson)
chlorides.log - quality, negative correlation, (-0.27 pearson)
fixed.acidity - quality, negative correlation, (-0.11 pearson)
volatile.acidity - quality, negative correlation, (-0.19 pearson)

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Something peculiar appeared in the correlation table: fixed acidity has a moderate negative correlation with pH, while volatile acidity does not appear to have a strong relationship with pH.

What was the strongest relationship you found?

The strongest positive relationship that I found was between density and residual sugar (0.84 pearson; 0.76 with log-transformed residual sugar). The strongest negative relationship was between alcohol and density (-0.78 pearson).

Multivariate Plots Section

In this part of the project I recreate the previous scatterplots with one additional feature: white wine quality, the feature of interest. This makes it possible to show not only the relation between two features but also how white wine quality varies across that relation.

Density and Alcohol scatterplot based on Quality

The scatterplot above shows the negative relation between alcohol level and wine density. It also carries additional information about quality: as alcohol increases and density decreases, the quality of the white wine tends to increase.

pH and fixed.acidity scatterplot based on Quality

The scatterplot above shows another negative relation, between fixed.acidity and pH. However, no firm conclusion about quality can be drawn from it: the quality levels are spread fairly evenly across the graph. At most, there is a hint that wines with pH close to 2.7 - 2.9 and fixed.acidity close to 10 - 11 are of higher quality, and that quality drops as pH increases and fixed.acidity decreases.

density and residual.sugar log transformed scatterplot based on Quality

The scatter plot above presents the positive relationship between density and log-transformed residual sugar. There is one outlier with a density of about 1.04 g/cm^3 and residual sugar of about 65 g/dm^3. As we see, better rated wines have lower densities and higher sugar.

alcohol and residual.sugar log transformed scatterplot based on Quality

The scatter plot above depicts the negative relationship between alcohol percentage and residual.sugar. As the alcohol percentage increases and the residual sugar (grams per liter) decreases, white wine quality tends to increase.

alcohol and total.sulfur.dioxide scatterplot based on Quality

The scatter plot above shows the negative relationship between alcohol percentage and total.sulfur.dioxide. As the alcohol percentage increases and total.sulfur.dioxide decreases, white wine quality tends to increase.

density and total.sulfur.dioxide scatterplot based on Quality

The scatter plot above shows the positive relationship between density and total.sulfur.dioxide. As both density and total.sulfur.dioxide increase, white wine quality tends to decrease; the wines with lower density and lower total.sulfur.dioxide are the ones with higher quality.

free.sulfur.dioxide and total.sulfur.dioxide scatterplot based on Quality

The scatter plot above shows the positive relationship between free.sulfur.dioxide and total.sulfur.dioxide. It is worth mentioning that as both free.sulfur.dioxide and total.sulfur.dioxide increase, white wine quality also tends to increase: wines with smaller amounts of both show lower quality, and quality rises as the two values increase.

total.sulfur.dioxide and residual.sugar log transformed scatterplot based on Quality

Looking at the scatterplot of total.sulfur.dioxide against residual.sugar.log with respect to quality, one may say that as both total.sulfur.dioxide and residual.sugar.log increase, white wine quality drops from 9 to about 5 or 4.

alcohol and chlorides log transformed scatterplot based on Quality

In addition, the above scatterplot is easy to read with respect to quality: white wine quality increases as the alcohol percentage increases and the log-transformed chlorides values decrease.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The strongest relationships with quality are those of alcohol and density. Additionally, sugar has a strong positive relationship with density. Since better rated wines have a lower density, they tend to have less sugar. It seems that sweet wines are not liked as much by the experts.

Were there any interesting or surprising interactions between features?

I found it interesting that sugar and density are positively correlated, since I would not expect that these two features were related.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I will try to create several linear regression models based on the features that correlate strongly with each other and with white wine quality:
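The step-by-step model building that follows can be sketched with `lm()` and `update()`. The example below is a minimal illustration on simulated data; the data frame `wine_toy` and its coefficients are hypothetical, and the report's own code (which is not shown in the output) may differ.

```r
# Nested linear models built with lm() and update() on simulated data.
set.seed(1)
n <- 300
wine_toy <- data.frame(
  alcohol          = rnorm(n, 10.5, 1.2),
  volatile.acidity = rnorm(n, 0.28, 0.10)
)
wine_toy$quality <- 2.6 + 0.3 * wine_toy$alcohol -
  2 * wine_toy$volatile.acidity + rnorm(n, sd = 0.8)

m1 <- lm(quality ~ alcohol, data = wine_toy)
m2 <- update(m1, . ~ . + volatile.acidity)   # add one predictor to m1

r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 4))
print(anova(m1, m2))   # F-test for the added predictor
```

Multiple R-squared can never decrease when a predictor is added to a nested least-squares model, so the adjusted R-squared and the coefficient p-values are the more useful guides when deciding whether a new predictor helps.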

Building linear regression model, predicting quality with alcohol as predictor.

## 
## Call:
## lm(formula = quality ~ alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5317 -0.5286  0.0012  0.4996  3.1579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.582009   0.098008   26.34   <2e-16 ***
## alcohol     0.313469   0.009258   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16
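Two quantities in this output can be checked by hand: the t value is the estimate divided by its standard error, and with a single predictor the multiple R-squared is the square of the Pearson correlation. The sketch below uses only numbers printed above.

```r
# Consistency checks on the single-predictor model output above.
estimate  <- 0.313469
std_error <- 0.009258
r_squared <- 0.1897

t_value <- estimate / std_error   # t statistic = estimate / standard error
r_abs   <- sqrt(r_squared)        # |Pearson r| when there is one predictor
print(c(t_value = t_value, r_abs = r_abs))
```

The square root of R-squared agrees with the 0.44 correlation between alcohol and quality reported in the bivariate section, so alcohol alone explains about 19% of the variance in quality.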

Updating previous linear model, adding new predictor: density

## 
## Call:
## lm(formula = quality ~ alcohol + density, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5670 -0.5242 -0.0003  0.4881  3.0898 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -22.49170    6.16503  -3.648 0.000267 ***
## alcohol       0.36036    0.01478  24.389  < 2e-16 ***
## density      24.72842    6.07937   4.068 4.82e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.796 on 4895 degrees of freedom
## Multiple R-squared:  0.1925, Adjusted R-squared:  0.1921 
## F-statistic: 583.3 on 2 and 4895 DF,  p-value: < 2.2e-16

Updating previous linear model, adding new predictor: chlorides log transformed

## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6432 -0.5172  0.0023  0.4850  3.0462 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -23.06047    6.15128  -3.749  0.00018 ***
## alcohol         0.33431    0.01565  21.365  < 2e-16 ***
## density        24.95402    6.06493   4.114 3.94e-05 ***
## chlorides.log  -0.19638    0.03959  -4.961 7.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7941 on 4894 degrees of freedom
## Multiple R-squared:  0.1965, Adjusted R-squared:  0.196 
## F-statistic: 398.9 on 3 and 4894 DF,  p-value: < 2.2e-16

Updating previous linear model, adding new predictor: fixed.acidity

## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity, 
##     data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6405 -0.5241 -0.0066  0.4780  3.2115 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -33.32722    6.34530  -5.252 1.57e-07 ***
## alcohol         0.34718    0.01572  22.081  < 2e-16 ***
## density        35.74178    6.28478   5.687 1.37e-08 ***
## chlorides.log  -0.19882    0.03943  -5.042 4.78e-07 ***
## fixed.acidity  -0.08748    0.01404  -6.231 5.02e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7911 on 4893 degrees of freedom
## Multiple R-squared:  0.2028, Adjusted R-squared:  0.2022 
## F-statistic: 311.2 on 4 and 4893 DF,  p-value: < 2.2e-16

Updating previous linear model, adding new predictor: volatile.acidity

## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4471 -0.4952 -0.0352  0.4710  3.1943 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -48.79075    6.17952  -7.896 3.54e-15 ***
## alcohol            0.39566    0.01539  25.705  < 2e-16 ***
## density           51.64505    6.12476   8.432  < 2e-16 ***
## chlorides.log     -0.14037    0.03819  -3.675  0.00024 ***
## fixed.acidity     -0.10077    0.01357  -7.426 1.31e-13 ***
## volatile.acidity  -2.08196    0.10992 -18.940  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7636 on 4892 degrees of freedom
## Multiple R-squared:  0.2573, Adjusted R-squared:  0.2565 
## F-statistic: 338.9 on 5 and 4892 DF,  p-value: < 2.2e-16

Updating previous linear model, adding new predictor: pH

## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4414 -0.4949 -0.0336  0.4705  3.1981 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -48.95916    6.20865  -7.886 3.83e-15 ***
## alcohol            0.39632    0.01557  25.458  < 2e-16 ***
## density           51.89432    6.18825   8.386  < 2e-16 ***
## chlorides.log     -0.14048    0.03820  -3.678 0.000238 ***
## fixed.acidity     -0.10264    0.01509  -6.800 1.17e-11 ***
## volatile.acidity  -2.08414    0.11021 -18.911  < 2e-16 ***
## pH                -0.02296    0.08106  -0.283 0.777041    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7637 on 4891 degrees of freedom
## Multiple R-squared:  0.2573, Adjusted R-squared:  0.2564 
## F-statistic: 282.4 on 6 and 4891 DF,  p-value: < 2.2e-16

Updating previous linear model, adding new predictor: total.sulfur.dioxide

## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4342 -0.5004 -0.0296  0.4684  3.2203 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4.426e+01  6.490e+00  -6.821 1.02e-11 ***
## alcohol               3.981e-01  1.558e-02  25.558  < 2e-16 ***
## density               4.706e+01  6.487e+00   7.254 4.68e-13 ***
## chlorides.log        -1.493e-01  3.834e-02  -3.893   0.0001 ***
## fixed.acidity        -1.020e-01  1.509e-02  -6.762 1.52e-11 ***
## volatile.acidity     -2.110e+00  1.106e-01 -19.069  < 2e-16 ***
## pH                   -3.459e-02  8.115e-02  -0.426   0.6699    
## total.sulfur.dioxide  7.583e-04  3.070e-04   2.470   0.0135 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7633 on 4890 degrees of freedom
## Multiple R-squared:  0.2582, Adjusted R-squared:  0.2572 
## F-statistic: 243.2 on 7 and 4890 DF,  p-value: < 2.2e-16

Updating previous linear model, adding new predictor: residual.sugar log transformed

## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + residual.sugar.log, 
##     data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5375 -0.4928 -0.0503  0.4658  3.1049 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.956e+01  1.126e+01   3.512 0.000448 ***
## alcohol               3.048e-01  1.856e-02  16.418  < 2e-16 ***
## density              -3.806e+01  1.138e+01  -3.345 0.000830 ***
## chlorides.log        -9.789e-02  3.845e-02  -2.546 0.010927 *  
## fixed.acidity        -2.789e-02  1.705e-02  -1.635 0.102026    
## volatile.acidity     -2.112e+00  1.097e-01 -19.251  < 2e-16 ***
## pH                    3.214e-01  8.955e-02   3.589 0.000335 ***
## total.sulfur.dioxide  4.676e-04  3.061e-04   1.527 0.126768    
## residual.sugar.log    2.198e-01  2.424e-02   9.068  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7571 on 4889 degrees of freedom
## Multiple R-squared:  0.2705, Adjusted R-squared:  0.2693 
## F-statistic: 226.6 on 8 and 4889 DF,  p-value: < 2.2e-16

Updating previous linear model, adding new predictor: citric.acid

## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + residual.sugar.log + 
##     citric.acid, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5514 -0.4930 -0.0481  0.4668  3.1010 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.049e+01  1.132e+01   3.576 0.000352 ***
## alcohol               3.029e-01  1.870e-02  16.199  < 2e-16 ***
## density              -3.902e+01  1.144e+01  -3.410 0.000654 ***
## chlorides.log        -1.002e-01  3.855e-02  -2.598 0.009394 ** 
## fixed.acidity        -2.992e-02  1.724e-02  -1.736 0.082678 .  
## volatile.acidity     -2.096e+00  1.115e-01 -18.798  < 2e-16 ***
## pH                    3.284e-01  8.997e-02   3.650 0.000265 ***
## total.sulfur.dioxide  4.449e-04  3.074e-04   1.447 0.147887    
## residual.sugar.log    2.213e-01  2.431e-02   9.103  < 2e-16 ***
## citric.acid           7.784e-02  9.615e-02   0.810 0.418248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7571 on 4888 degrees of freedom
## Multiple R-squared:  0.2706, Adjusted R-squared:  0.2692 
## F-statistic: 201.5 on 9 and 4888 DF,  p-value: < 2.2e-16

Updating previous linear model, adding new predictor: free.sulfur.dioxide

## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + residual.sugar.log + 
##     citric.acid + free.sulfur.dioxide, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8335 -0.4928 -0.0372  0.4602  3.1008 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.573e+01  1.133e+01   3.153 0.001625 ** 
## alcohol               3.052e-01  1.866e-02  16.359  < 2e-16 ***
## density              -3.435e+01  1.145e+01  -3.000 0.002715 ** 
## chlorides.log        -9.683e-02  3.846e-02  -2.518 0.011850 *  
## fixed.acidity        -2.145e-02  1.727e-02  -1.242 0.214319    
## volatile.acidity     -1.979e+00  1.136e-01 -17.422  < 2e-16 ***
## pH                    3.403e-01  8.977e-02   3.791 0.000152 ***
## total.sulfur.dioxide -6.650e-04  3.756e-04  -1.771 0.076669 .  
## residual.sugar.log    2.063e-01  2.443e-02   8.446  < 2e-16 ***
## citric.acid           5.901e-02  9.597e-02   0.615 0.538683    
## free.sulfur.dioxide   4.304e-03  8.407e-04   5.119 3.19e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7551 on 4887 degrees of freedom
## Multiple R-squared:  0.2745, Adjusted R-squared:  0.273 
## F-statistic: 184.9 on 10 and 4887 DF,  p-value: < 2.2e-16

Updating previous linear model, adding new predictor: sulphates

## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + residual.sugar.log + 
##     citric.acid + free.sulfur.dioxide + sulphates, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8524 -0.4881 -0.0356  0.4611  3.1109 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.578e+01  1.147e+01   3.991 6.68e-05 ***
## alcohol               2.905e-01  1.883e-02  15.426  < 2e-16 ***
## density              -4.452e+01  1.159e+01  -3.841 0.000124 ***
## chlorides.log        -9.833e-02  3.836e-02  -2.563 0.010400 *  
## fixed.acidity        -1.502e-02  1.727e-02  -0.870 0.384525    
## volatile.acidity     -1.950e+00  1.134e-01 -17.195  < 2e-16 ***
## pH                    3.132e-01  8.969e-02   3.491 0.000485 ***
## total.sulfur.dioxide -8.828e-04  3.770e-04  -2.342 0.019236 *  
## residual.sugar.log    2.280e-01  2.473e-02   9.221  < 2e-16 ***
## citric.acid           3.870e-02  9.581e-02   0.404 0.686319    
## free.sulfur.dioxide   4.380e-03  8.387e-04   5.222 1.84e-07 ***
## sulphates             5.048e-01  9.847e-02   5.126 3.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7532 on 4886 degrees of freedom
## Multiple R-squared:  0.2784, Adjusted R-squared:  0.2767 
## F-statistic: 171.3 on 11 and 4886 DF,  p-value: < 2.2e-16

I fitted several linear models to the dataset, with the aim of predicting white wine quality from the other features. The predictors include features that correlate with quality as well as features that correlate with one another.
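The incremental model building shown in the output above can be sketched with `update()`, which refits a model after adding a term to its formula. This is a minimal sketch and not necessarily the code behind this report: `wine_demo` is a small synthetic stand-in (hypothetical values, only the column names are borrowed from the wine data), and the predictors chosen here are just an illustration.

```r
# Synthetic stand-in for the wine data frame (hypothetical values, NOT the real dataset).
set.seed(1)
n <- 200
wine_demo <- data.frame(
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.0),
  pH               = runif(n, 2.8, 3.8)
)
# Assumed relationship used only to generate demo data; pH has no effect here.
wine_demo$quality <- 3 + 0.3 * wine_demo$alcohol -
  2 * wine_demo$volatile.acidity + rnorm(n, sd = 0.7)

# Start from a single predictor, then add one predictor at a time.
m1 <- lm(quality ~ alcohol, data = wine_demo)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + pH)

# Collect R-squared and adjusted R-squared for each nested model.
fits <- list(m1 = m1, m2 = m2, m3 = m3)
r2_table <- data.frame(
  model  = names(fits),
  r2     = sapply(fits, function(m) summary(m)$r.squared),
  adj_r2 = sapply(fits, function(m) summary(m)$adj.r.squared)
)
print(r2_table)
```

Plain R-squared can never decrease when a predictor is added to a nested least-squares model, so adjusted R-squared is the more informative column when judging whether a new term is worth keeping.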

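As I added features to the model, the fit improved, but only modestly: R^2 rose from 0.2028 with four predictors to 0.2784 with all eleven, and some additions (pH, citric.acid) left it essentially unchanged. Even the full model explains less than 28% of the variance in quality, so these linear models do not characterize the data well.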
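Whether a newly added predictor improves a model beyond chance can be checked with an F-test between the nested models, using `anova()`. In the output above, for instance, pH entered with a p-value of 0.777 and left adjusted R-squared essentially flat. The sketch below uses synthetic data with hypothetical names (`demo`, `x1`, `x2`, `y`), where `x2` is constructed to have no true effect.

```r
# Synthetic data (hypothetical): y depends on x1 only; x2 is pure noise.
set.seed(2)
n <- 300
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$y <- 1 + 0.5 * demo$x1 + rnorm(n)

# Nested models: 'big' adds x2 to 'small'.
small <- lm(y ~ x1, data = demo)
big   <- lm(y ~ x1 + x2, data = demo)

# F-test comparing the two fits; row 2 holds the test for the added term.
cmp <- anova(small, big)
print(cmp)
p_value <- cmp[["Pr(>F)"]][2]

# Adjusted R-squared of both models, for comparison with the F-test.
print(c(small = summary(small)$adj.r.squared,
        big   = summary(big)$adj.r.squared))
```

When exactly one term is added, this F-test gives the same p-value as the t-test for that coefficient in the summary of the larger model.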

To improve the model, I may need additional data and further feature engineering. There is likely a better method than linear regression for this prediction task. Another approach would be to fit a separate linear model to different portions of the quality range.
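One possible reading of the last idea is to split the wines into quality bands and fit one model per band. The sketch below is only an illustration on synthetic data: the data frame `demo`, the cut point at quality 5, and the single predictor are all assumptions, not choices made in this analysis.

```r
# Synthetic data (hypothetical): integer quality scores loosely driven by alcohol.
set.seed(3)
n <- 400
demo <- data.frame(alcohol = runif(n, 8, 14))
raw_score <- 1 + 0.45 * demo$alcohol + rnorm(n, sd = 0.8)
demo$quality <- round(pmin(9, pmax(3, raw_score)))

# Assumed cut point: "lower" band is quality <= 5, "upper" band is quality >= 6.
demo$band <- ifelse(demo$quality <= 5, "lower", "upper")

# Fit one linear model per band.
band_fits <- lapply(split(demo, demo$band),
                    function(d) lm(quality ~ alcohol, data = d))

# Compare intercepts and slopes across bands.
print(sapply(band_fits, coef))
```

Because the split is made on the response itself, the range of quality is restricted within each band, so the per-band fits are descriptive and should not be read as independent predictive models.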

Final Plots and Summary

Plot One

Description One

The most common quality rating in the dataset is 6. Moreover, the quality of white wine is roughly normally distributed, with most wines rated between 5 and 7.

Plot Two

Description Two

The scatterplot above shows the relationship between alcohol and chlorides (log-transformed) with respect to white wine quality. Quality tends to increase as the alcohol percentage increases and, conversely, as the (log-transformed) chloride level decreases.

Plot Three

Description Three

This combined violin plot and boxplot depicts the moderate positive relationship between alcohol and quality (Pearson's r = 0.44).
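A Pearson coefficient such as the one quoted above can be computed, along with a confidence interval, using `cor.test()`. The sketch below runs on synthetic vectors (hypothetical values that merely reuse the names `alcohol` and `quality`), so the number it prints is not the 0.44 reported for the wine data.

```r
# Synthetic vectors (hypothetical values, NOT the real wine data).
set.seed(4)
n <- 500
alcohol <- runif(n, 8, 14)
quality <- 2 + 0.35 * alcohol + rnorm(n, sd = 1.2)

# Pearson correlation test: estimate, confidence interval, and p-value.
ct <- cor.test(alcohol, quality, method = "pearson")
r  <- unname(ct$estimate)

print(r)            # sample correlation coefficient
print(ct$conf.int)  # 95% confidence interval by default
```

On the real data, the same call would take the two columns of the data frame, for example `wine$alcohol` and `wine$quality`.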

Reflection

From this exploratory analysis, we observed that good wines tend to have higher alcohol content and lower density, chlorides, volatile acidity, fixed acidity, and total sulfur dioxide. Because density increases with sugar content, sugar might detract from the flavor of wine, while alcohol appears to benefit it. This analysis is based on correlation, so it does not imply any causal relationship between the dataset's features.

One limitation of the analysis is that the dataset covers white wines from a single region, so the relationships between the variables might not hold for other types of wine. Additionally, quality ratings might vary from place to place. To extend the study, we could gather data on white wines from other regions around the world.

References: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt