This project is an exploratory data analysis in R of the white wine dataset, focusing on the relationships between its features and the “Quality” rating.
The analysis proceeds through univariate, bivariate, and multivariate sections, with a final summary and reflection at the end. The original dataset can be found here:
This dataset contains information about white variants of the Portuguese Vinho Verde wine. It includes 4898 observations of 12 features. Eleven of the features are chemical variables (independent variables), and the remaining feature is wine quality (dependent variable), a subjective measure that is the median of the opinions of at least three wine experts. Specifically, the features are:
fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
volatile acidity: the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste
citric acid: found in small quantities, citric acid can add freshness and flavor to wines
residual sugar: the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
quality: discrete score between 0 (worst) and 10 (best).
Their corresponding units of measurement are the following.
Input variables (based on physicochemical tests):
- fixed acidity (tartaric acid - g / dm^3)
- volatile acidity (acetic acid - g / dm^3)
- citric acid (g / dm^3)
- residual sugar (g / dm^3)
- chlorides (sodium chloride - g / dm^3)
- free sulfur dioxide (mg / dm^3)
- total sulfur dioxide (mg / dm^3)
- density (g / cm^3)
- pH
- sulphates (potassium sulphate - g / dm^3)
- alcohol (% by volume)
Output variable (based on sensory data):
- quality (score between 0 and 10)
(These descriptions have been taken from the dataset’s main site.)
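As a minimal sketch of how structure and summary printouts for this dataset might be produced, assuming the data sits in a CSV file with a leading counter column named X. The helper name `load_wine` and the file path are hypothetical and are not taken from the original analysis.

```r
# Hypothetical helper: read the wine CSV and drop the row-counter column X.
load_wine <- function(path) {
  wine <- read.csv(path)
  wine$X <- NULL  # X only numbers the observations (assumed to be present)
  wine
}

# Usage (the path is a placeholder; not run here):
# wine <- load_wine("path/to/white-wine.csv")
# str(wine)      # variable types and first values
# summary(wine)  # min, quartiles, mean, and max per variable
```

`str()` reports the type and first values of each column, while `summary()` gives the five-number summary plus the mean.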
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000  
There are 4,898 observations, and each has 12 variables of interest. We will exclude the variable X from the analysis (because X is simply a counter for each observation, from 1 to 4,898). There are 11 chemical properties (e.g., fixed acidity, volatile acidity) and 1 measure of quality. The main feature in the dataset is ‘quality’, since this is the ultimate measure of each wine and is the variable that one would like to predict.
It is important to note that the minimum white wine quality is 3 and the maximum is 9; no wine is rated below 3 or above 9. The median quality is 6 and the mean is 5.878.
Every other feature has a minimum value greater than 0, except for citric acid. Half of the pH values fall between 3.09 and 3.28 (the first and third quartiles).
Residual sugar may have an interesting distribution because its third quartile and maximum value are high relative to its mean and median.
## 
## min value: 0.08 
## 25th quantile: 0.21 
## median (orange color): 0.26 
## mean (green color): 0.2782411 
## mode (pink color): 0.28 
## 75th quantile: 0.32 
## max value: 1.1 
## IQR value: 0.11 
## skewness value: 1.576014 
## kurtosis value: 5.081904
Volatile acidity is right skewed (skewness of 1.58), with many outliers above about 0.45 g/dm^3, while the median is 0.26 g/dm^3 and the minimum is 0.08 g/dm^3. Its kurtosis is 5.08, which means that the distribution is leptokurtic.
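The report does not say which functions computed these per-feature statistics. The sketch below is one way to obtain them in base R, under two assumptions: moment-based (population) estimators, and kurtosis reported as excess kurtosis, so that a normal distribution scores 0. The function name `describe_feature` is hypothetical.

```r
# Sketch of the per-feature statistics; the estimators are assumptions.
describe_feature <- function(x) {
  m <- mean(x)
  s2 <- mean((x - m)^2)  # population variance (divides by n)
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  counts <- table(x)
  list(
    min = min(x),
    q25 = q[1],
    median = q[2],
    mean = m,
    mode = as.numeric(names(counts)[which.max(counts)]),  # most frequent value
    q75 = q[3],
    max = max(x),
    iqr = q[3] - q[1],
    skewness = mean((x - m)^3) / s2^1.5,
    excess_kurtosis = mean((x - m)^4) / s2^2 - 3  # 0 for a normal distribution
  )
}

stats <- describe_feature(c(1, 2, 2, 3, 10))
stats$mode      # most frequent value
stats$skewness  # positive, reflecting the long right tail
```

Different packages use slightly different estimators, so values computed this way may differ in the last digits from those printed in the report.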
## 
## min value: 0 
## 25th quantile: 0.27 
## median (orange color): 0.32 
## mean (green color): 0.3341915 
## mode (pink color): 0.3 
## 75th quantile: 0.39 
## IQR value: 0.12 
## max value: 1.66 
## skewness value: 1.281135 
## kurtosis value: 6.163631
Citric acid is roughly normal with a right skew (skewness of 1.28). The histogram shows spikes near 0.5 and 0.75 g/dm^3. Additionally, there are several outliers below and above the mean. Its kurtosis is 6.16, which means that the distribution is leptokurtic.
## 
## min value: 0.6 
## 25th quantile: 1.7 
## median (orange color): 5.2 
## mean (green color): 6.391415 
## mode (pink color): 1.2 
## 75th quantile: 9.9 
## IQR value: 8.2 
## max value: 65.8 
## skewness value: 1.076434 
## kurtosis value: 3.462415
Residual sugar is right skewed (skewness of 1.08), with much of the data near the first quartile of 1.7 g/dm^3. There are a few outliers above about 20 g/dm^3. Its distribution is leptokurtic (kurtosis of 3.46). Because of the pronounced right skew, I will log-transform this feature.
## 
## min value: -0.5108256 
## 25th quantile: 0.5306283 
## median (orange color): 1.648659 
## mean (green color): 1.480928 
## mode (pink color): 0.1823216 
## 75th quantile: 2.292535 
## IQR value: 1.761907 
## max value: 4.18662 
## skewness value: -0.1610582 
## kurtosis value: -1.352864
The above histogram shows that the log transform of residual sugar results in an almost bimodal distribution, with peaks around 0.5 and 2 on the log scale.
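The transformed statistics above are consistent with a natural logarithm: log(0.6) is about -0.5108, which matches the reported minimum. A minimal sketch of the transformation follows, using a tiny stand-in data frame built from the reported minimum, median, and maximum rather than the real data.

```r
# Tiny stand-in data frame using the minimum, median, and maximum reported above.
wine <- data.frame(residual.sugar = c(0.6, 5.2, 65.8))

# Natural log; defined because every value is strictly positive.
wine$residual.sugar.log <- log(wine$residual.sugar)

round(wine$residual.sugar.log, 4)
# -0.5108  1.6487  4.1866
```

The same pattern applies to any strictly positive feature; a feature containing zeros (such as citric acid here) would need a different treatment.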
## 
## min value: 0.009 
## 25th quantile: 0.036 
## median (orange color): 0.043 
## mean (green color): 0.04577236 
## mode (pink color): 0.044 
## 75th quantile: 0.05 
## IQR value: 0.014 
## max value: 0.346 
## skewness value: 5.020254 
## kurtosis value: 37.50849
Chlorides is right skewed, with many outliers above the third quartile of 0.05 g/dm^3, up to a maximum of 0.346 g/dm^3. Its distribution is leptokurtic, with a kurtosis of 37.5, and highly right skewed, with a skewness of 5.02. Because of this heavy skew and high kurtosis, I will log-transform this feature as well.
## 
## min value: -4.710531 
## 25th quantile: -3.324236 
## median (orange color): -3.146555 
## mean (green color): -3.149011 
## mode (pink color): -3.123566 
## 75th quantile: -2.995732 
## IQR value: 0.3285041 
## max value: -1.061317 
## skewness value: 1.133439 
## kurtosis value: 5.289989
The log transformation brings the distribution closer to normal, reducing the skewness from 5.02 to 1.13 and the kurtosis from 37.5 to 5.29. However, the outliers still exist.
## 
## min value: 2 
## 25th quantile: 23 
## median (orange color): 34 
## mean (green color): 35.30808 
## mode (pink color): 29 
## 75th quantile: 46 
## IQR value: 23 
## max value: 289 
## skewness value: 1.405883 
## kurtosis value: 11.44751
Free sulfur dioxide is roughly normal with a right skew (skewness of 1.41) and a few outliers above about 75 mg/dm^3. The median is 34 mg/dm^3, and the maximum is 289 mg/dm^3. Finally, its distribution is leptokurtic, with a kurtosis of 11.45.
## 
## min value: 9 
## 25th quantile: 108 
## median (orange color): 134 
## mean (green color): 138.3607 
## mode (pink color): 111 
## 75th quantile: 167 
## IQR value: 59 
## max value: 440 
## skewness value: 0.3904706 
## kurtosis value: 0.5685873
As for total sulfur dioxide, its distribution is almost normal with a slight right skew (skewness of 0.39), and the feature covers a large range. The mean is 138.4 mg/dm^3, while the maximum is 440 mg/dm^3. Its kurtosis of 0.57 is also close to that of a normal distribution.
## 
## min value: 0.98711 
## 25th quantile: 0.9917225 
## median (orange color): 0.99374 
## mean (green color): 0.9940274 
## mode (pink color): 0.992 
## 75th quantile: 0.9961 
## IQR value: 0.0043775 
## max value: 1.03898 
## skewness value: 0.9771742 
## kurtosis value: 9.777368
Density is roughly normal with a right skew (skewness of 0.98), and it is the feature with the fewest outliers. The minimum is 0.9871 and the maximum is 1.0390 g/cm^3.
## 
## min value: 2.72 
## 25th quantile: 3.09 
## median (orange color): 3.18 
## mean (green color): 3.188267 
## mode (pink color): 3.14 
## 75th quantile: 3.28 
## IQR value: 0.19 
## max value: 3.82 
## skewness value: 0.4575022 
## kurtosis value: 0.5275677
The pH feature is almost normally distributed with a mean of 3.19, and a few outliers below and above the mean.
## 
## min value: 0.22 
## 25th quantile: 0.41 
## median (orange color): 0.47 
## mean (green color): 0.4898469 
## mode (pink color): 0.5 
## 75th quantile: 0.55 
## IQR value: 0.14 
## max value: 1.08 
## skewness value: 0.4870435 
## kurtosis value: -0.6998768
Sulphates is slightly right skewed, with a few outliers above the mean of 0.4898, starting at about 0.8 g/dm^3. The minimum is 0.22 and the maximum is 1.08 g/dm^3.
## 
## min value: 8 
## 25th quantile: 9.5 
## median (orange color): 10.4 
## mean (green color): 0.4898469 
## mode (pink color): 9.4 
## 75th quantile: 11.4 
## IQR value: 1.9 
## max value: 14.2 
## skewness value: 0.4870435 
## kurtosis value: -0.6998768
The alcohol content of white wine has a relatively flat distribution, ranging from 8 to 14.2% by volume, with a median of 10.4% and a mean of 10.51% according to the summary table. It is worth noting that alcohol by volume has no outliers at all.
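The report does not state how outliers were identified. Assuming the common boxplot rule (points more than 1.5 × IQR beyond the quartiles), the alcohol quartiles reported above give fences that contain the whole observed range, which is consistent with the absence of outliers.

```r
# Tukey fences from the reported alcohol quartiles (assumed outlier rule).
q1 <- 9.5
q3 <- 11.4
iqr <- q3 - q1           # 1.9
lower <- q1 - 1.5 * iqr  # 6.65
upper <- q3 + 1.5 * iqr  # 14.25

# The observed range, 8 to 14.2, lies inside the fences.
c(lower = lower, upper = upper)
8 >= lower && 14.2 <= upper  # TRUE
```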
Finally, let's look at the quality distribution of the white wines. The above histogram shows that most of the wines in the dataset are of average to good quality, with ratings between 5 and 7. The most common rating is 6.
The dataset contains 4898 observations of 12 features. All of the features are numeric, and quality is a discrete variable that can take integer values from 0 to 10, although only values from 3 to 9 are observed.
The main feature of interest is quality, as that is the feature that can be predicted from the others.
I believe that residual sugar, volatile acidity, and alcohol will help support my exploration of the feature of interest, as the first two features have skewed distributions, and alcohol has a relatively flat (large spread) distribution.
No, I did not create any new variables beyond the log-transformed features described below.
I log-transformed residual sugar and chlorides, since both features were highly right skewed.
In this section I work with bivariate plots and the correlations between the dataset’s features. The following diagram shows the positive and negative correlations between the white wine dataset’s features.
The correlation plot shows the following correlations, ranging from weak to very strong:
1. alcohol - quality, strong positive correlation (0.44 pearson)
2. total.sulfur.dioxide - quality, small negative correlation (-0.17 pearson)
3. density - quality, negative correlation (-0.31 pearson)
4. chlorides.log - quality, negative correlation (-0.27 pearson)
5. fixed.acidity - quality, negative correlation (-0.11 pearson)
6. volatile.acidity - quality, negative correlation (-0.19 pearson)
7. alcohol - density, very strong negative correlation (-0.78 pearson)
8. pH - fixed.acidity, strong negative correlation (-0.43 pearson)
9. density - residual.sugar, very strong positive correlation (0.84 pearson)
10. alcohol - residual.sugar.log, strong negative correlation (-0.39 pearson)
11. density - residual.sugar.log, very strong positive correlation (0.76 pearson)
12. alcohol - total.sulfur.dioxide, strong negative correlation (-0.45 pearson)
13. density - total.sulfur.dioxide, strong positive correlation (0.53 pearson)
14. free.sulfur.dioxide - total.sulfur.dioxide, strong positive correlation (0.62 pearson)
15. total.sulfur.dioxide - residual.sugar.log, strong positive correlation (0.53 pearson)
16. alcohol - chlorides.log, strong negative correlation (-0.5 pearson)
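Pearson coefficients like those listed above can be computed with base R's `cor()`. The sketch below uses a small synthetic data frame (named `toy`, with made-up relationships) as a stand-in for the wine data, so its numbers are illustrative only.

```r
# Synthetic stand-in data; the real analysis would use the wine data frame.
set.seed(1)
alcohol <- runif(200, 8, 14)
toy <- data.frame(
  alcohol = alcohol,
  density = 1.02 - 0.002 * alcohol + rnorm(200, sd = 0.002),
  quality = round(3 + 0.3 * alcohol + rnorm(200))
)

# Pearson correlation matrix of all numeric columns.
corr <- cor(toy, method = "pearson")
round(corr, 2)

# A single pair, with a test of the null hypothesis of zero correlation.
cor.test(toy$alcohol, toy$density)$estimate
```

Because quality is an ordinal score, a rank-based coefficient (`method = "spearman"`) is a reasonable alternative to compare against.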
Let's start exploring these combinations one by one by plotting them. Since the target of this exploration is the quality of the white wine, I will first plot the relations between quality and the other variables.
The above violin plot does not lead to a single clear conclusion. We can say that from quality 6 to 9, better wines generally have higher alcohol levels.
The relationship between density and quality seems to be quite strong: higher quality wines (rating of 7 or higher) appear to have lower density than lower quality wines (rating of 5 or lower), based on the large differences in median density between the quality extremes.
It appears that, in general, higher quality wines have lower chloride levels, since the median value of chlorides drops as white wine quality increases.
It seems that there is a pattern between white wine quality and total.sulfur.dioxide: higher quality white wines appear to have lower amounts of total.sulfur.dioxide. In particular, from quality levels 6 to 9, the median of the total.sulfur.dioxide distributions shows a steady decrease.
It seems that there is a pattern between white wine quality and the fixed.acidity distributions: higher quality white wines tend to have lower amounts of fixed.acidity. In particular, from quality levels 3 to 8, the median of the fixed.acidity distributions shows a steady decrease. The exception is quality 9, whose fixed.acidity distribution does not follow this pattern.
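The comparisons above rest on how the median of a feature changes across quality levels. As a numerical companion to the violin plots, those medians can be tabulated directly. The sketch below uses a small made-up data frame (`toy`) whose column names mirror the wine data.

```r
# Synthetic stand-in data; column names mirror the wine data frame.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.4, 10.9, 11.2, 11.6, 12.3)
)

# Median alcohol within each quality level.
med <- tapply(toy$alcohol, toy$quality, median)
med
```

Replacing `alcohol` with another column name gives the per-quality medians for that feature.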
The alcohol - density scatterplot exhibits a very strong negative correlation (-0.78 pearson), which is apparent in the graph above.
The above scatterplot shows that there is a negative correlation between the wines’ pH and fixed.acidity. This means that as the pH increases the acidity decreases, which makes perfect sense, because as pH rises from 2 toward 7, wine, like solutions in general, becomes less acidic.
The density - residual.sugar scatterplot exhibits a very strong positive correlation (0.84 pearson).
In addition, the alcohol - residual.sugar scatterplot presents a strong negative correlation (-0.45 pearson), which means that as the level of alcohol in wines increases, the amount of residual sugar decreases.
The scatterplot of alcohol (percent by volume) against total.sulfur.dioxide shows a strong negative correlation (-0.45 pearson).
Furthermore, white wine density and total.sulfur.dioxide appear to correlate positively (Pearson coefficient of 0.53).
The free.sulfur.dioxide - total.sulfur.dioxide scatterplot depicts a strong positive correlation (0.62 pearson).
The scatterplot of log-transformed residual.sugar against total.sulfur.dioxide shows a strong positive correlation (0.53 pearson).
Furthermore, there is another negative correlation, between alcohol (percent by volume) and log-transformed chlorides (-0.5 pearson).
From the correlation table, a wide variety of feature pairs correlate either positively or negatively; the numbered list given with the correlation plot above shows these correlations.
For the feature of interest, white wine quality, several features exhibit some correlation, and the violin plots confirmed it. The features that correlate with white wine quality are alcohol, density, chlorides.log, volatile.acidity, total.sulfur.dioxide, and fixed.acidity.
Something peculiar appeared in the correlation table: fixed acidity has a moderate negative correlation with pH, while volatile acidity does not appear to have a strong relationship with pH.
The strongest positive relationship that I found was between density and residual.sugar (0.84 pearson; 0.76 with residual.sugar.log). The strongest negative relationship was between alcohol and density (-0.78 pearson).
In this part of the project I recreate the previous scatterplots with a new feature added: white wine quality, the feature of interest. This makes it possible to show not only the relation between two features but also how white wine quality responds to that relation.
The scatterplot above shows the negative relation between alcohol and density, with quality as additional information: as the alcohol level increases and the density decreases, the quality of the white wine tends to increase.
The scatterplot above shows another negative relation, between fixed.acidity and pH. However, we cannot conclude much about quality from this relation: quality is spread fairly evenly across the graph, and no firm conclusions can be drawn. At most, there is a hint of higher quality at pH levels of about 2.7 - 2.9 with fixed.acidity of about 10 - 11, with quality dropping as pH increases and fixed.acidity decreases.
The scatter plot above presents the relationship between density and log-transformed residual sugar, which is positive. There is one outlier with a density of about 1.04 g/cm^3 and residual sugar of about 65 g/dm^3. As we see, better rated wines have lower densities and higher sugar.
The scatter plot above depicts the negative relationship between alcohol percentage and residual.sugar. As the alcohol percentage increases and the residual sugar (grams per liter) decreases, white wine quality increases.
The scatter plot above shows the negative relationship between alcohol percentage and total.sulfur.dioxide. As the alcohol percentage increases and total.sulfur.dioxide decreases, white wine quality increases.
The scatter plot above shows the positive relationship between density and total.sulfur.dioxide. As density and total.sulfur.dioxide both increase, white wine quality decreases. The plot makes clear that wines with lower density and lower total.sulfur.dioxide tend to have higher quality.
The scatter plot above shows the positive relationship between free.sulfur.dioxide and total.sulfur.dioxide. It is worth mentioning that as free.sulfur.dioxide and total.sulfur.dioxide both increase, white wine quality also increases. In the plot, wines with small amounts of free.sulfur.dioxide and total.sulfur.dioxide show lower quality, which rises as the values of the two features increase.
Looking at the above scatterplot of total.sulfur.dioxide against residual.sugar.log with respect to white wine quality, one may say that as total.sulfur.dioxide and residual.sugar.log increase, white wine quality drops from 9 to around 5 or 4.
In addition, the above scatterplot speaks for itself: the quality of white wine increases as the alcohol percentage increases and the (log-transformed) chlorides decrease.
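The colour-coded scatterplots read quality off two features at once. One numerical counterpart, offered here only as a sketch and not as part of the original analysis, is to split both features at their medians and average quality within each cell. The data frame `toy` and the helper `half` are hypothetical, and the data are synthetic.

```r
# Synthetic stand-in data (not the wine data).
set.seed(2)
n <- 300
alcohol <- runif(n, 8, 14)
density <- 1.02 - 0.002 * alcohol + rnorm(n, sd = 0.002)
quality <- round(3 + 0.3 * alcohol + rnorm(n))
toy <- data.frame(alcohol, density, quality)

# Split a feature at its median into "low" and "high".
half <- function(x) cut(x, c(-Inf, median(x), Inf), labels = c("low", "high"))

# Mean quality in each alcohol-by-density cell.
cells <- tapply(toy$quality,
                list(alcohol = half(toy$alcohol), density = half(toy$density)),
                mean)
round(cells, 2)
```

A table like this makes it easier to see whether quality varies along one feature, the other, or both, which can be hard to judge from overlapping points.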
The strongest relationships with quality are those of alcohol and density. Additionally, residual sugar has a strong positive relationship with density. Since better rated wines have a lower density, they tend to have less sugar. It seems that sweet wines are not liked as much by the experts.
I found it interesting that sugar and density are positively correlated, since I did not expect these two features to be related.
I will fit several linear regression models based on the features that correlate strongly with each other and with white wine quality:
## 
## Call:
## lm(formula = quality ~ alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5317 -0.5286  0.0012  0.4996  3.1579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.582009   0.098008   26.34   <2e-16 ***
## alcohol     0.313469   0.009258   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5670 -0.5242 -0.0003  0.4881  3.0898 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -22.49170    6.16503  -3.648 0.000267 ***
## alcohol       0.36036    0.01478  24.389  < 2e-16 ***
## density      24.72842    6.07937   4.068 4.82e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.796 on 4895 degrees of freedom
## Multiple R-squared:  0.1925, Adjusted R-squared:  0.1921 
## F-statistic: 583.3 on 2 and 4895 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6432 -0.5172  0.0023  0.4850  3.0462 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -23.06047    6.15128  -3.749  0.00018 ***
## alcohol         0.33431    0.01565  21.365  < 2e-16 ***
## density        24.95402    6.06493   4.114 3.94e-05 ***
## chlorides.log  -0.19638    0.03959  -4.961 7.25e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7941 on 4894 degrees of freedom
## Multiple R-squared:  0.1965, Adjusted R-squared:  0.196 
## F-statistic: 398.9 on 3 and 4894 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity, 
##     data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6405 -0.5241 -0.0066  0.4780  3.2115 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -33.32722    6.34530  -5.252 1.57e-07 ***
## alcohol         0.34718    0.01572  22.081  < 2e-16 ***
## density        35.74178    6.28478   5.687 1.37e-08 ***
## chlorides.log  -0.19882    0.03943  -5.042 4.78e-07 ***
## fixed.acidity  -0.08748    0.01404  -6.231 5.02e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7911 on 4893 degrees of freedom
## Multiple R-squared:  0.2028, Adjusted R-squared:  0.2022 
## F-statistic: 311.2 on 4 and 4893 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4471 -0.4952 -0.0352  0.4710  3.1943 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -48.79075    6.17952  -7.896 3.54e-15 ***
## alcohol            0.39566    0.01539  25.705  < 2e-16 ***
## density           51.64505    6.12476   8.432  < 2e-16 ***
## chlorides.log     -0.14037    0.03819  -3.675  0.00024 ***
## fixed.acidity     -0.10077    0.01357  -7.426 1.31e-13 ***
## volatile.acidity  -2.08196    0.10992 -18.940  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7636 on 4892 degrees of freedom
## Multiple R-squared:  0.2573, Adjusted R-squared:  0.2565 
## F-statistic: 338.9 on 5 and 4892 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4414 -0.4949 -0.0336  0.4705  3.1981 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -48.95916    6.20865  -7.886 3.83e-15 ***
## alcohol            0.39632    0.01557  25.458  < 2e-16 ***
## density           51.89432    6.18825   8.386  < 2e-16 ***
## chlorides.log     -0.14048    0.03820  -3.678 0.000238 ***
## fixed.acidity     -0.10264    0.01509  -6.800 1.17e-11 ***
## volatile.acidity  -2.08414    0.11021 -18.911  < 2e-16 ***
## pH                -0.02296    0.08106  -0.283 0.777041    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7637 on 4891 degrees of freedom
## Multiple R-squared:  0.2573, Adjusted R-squared:  0.2564 
## F-statistic: 282.4 on 6 and 4891 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4342 -0.5004 -0.0296  0.4684  3.2203 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4.426e+01  6.490e+00  -6.821 1.02e-11 ***
## alcohol               3.981e-01  1.558e-02  25.558  < 2e-16 ***
## density               4.706e+01  6.487e+00   7.254 4.68e-13 ***
## chlorides.log        -1.493e-01  3.834e-02  -3.893   0.0001 ***
## fixed.acidity        -1.020e-01  1.509e-02  -6.762 1.52e-11 ***
## volatile.acidity     -2.110e+00  1.106e-01 -19.069  < 2e-16 ***
## pH                   -3.459e-02  8.115e-02  -0.426   0.6699    
## total.sulfur.dioxide  7.583e-04  3.070e-04   2.470   0.0135 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7633 on 4890 degrees of freedom
## Multiple R-squared:  0.2582, Adjusted R-squared:  0.2572 
## F-statistic: 243.2 on 7 and 4890 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + residual.sugar.log, 
##     data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5375 -0.4928 -0.0503  0.4658  3.1049 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.956e+01  1.126e+01   3.512 0.000448 ***
## alcohol               3.048e-01  1.856e-02  16.418  < 2e-16 ***
## density              -3.806e+01  1.138e+01  -3.345 0.000830 ***
## chlorides.log        -9.789e-02  3.845e-02  -2.546 0.010927 *  
## fixed.acidity        -2.789e-02  1.705e-02  -1.635 0.102026    
## volatile.acidity     -2.112e+00  1.097e-01 -19.251  < 2e-16 ***
## pH                    3.214e-01  8.955e-02   3.589 0.000335 ***
## total.sulfur.dioxide  4.676e-04  3.061e-04   1.527 0.126768    
## residual.sugar.log    2.198e-01  2.424e-02   9.068  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7571 on 4889 degrees of freedom
## Multiple R-squared:  0.2705, Adjusted R-squared:  0.2693 
## F-statistic: 226.6 on 8 and 4889 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + residual.sugar.log + 
##     citric.acid, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5514 -0.4930 -0.0481  0.4668  3.1010 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.049e+01  1.132e+01   3.576 0.000352 ***
## alcohol               3.029e-01  1.870e-02  16.199  < 2e-16 ***
## density              -3.902e+01  1.144e+01  -3.410 0.000654 ***
## chlorides.log        -1.002e-01  3.855e-02  -2.598 0.009394 ** 
## fixed.acidity        -2.992e-02  1.724e-02  -1.736 0.082678 .  
## volatile.acidity     -2.096e+00  1.115e-01 -18.798  < 2e-16 ***
## pH                    3.284e-01  8.997e-02   3.650 0.000265 ***
## total.sulfur.dioxide  4.449e-04  3.074e-04   1.447 0.147887    
## residual.sugar.log    2.213e-01  2.431e-02   9.103  < 2e-16 ***
## citric.acid           7.784e-02  9.615e-02   0.810 0.418248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7571 on 4888 degrees of freedom
## Multiple R-squared:  0.2706, Adjusted R-squared:  0.2692 
## F-statistic: 201.5 on 9 and 4888 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + residual.sugar.log + 
##     citric.acid + free.sulfur.dioxide, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8335 -0.4928 -0.0372  0.4602  3.1008 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.573e+01  1.133e+01   3.153 0.001625 ** 
## alcohol               3.052e-01  1.866e-02  16.359  < 2e-16 ***
## density              -3.435e+01  1.145e+01  -3.000 0.002715 ** 
## chlorides.log        -9.683e-02  3.846e-02  -2.518 0.011850 *  
## fixed.acidity        -2.145e-02  1.727e-02  -1.242 0.214319    
## volatile.acidity     -1.979e+00  1.136e-01 -17.422  < 2e-16 ***
## pH                    3.403e-01  8.977e-02   3.791 0.000152 ***
## total.sulfur.dioxide -6.650e-04  3.756e-04  -1.771 0.076669 .  
## residual.sugar.log    2.063e-01  2.443e-02   8.446  < 2e-16 ***
## citric.acid           5.901e-02  9.597e-02   0.615 0.538683    
## free.sulfur.dioxide   4.304e-03  8.407e-04   5.119 3.19e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7551 on 4887 degrees of freedom
## Multiple R-squared:  0.2745, Adjusted R-squared:  0.273 
## F-statistic: 184.9 on 10 and 4887 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = quality ~ alcohol + density + chlorides.log + fixed.acidity + 
##     volatile.acidity + pH + total.sulfur.dioxide + residual.sugar.log + 
##     citric.acid + free.sulfur.dioxide + sulphates, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8524 -0.4881 -0.0356  0.4611  3.1109 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.578e+01  1.147e+01   3.991 6.68e-05 ***
## alcohol               2.905e-01  1.883e-02  15.426  < 2e-16 ***
## density              -4.452e+01  1.159e+01  -3.841 0.000124 ***
## chlorides.log        -9.833e-02  3.836e-02  -2.563 0.010400 *  
## fixed.acidity        -1.502e-02  1.727e-02  -0.870 0.384525    
## volatile.acidity     -1.950e+00  1.134e-01 -17.195  < 2e-16 ***
## pH                    3.132e-01  8.969e-02   3.491 0.000485 ***
## total.sulfur.dioxide -8.828e-04  3.770e-04  -2.342 0.019236 *  
## residual.sugar.log    2.280e-01  2.473e-02   9.221  < 2e-16 ***
## citric.acid           3.870e-02  9.581e-02   0.404 0.686319    
## free.sulfur.dioxide   4.380e-03  8.387e-04   5.222 1.84e-07 ***
## sulphates             5.048e-01  9.847e-02   5.126 3.07e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7532 on 4886 degrees of freedom
## Multiple R-squared:  0.2784, Adjusted R-squared:  0.2767 
## F-statistic: 171.3 on 11 and 4886 DF,  p-value: < 2.2e-16

I fitted several linear models to the dataset to predict white wine quality from the other features. The models include features that correlate with quality as well as features that correlate with one another.
As I added features, the fit improved steadily but only slightly: across the last three models above, R^2 rose from 0.2706 to 0.2745 to 0.2784. Even the largest model explains less than 28% of the variance in quality, so none of these linear models characterizes the data well.
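The stepwise comparison described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the data frame `wine_sim` and its values are simulated stand-ins (not the real wine data), and only three of the predictors are used. It shows how nested `lm` fits can be compared by adjusted R^2 and by an F-test via `anova()`.

```r
# Simulated stand-in for the wine data (assumption: column names mirror the
# real data frame, but the values below are random and purely illustrative).
set.seed(42)
n <- 500
wine_sim <- data.frame(
  alcohol          = rnorm(n, mean = 10.5, sd = 1.2),
  volatile.acidity = rnorm(n, mean = 0.28, sd = 0.10),
  sulphates        = rnorm(n, mean = 0.49, sd = 0.11)
)
wine_sim$quality <- 2.5 + 0.3 * wine_sim$alcohol -
  2 * wine_sim$volatile.acidity + 0.5 * wine_sim$sulphates +
  rnorm(n, sd = 0.75)

# Build nested models by adding one predictor at a time.
m1 <- lm(quality ~ alcohol, data = wine_sim)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + sulphates)

# Collect R^2 and adjusted R^2 for each model.
models <- list(m1 = m1, m2 = m2, m3 = m3)
r2     <- sapply(models, function(m) summary(m)$r.squared)
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(rbind(r2, adj_r2), 4))

# F-tests: does each added term significantly reduce the residual variance?
print(anova(m1, m2, m3))
```

Plain R^2 can never decrease when a predictor is added to a nested least-squares model, which is why adjusted R^2 and the F-test are the more informative checks here.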
To improve the model, I may need additional data and newly engineered features. There is likely a better method than linear regression for this prediction task. Another approach would be to fit a separate linear model to different portions of the quality range.
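One candidate alternative, which this analysis did not try, is to treat quality as an ordered category rather than a number. The sketch below fits a proportional-odds (ordinal logistic) model with `polr()` from the MASS package. Everything about the data is an assumption: the data frame `sim` is simulated, only two predictors are included, and quality is collapsed to three levels for brevity.

```r
library(MASS)  # provides polr(); MASS is a recommended package shipped with R

# Simulated stand-in data (assumption: values are random, not the wine data).
set.seed(1)
n <- 600
alcohol          <- rnorm(n, mean = 10.5, sd = 1.2)
volatile.acidity <- rnorm(n, mean = 0.28, sd = 0.10)

# A latent "true quality" score with logistic noise, cut into ordered levels.
latent  <- 0.8 * (alcohol - 10.5) - 6 * (volatile.acidity - 0.28) + rlogis(n)
quality <- cut(latent, breaks = c(-Inf, -1.5, 1.5, Inf),
               labels = c("5", "6", "7"), ordered_result = TRUE)
sim <- data.frame(quality, alcohol, volatile.acidity)

# Proportional-odds model; Hess = TRUE keeps the Hessian for standard errors.
fit <- polr(quality ~ alcohol + volatile.acidity, data = sim, Hess = TRUE)
print(summary(fit))

# In-sample share of wines assigned to their observed quality level.
pred <- predict(fit, newdata = sim, type = "class")
print(mean(pred == sim$quality))
```

Whether such a model would actually outperform the linear fits on the real data is an open question; the sketch only shows the mechanics.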
The most common quality rating in the dataset is 6. Quality is approximately normally distributed, with most wines rated between 5 and 7.
The scatterplot above shows the relationship between alcohol and chlorides (log-transformed) with respect to white wine quality: quality tends to increase as the alcohol percentage rises and, on the other hand, as the log-transformed chloride values fall.
This combined violin plot and boxplot depicts the positive relationship between alcohol and quality (Pearson correlation of 0.44).
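A Pearson coefficient like the one quoted above can be computed, together with a confidence interval, using `cor.test()` in base R. The sketch below uses simulated `alcohol` and `quality` vectors as an assumption, so its output will not match the 0.44 reported for the real data.

```r
# Simulated stand-in vectors (assumption: random values, not the wine data).
set.seed(7)
n <- 300
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
quality <- round(5.9 + 0.33 * (alcohol - 10.5) + rnorm(n, sd = 0.8))

# Pearson correlation with a 95% confidence interval and a significance test.
ct <- cor.test(alcohol, quality, method = "pearson")
print(ct$estimate)   # the correlation coefficient
print(ct$conf.int)   # 95% confidence interval for the coefficient
```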
From this exploratory analysis, we observed that good wine tends to have a higher alcohol level and lower density, chlorides, volatile acidity, fixed acidity, and total sulfur dioxide. Because density increases as sugar increases, sugar might be detrimental to the flavor of wine, while alcohol appears to be beneficial. This analysis is based on correlation, so it does not imply causation between the dataset's features.
One limitation of the analysis is that the dataset covers white wines from a specific region, so the relationships between the variables might not hold for other types of wine. Additionally, quality ratings might vary from place to place. To study the data further, we could collect data for white wines from other regions around the world.
References: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt