Abstract
In this project, I apply simple linear regression (SLR) to real-world data using R. The data I will analyze come from Mendenhall and Sincich's textbook, Statistics for Engineering and the Sciences, 6th edition, and concern the relationship between the level of pectin in oranges and the sweetness index of the juice produced. Pectin is a naturally occurring substance found in the cell walls of many citrus fruits, including oranges. The pectin in an orange lies inside the white membrane that surrounds the fruit. When making orange juice, manufacturers must take into account numerous sensory and chemical components that combine to make the best-tasting orange juice (Mendenhall 2016). Knowing how sweet the juice is without needing to taste the contents of a bottle or carton can be very valuable. To look for a relationship, I will build a model from the collected data and use statistical measures to assess the model's accuracy. If the theory holds, then we know it is possible to accurately determine how sweet a batch of orange juice is from the level of pectin it contains.
As a consumer, I always like to know a little about the processes and work that go into making the products on supermarket shelves. The orange juice data represents a chance to delve into the orange juice world a bit and learn how manufacturers use statistics to increase the quality of something I consume regularly.
The data consist of results from 24 production runs in which a manufacturer recorded two variables: Pectin (parts per million) and SweetIndex, a quantitative measure of sweetness.
options(warn = -1)            # suppress warning messages in the output
library(DT)                   # for the interactive datatable() display
OJ = read.csv("OJUICE.csv")   # data from the 24 production runs
datatable(OJ)
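To confirm the data loaded as expected, a quick structural check can be run; this is a minimal sketch, assuming OJUICE.csv contains exactly the two columns named above:

str(OJ)   # expect 24 observations of the two variables Pectin and SweetIndex
dim(OJ)   # expect 24 rows and 2 columns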
The question I hope to answer is whether SLR can give juice manufacturers an accurate way of determining how sweet their orange juice will be from the level of water-soluble pectin found in the juice. If the model is accurate, it offers manufacturers a way of checking that the juice they ship has the correct SweetIndex without logistical issues getting in the way, such as pulling sample bottles that may be outliers and poor approximations of the entire population. The pectin can be measured before the juice even gets bottled.
library(ggplot2)
g = ggplot(OJ, aes(x = Pectin, y = SweetIndex)) + geom_point()
g = g + geom_smooth(method = "loess")   # add a loess smoother with its error band
g
In a first glimpse of the data, we see that the data could be problematic: many points fall outside the shaded error band. While we can make out a bit of a trend, we will have to verify the assumptions before progressing with an SLR model.
I am hoping to build models that capture the trend we can roughly make out in the preliminary plot. Based on this plot, my hypothesis is that as the level of Pectin (the x value) increases, the SweetIndex (the y value) of the orange juice will decrease. I will model the observed trend. The SLR model will be represented by a "best fit" line running through the data points, with Pectin as the independent variable and SweetIndex as the dependent variable. Below is the mathematical representation of the model.
\[ y_i = \beta_0 + \beta_1x_i + \varepsilon_i \]
In this equation, \(\beta_0\) and \(\beta_1\) are both unknown parameters. The mean value of y is represented by the \(\beta_0 + \beta_1x_i\) part of the equation, while \(\varepsilon_i\) represents the random error. This error term addresses the fact that our data points will not all lie directly on the line representing the model.
The line observed in the preliminary plot is by no means straight, because the individual data points do not lie in a way that would represent a straight line when connected. Therefore, we know there will be errors when fitting the line for our SLR model. However, as stated before, we can make a "best fit" line that represents our model while also minimizing the errors that will naturally arise when running a straight line through the data. The first assumption we need in order to perform our SLR analysis is that \(E(\varepsilon) = 0\). If the fitted line runs straight through the middle of the data, this expression holds, as roughly half of the errors will lie above the line and the other half below it, resulting in a mean error of 0. From that, we are then able to say the following:
\[E(y_i) = \beta_0 + \beta_1x_i + E(\varepsilon_i)\] \[E(y_i) = \beta_0 + \beta_1x_i\]
The SLR model is probabilistic, meaning we will be able to derive predicted SweetIndex (y) values from it, say a SweetIndex at a Pectin level x that is not actually given to us by a data point. The equation above, evaluated over all possible values of x, traces out our "best fit" line.
Since we want to use SLR models, there are assumptions we need to make so that we can estimate our unknowns, \(\beta_0\) and \(\beta_1\). These assumptions concern the probability distribution of \(\varepsilon\) and are as follows:
1.) The mean of the probability distribution of \(\varepsilon\) is 0.
2.) The variance of the probability distribution of \(\varepsilon\) is constant, say \(\sigma^2\), for all values of x.
3.) The probability distribution of \(\varepsilon\) is normal.
4.) The errors associated with any two different observations are independent.
To make the line of "best fit" for our models that I've been describing, we will use the method of least squares. This method gives us the estimates of the unknown \(\beta_0\) and \(\beta_1\) values needed to complete the equation of our line while minimizing the residual sum of squares (RSS). Now I will explain how this works mathematically:
The RSS is given by \(\sum_{i=1}^n{r_i^2} = \sum_{i=1}^n{(y_i-\hat{y_i})^2}\), and from what was previously mentioned, we know we can use the following estimator for y: \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}x_i\). We therefore minimize \(\sum_{i=1}^n{[y_i-(\hat{\beta_0} + \hat{\beta_1}x_i)]^2}\) over \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
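Before letting lm() do the work in the next chunk, we can compute the least squares estimates directly from their closed-form solutions, \(\hat{\beta_1} = S_{xy}/S_{xx}\) and \(\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\). This is a sketch for illustration; the names beta0.hat and beta1.hat are my own:

x = OJ$Pectin
y = OJ$SweetIndex
beta1.hat = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate
beta0.hat = mean(y) - beta1.hat * mean(x)                              # intercept estimate
c(beta0.hat, beta1.hat)  # should agree with the coefficients from lm() below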
OJ.lm = with(OJ, lm(SweetIndex~Pectin))
summary(OJ.lm)
##
## Call:
## lm(formula = SweetIndex ~ Pectin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54373 -0.11039 0.06089 0.13432 0.34638
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.2520679 0.2366220 26.422 <2e-16 ***
## Pectin -0.0023106 0.0009049 -2.554 0.0181 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.215 on 22 degrees of freedom
## Multiple R-squared: 0.2286, Adjusted R-squared: 0.1936
## F-statistic: 6.52 on 1 and 22 DF, p-value: 0.01811
In the summary we can see that: \(\hat{\beta_0}\) = 6.2520 and \(\hat{\beta_1}\) = -0.0023
We now have the estimates we needed to complete the equation of our SLR line. To expand a bit on what these values represent: \(\hat{\beta_0}\) tells us the y-intercept of our line is 6.2520, and \(\hat{\beta_1}\) tells us that for each additional part per million of Pectin in the juice, the SweetIndex decreases by 0.0023.
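The same estimates can be pulled out programmatically, which is handy for interpretation. As a quick sketch (b is a name of my own choosing), the estimated change in SweetIndex over a 100 ppm increase in Pectin:

b = coef(OJ.lm)     # named vector: (Intercept), Pectin
b["Pectin"] * 100   # estimated change in SweetIndex per 100 ppm increase in Pectin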
library(s20x)
ciReg(OJ.lm, conf.level = 0.95)
## 95 % C.I.lower 95 % C.I.upper
## (Intercept) 5.76134 6.74279
## Pectin -0.00419 -0.00043
The above output shows the 95% confidence intervals for the intercept and slope. Using our beta estimates, we can now write the equation for the linear model as:
\[ \hat{y} = 6.2520 - 0.0023x \]
We will now look at some figures to visualize the linear model created from our data together with the "best fit" line we gained from the least squares method. We will also run tests to check that our model is valid and agrees with the assumptions for SLR.
plot(SweetIndex~Pectin, bg = "blue", pch = 21, cex = 1.2, ylim = c(4.5,1.1*max(SweetIndex)), xlim=c(150,1.1*max(Pectin)), main = "Fitted Line for SweetIndex vs Pectin", data = OJ)
abline(OJ.lm)
The above line comes from our linear model for the data. As we can see, the line cuts straight through the middle of the data points, which tells us the least squares method did pretty well for us. However, most of the data points still sit fairly far from the line, meaning fairly large residuals.
plot(SweetIndex~Pectin, bg = "blue", pch = 21, cex = 1.2, ylim = c(4.5,1.1*max(SweetIndex)), xlim=c(150,1.1*max(Pectin)), main = "Residuals for SweetIndex vs Pectin", data = OJ)
abline(OJ.lm)
yhat = with(OJ, predict(OJ.lm, data.frame(Pectin)))
with(OJ,{segments(Pectin,SweetIndex,Pectin,yhat)})
abline(OJ.lm)
This lets us see which x values along the linear model line have large residuals, and from this we can spot points that don't follow the same trend as the others. The plot also shows visually how well our linear model predicts the observed y value at each x. The residual segments drawn here correspond to the RSS, which we will use later in our analysis.
plot(SweetIndex~Pectin, bg = "blue", pch = 21, cex = 1.2, ylim = c(4.5,1.1*max(SweetIndex)), xlim=c(150,1.1*max(Pectin)), main = "Mean and Residuas for SweetIndex vs Pectin", data = OJ)
abline(OJ.lm)
yhat = with(OJ, predict(OJ.lm, data.frame(Pectin)))
with(OJ,{segments(Pectin,SweetIndex,Pectin,yhat)})
abline(OJ.lm)
with(OJ, abline(h=mean(SweetIndex)))
abline(OJ.lm)
with(OJ, segments(Pectin,mean(SweetIndex),Pectin,yhat,col = "Red"))
This plot corresponds to the MSS, which again will be used in later analysis. The red segments show how the fitted value at each x deviates from the mean level of SweetIndex. We can use this to measure how the model performs in comparison to simply using the mean.
plot(SweetIndex~Pectin,bg="Blue",pch=21,cex=1.2,
ylim=c(4.5,1.1*max(SweetIndex)),xlim=c(150,1.1*max(Pectin)),
main="Deviation From the Mean for SweetIndex vs Pectin", data=OJ)
with(OJ,abline(h=mean(SweetIndex)))
with(OJ, segments(Pectin,SweetIndex,Pectin,mean(SweetIndex),col="Green"))
TSS represents the sum of the squares of the differences between the dependent variable and its mean. The green line segments represent these differences, and plotting them all gives us the final value we wanted to find, the TSS. Now we will analyze the values we calculated in more depth.
TSS = sum( (OJ$SweetIndex - mean(OJ$SweetIndex) )^2 )
TSS
## [1] 1.318333
MSS = sum( (yhat - mean(OJ$SweetIndex) )^2 )
MSS
## [1] 0.3014019
RSS = sum( (OJ$SweetIndex - yhat)^2 )
RSS
## [1] 1.016931
MSS/TSS
## [1] 0.2286234
We know that \(R^2\) is equal to \(\frac{MSS}{TSS}\). So our \(R^2\) value is 0.2286, which matches the \(R^2\) value given by the model summary earlier. This value is actually relatively low; we are looking for a number closer to 1, which would represent a perfectly fitted trend line. This tells me the linear model may not be a very good fit for the data.
trendscatter(SweetIndex~Pectin, f = 0.5, data = OJ, main = "SweetIndex Vs Pectin")
This trendscatter plot draws a smoothed line of best fit surrounded by an error band. The data do not appear to support a straight line that could accurately make predictions for the future.
SweetIndex.res = residuals(OJ.lm)
SweetIndex.fit = fitted(OJ.lm)
Above we set up the residual and fitted values for plotting.
plot(OJ$Pectin, SweetIndex.res, xlab="Pectin", ylab="Residuals", ylim=c(-2*max(SweetIndex.res),2*max(SweetIndex.res)), xlim=c(200,1.6*200), main="Residuals vs Pectin")
The residuals are fairly symmetrical about the x-axis in terms of the number of points above and below zero; however, the points below zero tend to lie farther from zero than those above it.
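To put numbers on that visual impression, a five-number summary of the residuals (using the SweetIndex.res object created above) shows the lower tail stretching farther from zero than the upper tail:

summary(SweetIndex.res)  # compare the magnitudes of Min. and Max.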
trendscatter(SweetIndex.res~SweetIndex.fit, f = 0.5, data = OJ.lm, xlab="Fitted Values",ylab="Residuals",ylim=c(-1.1*max(SweetIndex.res),1.1*max(SweetIndex.res)),xlim=c(5.3,1*max(SweetIndex.fit)), main="Residuals Vs Fitted Values")
Generally, we would look for a flat line running through the data with an even number of points above and below it. Here we see that is not the case, which raises a bit of a red flag about the linear model's effectiveness due to what looks like non-constant variance.
normcheck(OJ.lm, shapiro.wilk = TRUE)
After running a Shapiro-Wilk test on OJ.lm, we can see that the histogram on the right is skewed to the left. The residuals for SweetIndex vs Pectin return a p-value of 0.455, which is greater than 0.05, so we have no evidence against the claim that the residuals are normally distributed. We may therefore continue to assume the data are normally distributed and that the null hypothesis we need when performing SLR, \(\varepsilon \sim N(0,\sigma^2)\), is something we cannot reject based on this Shapiro-Wilk test. However, after looking at the \(R^2\) value and the Residuals vs Fitted plot, I am beginning to feel that the current model is not sufficient for making accurate predictions. Because of this, we will attempt to model the data another way.
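The Shapiro-Wilk p-value reported by normcheck() can also be obtained directly from base R, which serves as a useful cross-check:

shapiro.test(residuals(OJ.lm))  # p-value should match the 0.455 reported above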
We will now try to implement a quadratic model, which we hope will more accurately capture the trend of the data. Adding more terms to the equation is a method often used to try to build a better model. The equation for the quadratic will be the following:
\(y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + \varepsilon_i\)
quad.lm = lm(SweetIndex~Pectin + I(Pectin^2),data = OJ)
plot(SweetIndex~Pectin, bg = "Blue", pch = 21, cex = 1.2, ylim = c(4.5, 1.1*max(SweetIndex)), xlim = c(150,1.1*max(Pectin)), main = "SCatter Plot and Quadratic of SweetIndex Vs Pectin", data = OJ)
myplot = function(x){quad.lm$coef[1] + quad.lm$coef[2]*x + quad.lm$coef[3]*x^2}
curve(myplot, lwd = 2, add = TRUE)
The quadratic has some definite curvature to it. This means it may be a better fit than the linear model, since the data itself actually pulled the quadratic model into a curve.
SweetIndex.resQ = residuals(quad.lm)
SweetIndex.fitQ = fitted(quad.lm)
Above we create the new Residuals and Fitted values for the Quadratic model.
trendscatter(SweetIndex.resQ~SweetIndex.fitQ, f = 0.5, data = quad.lm, xlab="Fitted Values",ylab="Residuals",ylim=c(-1.1*max(SweetIndex.resQ),1.1*max(SweetIndex.resQ)),xlim=c(5.35,1*max(SweetIndex.fitQ)), main="Residuals Vs Fitted Values")
This new residuals-vs-fitted plot seems to do much better than the linear model's: the trend line is much closer to running flat along the x-axis, with a good distribution of residuals above and below it. I believe that with the addition of more data, this quadratic could become a very useful model. On the left, however, the lack of data gives the trendscatter plot a wide range of possible error, shown by the lines traced in red.
normcheck(quad.lm, shapiro.wilk = TRUE)
We can see that our p-value went up compared to the linear model's. The result indicates there is no reason to reject the null hypothesis, so we can continue to assume that \(\varepsilon \sim N(0,\sigma^2)\) holds true.
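For a direct side-by-side look at the two p-values, here is a quick sketch using base R's shapiro.test rather than normcheck:

shapiro.test(residuals(OJ.lm))$p.value    # linear model
shapiro.test(residuals(quad.lm))$p.value  # quadratic model: expect the larger value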
summary(quad.lm)
##
## Call:
## lm(formula = SweetIndex ~ Pectin + I(Pectin^2), data = OJ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56616 -0.10105 0.03277 0.11135 0.36976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.058e+00 1.337e+00 5.278 3.12e-05 ***
## Pectin -7.898e-03 9.168e-03 -0.861 0.399
## I(Pectin^2) 9.212e-06 1.504e-05 0.613 0.547
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2181 on 21 degrees of freedom
## Multiple R-squared: 0.2422, Adjusted R-squared: 0.17
## F-statistic: 3.355 on 2 and 21 DF, p-value: 0.05439
The summary reports the following estimates: \(\hat{\beta_0} = 7.058\), \(\hat{\beta_1} = -7.898\times 10^{-3}\), and \(\hat{\beta_2} = 9.212\times 10^{-6}\).
ciReg(quad.lm, conf.level = 0.95)
## 95 % C.I.lower 95 % C.I.upper
## (Intercept) 4.27692 9.83871
## Pectin -0.02696 0.01117
## I(Pectin^2) -0.00002 0.00004
Above we can see the 95% confidence intervals for the quadratic model's coefficients, and from the summary we can write the fitted equation for quad.lm as:
\[ \hat{y} = 7.058 - (7.898\times 10^{-3})x + (9.212\times 10^{-6})x^2 \]
Now we will compare the models by the predictions they make, by their \(R^2\) values, and with the plots we made for each throughout the project.

Comparing Predictions of Both Models
Linear Model Predictions:
ResultsL = predict(OJ.lm, data.frame(Pectin = c(100,150,200,250,300)))
ResultsL
## 1 2 3 4 5
## 6.021005 5.905474 5.789943 5.674411 5.558880
Quadratic Model Predictions:
ResultsQ = predict(quad.lm, data.frame(Pectin = c(100,150,200,250,300)))
ResultsQ
## 1 2 3 4 5
## 6.360152 6.080414 5.846737 5.659123 5.517569
In our prediction results, the quadratic model picked up on a key feature of the trend: once Pectin falls below a certain level (around 300 ppm), further reductions in Pectin produce a noticeably steeper increase in predicted SweetIndex than the linear model suggests.
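The gap between the two sets of predictions can be seen directly by differencing them; the divergence is largest at low Pectin levels:

ResultsQ - ResultsL  # quadratic minus linear prediction at Pectin = 100, 150, ..., 300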
summary(OJ.lm)$r.squared
## [1] 0.2286234
summary(OJ.lm)$adj.r.squared
## [1] 0.1935609
summary(quad.lm)$r.squared
## [1] 0.242162
summary(quad.lm)$adj.r.squared
## [1] 0.169987
Above, we can see both the \(R^2\) and adjusted \(R^2\) values for both models. \(R^2\) is a number used to describe how well a particular model fits a set of data. Neither model returned a very high \(R^2\), meaning neither represents the data very well. We can see that the quadratic model actually had a higher \(R^2\) than the linear model, which might lead one to believe the quadratic is an improvement. However, that is exactly why we have adjusted \(R^2\).
Adding irrelevant predictor variables to the regression equation often increases \(R^2\). To compensate for this, one can define an adjusted coefficient of determination, adjusted \(R^2\) (see Sheather 2009). Looking at our results, the adjusted \(R^2\) actually decreased from the linear to the quadratic model, meaning the predictor variable we added was an irrelevant variable that artificially raised the quadratic model's \(R^2\). From this analysis of \(R^2\) and adjusted \(R^2\), we can conclude that while neither model gives very reliable predictions, the linear model has a slight edge over the quadratic due to its higher adjusted \(R^2\).
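Adjusted \(R^2\) can be reproduced from its definition, \(R^2_{adj} = 1 - (1-R^2)\frac{n-1}{n-p-1}\), where n is the number of observations and p the number of predictors. This sketch verifies the summary values above:

n = nrow(OJ)                              # 24 observations
r2.lin = summary(OJ.lm)$r.squared
1 - (1 - r2.lin) * (n - 1)/(n - 1 - 1)    # p = 1: should reproduce 0.1936
r2.quad = summary(quad.lm)$r.squared
1 - (1 - r2.quad) * (n - 1)/(n - 2 - 1)   # p = 2: should reproduce 0.1700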
We can also check our conclusions against R's own comparison of the two models using anova().
anova(OJ.lm,quad.lm)
As we can see, the RSS for the linear model is a bit higher than that of the quadratic model, but only by a very small amount.
cooks20x(OJ.lm)
As we can see, the Cook's plot assigns a Cook's distance to each data point in the dataset, showing how much each specific point would affect the regression analysis if it were removed. The greater the Cook's distance, the larger the point's effect on the regression analysis. Removing points with a high Cook's distance may result in a better model, since fewer outliers can distort the fit. This is made clear in Chatterjee (2006), where Cook's distance is denoted \(C_i\): "On the other hand, if there are data points with Ci values that stand out from the rest, these points should be flagged and examined. The model may then be refitted without the offending points to see the effect of these points". The Cook's distance plot for OJ.lm highlights the 1st data point in this case.
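The same diagnostic is available in base R via cooks.distance(); a quick sketch to confirm which observation the plot is flagging:

cd = cooks.distance(OJ.lm)
which.max(cd)  # index of the most influential point; the plot above flags point 1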
OJcooked = OJ[-1,]
OJcooked.lm = lm(SweetIndex~Pectin, OJcooked)
summary(OJ.lm)$adj.r.squared
## [1] 0.1935609
summary(OJcooked.lm)$adj.r.squared
## [1] 0.3323824
After removing the problematic data point from the first row of the data, we see that our adjusted \(R^2\) value increased greatly (from 0.194 to 0.332). The point was evidently keeping our model from capturing the trend well.
quadcooked.lm = lm(SweetIndex~Pectin + I(Pectin^2), OJcooked)
summary(quadcooked.lm)$adj.r.squared
## [1] 0.3542288
summary(quadcooked.lm)$r.squared
## [1] 0.4129352
Now, building the quadratic model from the data with the high-Cook's-distance point removed, we obtain higher values of both \(R^2\) and adjusted \(R^2\) for the quadratic model. This tells us that once the outlier was removed from the data set, the quadratic model is the better of the two we built, with an \(R^2\) of 0.413 and an adjusted \(R^2\) of 0.354. The Cook's plot helped us greatly in making the models fit the data set better.
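As a compact recap of the model comparison, here is a sketch collecting the four model objects built above:

models = list(linear = OJ.lm, quad = quad.lm, linear.cooked = OJcooked.lm, quad.cooked = quadcooked.lm)
sapply(models, function(m) summary(m)$adj.r.squared)  # adjusted R^2 for each model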
In conclusion, the quadratic model is a better fit for the data with the outlier removed, while the linear model is a slightly better fit for the raw data. However, neither model is a significant predictor of SweetIndex given Pectin, since the highest \(R^2\) value achieved was 0.413 when we were looking for a value closer to 1.
Based on the results of the SLR models made, we are not able to accurately predict levels of SweetIndex using Pectin. I do believe, however, that with additional data a reliable SLR model could be built and used by juice manufacturers to get a good idea of how sweet the juice reaching their customers is. With a reliable SLR model and the ability to monitor pectin levels, manufacturers may even be able to come up with ways to reduce or increase pectin in order to get the perfect amount of sweetness in their juice.
There are still more ways to improve our models and the experiment. On the modeling side, removing a few more outlier-type data points might help, as might adding more predictor variables in a more complex model. I believe the greatest trouble in building a good model for this data set was the relatively small number of data points; typically we would shoot for a data set of at least size 30, and the more the better.
Chatterjee, Samprit. 2006. Regression Analysis by Example. 4th ed. Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley-Interscience.
Mendenhall, William M., and Terry L. Sincich. 2016. Statistics for Engineering and the Sciences. 6th ed.
Sheather, Simon. 2009. A Modern Approach to Regression with R. Springer Texts in Statistics. New York, NY: Springer.