# What is an influential point?

In the formulas for these measures, \(r_i\) is the \(i\)th residual, \(p\) is the number of coefficients in the regression model, MSE is the mean squared error, and \(h_{ii}\) is the \(i\)th leverage value.

For this dataset, y = infection risk and x = average length of patient stay for n = 112 hospitals in the United States. One of the fitted models in this section is summarized by a coefficient of determination of \(R^2 = 0.94\) and the regression equation \(\hat{y} = 97.51 - 3.32x\).

And why do we care about the hat matrix? Its diagonal entries are the leverages \(h_{ii}\) that appear throughout these diagnostics.

The good thing about internally studentized residuals is that they quantify how large the residuals are in standard deviation units, and therefore can be easily used to identify outliers. (Recall from the previous section that some use the term "outlier" for an observation with an internally studentized residual that is larger than 3 in absolute value.) Minitab may be a little conservative, but perhaps it is better to be safe than sorry. For example, with a residual of 0.6, MSE = 0.4, and a leverage of 0.3, the second internally studentized residual is obtained by:

$$r_{2}=\dfrac{0.6}{\sqrt{0.4(1-0.3)}}=1.13389$$

Let's return to Example #2 (Influence2 data set). Regressing y on x and requesting the studentized deleted residuals gives the Minitab output; to save space, only the output for the first three and last three observations is shown. Yet, here, the difference in fits measure suggests that the red data point is indeed influential.

Or are there any high leverage data points? In this case, there are n = 21 data points and p = 2 parameters (the intercept \(\beta_{0}\) and slope \(\beta_{1}\)). The leverage of the data point, 0.358 (obtained in Minitab), is greater than 0.286, which is \(3p/n = 3(2)/21\).
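
The two calculations quoted above can be checked with a few lines of Python. This is a minimal sketch: the function names are made up for illustration, and the leverage cutoff of 3p/n is an assumption inferred from the quoted numbers (3 × 2 / 21 ≈ 0.286), not something the text states as a formula.

```python
import math

def internally_studentized(residual, mse, leverage):
    """Internally studentized residual: e_i / sqrt(MSE * (1 - h_ii))."""
    return residual / math.sqrt(mse * (1.0 - leverage))

def leverage_cutoff(p, n):
    """Assumed guideline: flag a leverage that exceeds 3 * p / n."""
    return 3.0 * p / n

# Values quoted in the text: residual 0.6, MSE 0.4, leverage 0.3.
r2 = internally_studentized(residual=0.6, mse=0.4, leverage=0.3)

# Values quoted in the text: n = 21 data points, p = 2 parameters.
cutoff = leverage_cutoff(p=2, n=21)

print(round(r2, 5))      # 1.13389
print(round(cutoff, 3))  # 0.286
print(0.358 > cutoff)    # True: the quoted leverage 0.358 is flagged
```
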
In this lesson, we describe how to identify those influential points. A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results. Put another way, an influential observation is one that significantly affects the slope and/or y-intercept of the least squares regression line, or the value of the correlation coefficient. The tools are leverages, standardized residuals, studentized residuals, DFFITS, and Cook's distances. Practice thinking about how influential points can impact a least-squares regression line and what makes a point "influential."

Based on the definitions above, do you think the following influence1 data set contains any outliers?

The solid line represents the estimated regression equation with the red data point included, while the dashed line represents the estimated regression equation with the red data point excluded. Observe that, as expected, the red data point "pulls" the estimated regression line towards it. However, this point does not have an extreme x value, so it does not have high leverage. Still, the Cook's distance measure for the red data point is less than 0.5. It is important to keep in mind that this is not a hard-and-fast rule, but rather a guideline only!

Each predicted response is a weighted sum of all the observed responses, with weights taken from the hat matrix. For the second observation:

$$\hat{y}_2=h_{21}y_1+h_{22}y_2+\cdots+h_{2n}y_n$$

Here, n = 4 and p = 2.

Again, there are n = 21 data points and p = 2 parameters (the intercept \(\beta_{0}\) and slope \(\beta_{1}\)). Is the x value extreme enough to warrant flagging it?

Another data point (\(x \approx 1500\)) shows an example of a flight that is an outlier from the line, in the sense that it has an unusually large (positive) residual, but is not an influential point.

One last example!
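
The weighted-sum form of the second fitted value can be made concrete by building the hat matrix directly. The sketch below uses NumPy and a hypothetical data set with n = 4 and p = 2; the (x, y) values are invented for illustration and are not the ones from the lesson.

```python
import numpy as np

# Hypothetical (x, y) data with n = 4; NOT the values from the lesson.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.1, 3.9, 6.2, 19.5])

# Design matrix with an intercept column, so p = 2.
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X'X)^{-1} X'. It maps observed responses to fitted values.
H = X @ np.linalg.inv(X.T @ X) @ X.T
fitted = H @ y

# The leverages h_ii are the diagonal entries of H.
leverages = np.diag(H)

# Second fitted value as h_21*y_1 + h_22*y_2 + ... + h_2n*y_n.
y_hat_2 = H[1] @ y

print("leverages:", np.round(leverages, 3))
print("sum of leverages (equals p):", round(float(leverages.sum()), 6))
print("second fitted value:", round(float(y_hat_2), 4))
```

The leverages always sum to p, which is why cutoffs for "large" leverage are usually written as a multiple of p/n.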
An influential point may represent bad data, possibly the result of measurement error. If a data point is in error, or if it is not representative of the population being studied, delete it. Sometimes, outliers do not have big effects: comparing the regression statistics for a data set with and without an outlier shows how much the point matters. There are also several ways that a data point might be considered an outlier.

It's easy to illustrate how a high leverage point might not be influential in the case of a simple linear model: the blue line is a regression line based on all the data, and the red line ignores the point at the top right of the plot. This point fits the definition of a high leverage point, as it is far away from the rest of the data.

How unusual an observation's predictor values are is called the observation's leverage. If an observation has a particularly unusual combination of predictor values (e.g., one predictor has a very different value for that observation compared with all the other data observations), then that observation is said to have high leverage. (Anything "in between" is more ambiguous.)

If you do reduce the scope of your model, you should be sure to report it, so that readers do not misuse your model.

As you know, ordinary residuals are defined for each observation, i = 1, ..., n, as the difference between the observed and predicted responses:

$$e_i=y_i-\hat{y}_i$$

For example, consider the following very small (contrived) data set containing n = 4 data points (x, y).
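
To see how a high leverage point can fail to be influential, one can fit a line with and without such a point and compare slopes. This is only a sketch on hypothetical data (five ordinary points plus one extreme-x point that follows the same trend); the leverage formula used is the standard one for simple linear regression, \(h_{ii} = 1/n + (x_i-\bar{x})^2/\sum_j (x_j-\bar{x})^2\).

```python
import numpy as np

# Hypothetical data: five ordinary points plus one point with an extreme
# x value (x = 20) that still follows the same trend (roughly y = 2x + 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 41.0])

# Fit with all points, then again with the extreme-x point removed.
slope_all, intercept_all = np.polyfit(x, y, 1)
slope_without, intercept_without = np.polyfit(x[:-1], y[:-1], 1)

# Ordinary residuals e_i = y_i - yhat_i from the full fit.
residuals = y - (intercept_all + slope_all * x)

# Leverages for simple linear regression: 1/n + (x_i - xbar)^2 / Sxx.
sxx = np.sum((x - x.mean()) ** 2)
leverages = 1.0 / len(x) + (x - x.mean()) ** 2 / sxx

print("slope with the point:   ", round(float(slope_all), 3))
print("slope without the point:", round(float(slope_without), 3))
print("index of largest leverage:", int(np.argmax(leverages)))
```

The extreme-x point has by far the largest leverage, yet the two slopes are nearly identical, because the point agrees with the trend set by the other observations.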
Do you think the following influence4 data set contains any outliers? Or any high leverage data points?

Ordinary residuals depend on the units of measurement, which makes it hard to judge when one is large. Dividing each residual by an estimate of its standard deviation addresses this issue.

Influential data points can be identified by way of DFFITS and Cook's distances. Both are based on omitting the observations from the model one at a time. In one example, the absolute value of the DFFITS for the red data point is greater than the guideline value of 0.82, so the point is flagged.

A word of caution: do not delete data points just because they do not fit the model. If you delete any data after you've collected it, justify and describe it in your reports.
It is the job of a regression analyst to always determine whether the regression analysis is unduly influenced by one or more data points. A few plots help to elucidate matters.

There is a distinction between the two types of extreme values. A data point whose response does not follow the general trend of the rest of the data is an outlier. A data point whose x value is unusually far away from the bulk of the x values has high leverage. A data point can be both an outlier and a high leverage point. By the end of the lesson you should know how to detect outlying y values and how to detect outlying x values.

One way to assess the influence of a data point is to delete the observations one at a time, each time refitting the regression model, and then to compare the results with and without the point. In Minitab, plot the data and add the regression line to the scatterplot as a calculated line with y = FITS and x = x. Among the studentized deleted residuals listed in the output are -1.7431 and 0.1217. A large \(D_{i}\) indicates that the \(i^{th}\) data point strongly influences the estimated regression function.
Let's investigate what exactly that first statement means. The next two pages cover the Minitab and R commands for these procedures.

To see the effect of the red data point, create a worksheet that excludes observation #21 and refit the model. When the point is removed, the regression line "bounces back" away from it. In the example with the outlier, the studentized deleted residual for the red data point is 6.69013 and its DFFITS value is 1.55050. The point also substantially inflates the mean square error: MSE rises from 6.72 to 22.19 in the presence of the outlier.

For Cook's distance, treat the guideline values simply as red warning flags to investigate further. A \(D_{i}\) greater than 0.5 is worth a closer look. If the percentile value associated with \(D_{i}\) is near 50 percent or even higher, then the case has a major influence on the fit.
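
The "delete one observation and refit" idea behind DFFITS and Cook's distance can be sketched directly. The code below is an illustration on hypothetical data, not the lesson's data set; the helper names are invented, and the DFFITS guideline \(2\sqrt{(p+1)/(n-p-1)}\) is an assumption chosen because it reproduces the 0.82 quoted for n = 21 and p = 2.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def influence_measures(x, y):
    """Leverages, DFFITS and Cook's distances via leave-one-out refits."""
    n, p = len(x), 2
    X = np.column_stack([np.ones_like(x), x])
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    fitted = X @ fit_line(x, y)
    mse = np.sum((y - fitted) ** 2) / (n - p)
    dffits = np.empty(n)
    cooks = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        fitted_i = X @ fit_line(x[keep], y[keep])  # predictions without point i
        mse_i = np.sum((y[keep] - fitted_i[keep]) ** 2) / (n - 1 - p)
        # DFFITS: standardized change in the i-th fitted value.
        dffits[i] = (fitted[i] - fitted_i[i]) / np.sqrt(mse_i * h[i])
        # Cook's distance: standardized change in all n fitted values.
        cooks[i] = np.sum((fitted - fitted_i) ** 2) / (p * mse)
    return h, dffits, cooks

def dffits_cutoff(n, p):
    """Assumed guideline value: 2 * sqrt((p + 1) / (n - p - 1))."""
    return 2.0 * np.sqrt((p + 1) / (n - p - 1))

# Hypothetical data: ten points near y = 2x + 1, plus one point with an
# extreme x value whose response is far below the trend.
x = np.append(np.arange(1.0, 11.0), 25.0)
noise = np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1])
y = np.append(2.0 * np.arange(1.0, 11.0) + 1.0 + noise, 20.0)

h, dffits, cooks = influence_measures(x, y)
print("most influential index (Cook's D):", int(np.argmax(cooks)))
print("DFFITS cutoff for n = 21, p = 2:", round(float(dffits_cutoff(21, 2)), 2))
```

Refitting n times is wasteful for large data sets, since both measures have closed forms in terms of the residuals and leverages, but the loop mirrors the definition most directly.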
It is also possible for an observation to be flagged by its externally studentized residual, which Minitab calls the studentized deleted residual. The (unstandardized) deleted residual for a point such as \(y_{4}\) is the difference between the observed response and the response predicted from the model fit without that point. Dividing each deleted residual by an estimate of its standard deviation gives the studentized deleted residual. For the small data set with n = 4 and p = 2, the corresponding t distribution has 4 - 1 - 2 = 1 degree of freedom.

To assess the potential impact of a data point, fit the model to the data set twice, once with and once without the point, and compare the best fitting lines. In one example it is hard to even tell the two estimated regression equations apart. In another, the red data point significantly reduces the slope of the regression line from the value of 5.117 obtained without it.

If the percentile value associated with \(D_{i}\) is less than about 10 or 20 percent, then the case has little apparent influence on the fit. If the analyses with and without a point lead to contrary decisions, the researcher should report the results of both analyses.
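
The studentized deleted residual described above can be computed by literally deleting each point, refitting, and standardizing the prediction error. The sketch below uses a hypothetical n = 4 data set (not the lesson's values) and invented helper names; with p = 2 the reference t distribution has n - 1 - p = 1 degree of freedom, matching the count in the text.

```python
import numpy as np

def design(xv):
    """Design matrix with an intercept column."""
    return np.column_stack([np.ones_like(xv), xv])

def studentized_deleted(x, y, p=2):
    """Studentized deleted residuals obtained by refitting without each point."""
    n = len(x)
    t = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xk = design(x[keep])
        beta_k, *_ = np.linalg.lstsq(Xk, y[keep], rcond=None)
        # MSE of the fit that excludes observation i.
        mse_k = np.sum((y[keep] - Xk @ beta_k) ** 2) / (n - 1 - p)
        xi = np.array([1.0, x[i]])
        # Deleted residual: observed y_i minus its prediction from the other points.
        deleted_resid = y[i] - xi @ beta_k
        # Estimated variance of that prediction error.
        var_d = mse_k * (1.0 + xi @ np.linalg.inv(Xk.T @ Xk) @ xi)
        t[i] = deleted_resid / np.sqrt(var_d)
    return t

# Hypothetical n = 4 data set; NOT the values from the lesson.
x = np.array([1.0, 2.0, 3.0, 10.0])
y = np.array([2.0, 5.0, 6.0, 9.0])
n, p = len(x), 2

t_values = studentized_deleted(x, y, p)
df = n - 1 - p  # degrees of freedom of the reference t distribution

print("studentized deleted residuals:", np.round(t_values, 4))
print("degrees of freedom:", df)
```

Because the point being judged is left out of the fit, an outlier cannot mask itself by pulling the line toward it or by inflating MSE.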
The red data point has an extreme x value, so it does have high leverage. Outlying x values are detected using leverages, and outlying y values by way of internally studentized residuals (what Minitab calls standardized residuals).

In one comparison of the fits with and without the red data point, the estimated slopes are similar: 5.04 and 5.12, respectively. When removing a point produces major changes in the regression analysis, there should be little doubt that the point is influential. In less clear-cut cases, use your judgment and investigate the data point further.