Author Topic: Goodness of Fit

jello7

Goodness of Fit
« on: July 03, 2012, 04:41:59 pm »
Hey guys,

Just wondering if anyone knows how to quantitatively measure how well a curve fits data points. Say I had a set of data points and two candidate curves; how could I distinguish which curve models the data better? I think it's called 'goodness of fit', and I do something called 'residual mean square' (not completely sure if that's the right name), but I've googled it and don't really understand how to use it.

At the moment, I'm just calculating the value the curve would give for each x-value of the actual data, comparing it to the actual data point, and taking the absolute value of the difference (by just ignoring the sign). However, this method seems inefficient, and I'm not sure it actually gives a 'goodness of fit'. Is there an online calculator or a straightforward formula that can be used, and is the method I'm using correct?

Thanks in advance everyone


TrueTears

Re: Goodness of Fit
« Reply #1 on: July 03, 2012, 05:18:34 pm »
Hi there, we normally measure goodness of fit using $R^2$ (or adjusted $R^2$) and the concepts of SST, SSE and SSR.

There is also a wiki article you should have a read of, although it is not very comprehensive: http://en.wikipedia.org/wiki/Total_sum_of_squares

I will try to explain this in the following:

[image no longer available]

Now we can think of each observation as being made up of an explained part and an unexplained part. Thus $y_i = \hat{y}_i + \hat{u}_i$, where $\hat{u}_i$ is the error term (residual) in the regression.

Thus we can define the explained part as $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and the unexplained part as $\hat{u}_i = y_i - \hat{y}_i$, where all the terms that carry a hat are the sample estimates of the population parameters.

Thus we have $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i$.

Let us define

$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ (the total sum of squares)

$SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ (the explained sum of squares)

$SSR = \sum_{i=1}^{n} \hat{u}_i^2$ (the residual sum of squares)

It may not be immediately obvious, but SST = SSE + SSR. The proof is given:

$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$

$= \sum_{i=1}^{n} \left[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\right]^2$

$= \sum_{i=1}^{n} \left[\hat{u}_i + (\hat{y}_i - \bar{y})\right]^2$

$= \sum_{i=1}^{n} \hat{u}_i^2 + 2\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$

$= SSR + 2\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) + SSE$

Also, it may not be immediately obvious that $\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) = 0$.

Again I'll prove my claim:

Note that $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, so

$\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) = \hat{\beta}_0 \sum_{i=1}^{n} \hat{u}_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i \hat{u}_i - \bar{y} \sum_{i=1}^{n} \hat{u}_i$

Now we know from the OLS first order conditions that $\sum_{i=1}^{n} \hat{u}_i = 0$ and $\sum_{i=1}^{n} x_i \hat{u}_i = 0$. I will prove this:

Note that OLS chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimise $\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$. Setting the partial derivative with respect to $\hat{\beta}_0$ to zero gives $-2\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$, which implies $\sum_{i=1}^{n} \hat{u}_i = 0$; setting the partial derivative with respect to $\hat{\beta}_1$ to zero gives $-2\sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$, which implies $\sum_{i=1}^{n} x_i \hat{u}_i = 0$. (These are the sample analogues of the population assumptions $E(u) = 0$ and $E(xu) = 0$, which follow from the zero conditional mean assumption $E(u|x) = 0$ by taking expectations.)

Now $\sum_{i=1}^{n} \hat{u}_i (\hat{y}_i - \bar{y}) = \hat{\beta}_0 \cdot 0 + \hat{\beta}_1 \cdot 0 - \bar{y} \cdot 0 = 0$, as required.

Thus SST = SSE + SSR.

Now we define

$R^2 = \dfrac{SSE}{SST} = 1 - \dfrac{SSR}{SST}$

Thus the higher the $R^2$, the better the goodness of fit. However, it can be shown that as we add more independent variables to the regression, $R^2$ will increase even if the added variables contribute nothing to explaining the dependent variable. I won't show this here as it is a bit complicated. Thus it is sometimes better to use adjusted $R^2$, which Excel can calculate; again, I won't derive adjusted $R^2$ as there are some more complicated assumptions we have to apply.
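To see the decomposition in action, here is a minimal numerical sketch in Python (numpy only, with made-up roughly linear data; an illustration, not the only way to do it) that fits a simple OLS line and checks SST = SSE + SSR and the two equivalent $R^2$ formulas:

```python
import numpy as np

# Made-up sample data (any roughly linear data will do)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# OLS estimates: slope b1 and intercept b0
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x   # fitted values
u_hat = y - y_hat     # residuals

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
sse = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
ssr = np.sum(u_hat ** 2)                # residual sum of squares

print(sst, sse + ssr)            # equal up to floating-point error
print(sse / sst, 1 - ssr / sst)  # two equivalent R^2 formulas
```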

TrueTears

Re: Goodness of Fit
« Reply #2 on: July 03, 2012, 05:21:13 pm »
If you are interested, however, wiki has a good article on $R^2$ and adjusted $R^2$:

http://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2
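For reference, one standard form of adjusted $R^2$ (for $n$ observations and $k$ independent variables) is

$\bar{R}^2 = 1 - \dfrac{SSR/(n-k-1)}{SST/(n-1)} = 1 - (1 - R^2)\dfrac{n-1}{n-k-1}$

so, unlike plain $R^2$, it can decrease when an added variable contributes too little.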

jello7

Re: Goodness of Fit
« Reply #3 on: July 03, 2012, 07:33:30 pm »
Quote from: TrueTears on July 03, 2012, 05:18:34 pm
[explanation of SST, SSE, SSR and R^2 snipped — see Reply #1 above]

Sorry, I'm still confused (still in year 12 :( ). I'm not sure how to apply the above formulae to a data set, so I'll explain my problem a little further. Say I have the following data points:

(10, 3.00)
(20, 2.12)
(30, 1.67)
(40, 1.33)
(50, 1.15)
(60, 1.00)

For these points, I must create a function that most accurately models the data points; however, the equation must be of the form:

y = (b(c+a))/(x+a) - b    (sorry, don't know how to use LaTeX yet)

So I have to find the parameters a, b and c.

(^ this is probably irrelevant)

Through trial and error, I am able to find a couple of graphs that are fairly close, but I am unable to distinguish which models the data better merely by looking at them; hence, a formula to do so is needed. I was taking the values the function gives, subtracting the actual values from them, ignoring the sign of the result, and then finding the sum of all these 'difference values'. A smaller sum would mean a smaller difference between the function and the data.

So how would I apply the formulas you have given to a problem like the one stated above?
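To connect the two posts: the $R^2 = 1 - SSR/SST$ formula above can be applied to predictions from any fitted function, including this hyperbolic form. A minimal Python sketch, where the two parameter triples are hypothetical placeholders rather than real trial-and-error fits:

```python
import numpy as np

# The six data points from the post
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([3.00, 2.12, 1.67, 1.33, 1.15, 1.00])

def model(x, a, b, c):
    # Required functional form: y = b(c + a)/(x + a) - b
    return b * (c + a) / (x + a) - b

def r_squared(y, y_pred):
    # R^2 = 1 - SSR/SST, using the definitions from the reply above
    ssr = np.sum((y - y_pred) ** 2)    # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1 - ssr / sst

# Two hypothetical candidate parameter sets (a, b, c) -- placeholders only
for params in [(5.0, 0.5, 100.0), (8.0, 0.6, 90.0)]:
    print(params, r_squared(y, model(x, *params)))
# The candidate whose R^2 is closer to 1 fits these points better.
```

For a fit that isn't ordinary least squares, $R^2$ computed this way is only a descriptive comparison number (it can even be negative), but it still lets you rank the two candidates.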

TrueTears

Re: Goodness of Fit
« Reply #4 on: July 03, 2012, 07:50:02 pm »
Oh oops, thought you were in uni, since this was posted in the general maths section.

Anyway, you can't apply what I said above directly to non-linear regressions; what I stated was OLS (ordinary least squares). Applying it to the functional form you have given me takes much more work... well, it's definitely simplified down in VCE, but I'm not sure what you are expected to do.
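For what it's worth, outside of VCE/IB the usual tool for a model like this is non-linear least squares. A sketch using scipy's curve_fit (assuming scipy is available; the initial guesses are rough hand-picked assumptions and may need tuning):

```python
import numpy as np
from scipy.optimize import curve_fit

# The six data points from earlier in the thread
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([3.00, 2.12, 1.67, 1.33, 1.15, 1.00])

def model(x, a, b, c):
    # Required form: y = b(c + a)/(x + a) - b
    return b * (c + a) / (x + a) - b

# Non-linear fits can be sensitive to the starting point,
# so p0 may need adjusting before the optimiser converges.
p0 = (1.0, 1.0, 50.0)
params, _ = curve_fit(model, x, y, p0=p0)
a, b, c = params
print(a, b, c)  # least-squares estimates of the three parameters
```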

jello7

Re: Goodness of Fit
« Reply #5 on: July 03, 2012, 08:48:18 pm »
Quote from: TrueTears on July 03, 2012, 07:50:02 pm
[snipped — see Reply #4 above]

Ok, sorry for the ambiguity in my post (I'm currently doing the International Baccalaureate, a similar program to VCE). Basically, I have conducted an experiment in physics, and the results show a hyperbolic relationship.

I have to find an equation for the relationship that best fits the data; the equation must be in the form given in my previous post, that is:
y = (b(c+a))/(x+a) - b.

Now, through trial and error on Graphmatica, I have come up with several graphs to model the relationship, but looking at a plot of both functions together with the actual results doesn't tell me which function is the more accurate model. My teacher told me that to see which graph is a better fit, a goodness-of-fit formula can be used.

I believe he said to use something called 'residual mean square' (not quite sure if this is what it's called), and I think he said something along the lines of:

1: Write down the actual results, i.e. the x and y values (the independent variable is x and the dependent variable is y).
2: Write down the value of y the equation gives for each specific x value (so you take an x value from the data, put it into the equation, and record the resulting y value).
3: Subtract the predicted y values (the values of y obtained from the equation) from the actual values, giving the difference between the model's prediction and the actual observation.
4: Square and then square-root each 'difference value' to get the absolute value of the difference between the equation's prediction and the actual value.
5: Sum up all of these absolute values over all the data points; the bigger the number, the bigger the difference between the model and the actual data?

What I want to know is whether the above method for finding 'goodness of fit' is correct and is how it should be done: does a smaller 'sum of absolute differences' mean the function better models the data? I have two functions, and I have used the method stated above (a method I'm not sure I'm recalling correctly) to see which gives the smaller sum, and I am thus using that function. Is this the correct approach to quantitatively distinguish which function better models the actual data?
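The five steps above translate almost line-for-line into Python. Here is a minimal sketch; the two parameter sets are hypothetical placeholders, not the actual Graphmatica fits, and the second function computes the root-mean-square residual that a name like 'residual mean square' suggests:

```python
import numpy as np

# Step 1: the actual results (x = independent, y = dependent)
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([3.00, 2.12, 1.67, 1.33, 1.15, 1.00])

def model(x, a, b, c):
    # The required form: y = b(c + a)/(x + a) - b
    return b * (c + a) / (x + a) - b

def sum_abs_diff(y, y_pred):
    # Steps 3-5: differences, absolute values, summed (smaller = better fit)
    return np.sum(np.abs(y - y_pred))

def rms_residual(y, y_pred):
    # Alternative: root of the mean squared residual (also smaller = better)
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Two hypothetical candidate parameter sets (a, b, c)
for params in [(5.0, 0.5, 100.0), (8.0, 0.6, 90.0)]:
    y_pred = model(x, *params)  # step 2: predicted y for each x
    print(params, sum_abs_diff(y, y_pred), rms_residual(y, y_pred))
# The candidate with the smaller sum (or RMS) models the data better.
```

Note that the two measures can occasionally disagree on the ranking, since squaring penalises large misses more heavily than the absolute value does.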

Thanks a tonne, TrueTears, for your in-depth answer; sorry I wasn't clear enough.