Author Topic: Monash Bus Stats question (Read 2075 times)

lynt.br · « **on:** November 06, 2010, 01:53:20 am »

I'm having trouble understanding question (1)(b) from the sample exam we were given in the last revision lecture.

The question asks: For a simple random sample, X-bar is an unbiased estimator of μ if E[X-bar] = μ. Prove that this is true when X ~ N(μ,σ²). What does this mean in practice?

I don't quite follow what the question is asking, so I'm not sure exactly what I'm trying to prove or how I would go about doing so.

Thanks!

TrueTears · « **Reply #1 on:** November 06, 2010, 01:58:59 am »

The proof is a fairly easy mathematical exercise:

$E(\bar{X}) = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]$

$= \frac{1}{n}E\left[\sum X_i\right]$

$= \frac{1}{n}E[X_1+X_2+ \cdots + X_n]$

$= \frac{1}{n}\left(n \mu\right)$ since expectation is linear, $E[X_1 + \cdots + X_n] = E(X_1) + \cdots + E(X_n)$, and we are given $X \sim N(\mu, \sigma^2)$ so $E(X_i) = \mu$ for every $i$

Thus $E(\bar{X}) = \mu$ as required.

The meaning of this is that the sample mean has no systematic tendency to over- or under-estimate the true population mean. Any single random sample will usually give a value a bit above or below $\mu$, but averaged over many repeated samples the sample mean equals $\mu$.
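A quick way to see what this means in practice is to simulate it. The sketch below is not part of the exam question: the values μ = 50, σ = 10, n = 25 and the 2000 repeated samples are made-up numbers chosen only for illustration.

```python
import random
import statistics

# Assumed, made-up population parameters (not from the exam).
MU, SIGMA = 50.0, 10.0
N = 25            # size of each simple random sample
REPEATS = 2000    # number of repeated samples

rng = random.Random(0)  # fixed seed so the run is reproducible

# Draw many samples from N(MU, SIGMA^2) and record each sample mean.
sample_means = []
for _ in range(REPEATS):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

average_of_means = statistics.fmean(sample_means)
print("first sample mean:    ", round(sample_means[0], 2))
print("average of all means: ", round(average_of_means, 2))
```

Individual sample means wander around 50, but their average over many samples should sit very close to it, which is what $E(\bar{X}) = \mu$ says.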

lynt.br · « **Reply #2 on:** November 06, 2010, 03:26:51 am »

Thanks for the reply but I still can't get my head around it -___-

Would you be able to explain those steps? That might help me a bit (I'm a bit of a lost cause when it comes to maths haha).

TrueTears · « **Reply #3 on:** November 06, 2010, 03:41:45 am »

haha alright, well the first line is just replacing $\bar{X}$ with its summation notation.

Second line... you can take out a factor of $\frac{1}{n}$ since constants can be pulled outside the expectation, i.e. $E(cX) = cE(X)$.

Then we expand the summation of $X_i$ on the third line, then on the fourth line we split the expectation of the sum into the sum of the expectations and replace each $E(X_1), E(X_2), E(X_3), ...$ with $\mu$ since it says in the question that X has an expected value, $E(X)$, of $\mu$

But we have $n$ of these $X$'s, so we have $n\mu$ in total.

Then the $\frac{1}{n}$ cancels out the $n$ which leaves us with $\mu$ !
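To make the steps concrete, here is the same argument written out for a sample of just $n = 3$ observations (the choice of 3 is only for illustration, it is not from the exam):

$E(\bar{X}) = E\left[\frac{1}{3}(X_1 + X_2 + X_3)\right]$

$= \frac{1}{3}\left[E(X_1) + E(X_2) + E(X_3)\right]$

$= \frac{1}{3}(\mu + \mu + \mu) = \frac{1}{3}(3\mu) = \mu$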

lynt.br · « **Reply #4 on:** November 06, 2010, 03:51:23 am »

Ahhh it's like a light just flicked on in my head haha! Thanks for having patience with me, this made it a lot clearer!

TrueTears · « **Reply #5 on:** November 06, 2010, 02:34:34 pm »

No problem, glad to help!

lynt.br · « **Reply #6 on:** November 09, 2010, 06:29:26 pm »

Hey just a few questions from the exam

I didn't really get the last part (I think) of question 1. The question gave the summary statistics for incomes of white and black South Africans, said the summary statistics indicate the data is not very skewed, and asked why this is not necessarily inconsistent with the previous question.

In the previous question I said the data would probably be positively skewed because there are likely a few very high income earners in the sample. The summary statistics, however, appeared to just reinforce this conclusion, with skewness values of 4 and 6 indicating both samples were very positively skewed. I don't really follow what the question was asking because it appeared to me that both questions were consistent...

Also I'm assuming there was an error in the last question and they actually meant to introduce a dummy variable for year 2008 not 1998? I still answered the question as though it meant 1998 meaning I had to say the last part made no sense at all (although I did add a note saying it would make sense if there was a 2008 dummy variable).

TrueTears · « **Reply #7 on:** November 09, 2010, 06:36:29 pm »

In short, for the first question, the summary statistics separated White and Non-white, whereas in the frequency distribution before we calculated the mean for ALL racial groups together. As we can see from the descriptive statistics, White households clearly earn more than Non-white and have more extreme values (check the maximum and range). Thus the skewness from earlier was probably due to extreme income earners among White households (the pooled distribution was very positively skewed, which means most households earn below the mean income but there were a few very high income earners, possibly extreme values from WHITE). If we separate these 2 racial groups, the skewness is less apparent because we have separated the effect of high-earning White households from the Non-white ones. So it doesn't contradict what we had before, it just illustrates the income distribution differently.
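The pooling effect can be sketched with made-up numbers. These are not the exam figures: the two hypothetical groups below are each perfectly symmetric (skewness 0), and the `skewness` helper is just the usual moment coefficient written out by hand.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical incomes: a large lower-income group and a small
# higher-income group, each symmetric around its own mean.
group_a = [8, 9, 10, 11, 12] * 18       # 90 households, mean 10
group_b = [80, 90, 100, 110, 120] * 2   # 10 households, mean 100
pooled = group_a + group_b

print("skewness, group A:", round(skewness(group_a), 2))
print("skewness, group B:", round(skewness(group_b), 2))
print("skewness, pooled: ", round(skewness(pooled), 2))
```

Neither group is skewed on its own, yet the pooled data is strongly positively skewed because the small high-income group forms a long right tail.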

For the last question, the dummy variable introduced for 1998 is correct. In fact they could have added it for 1999 if they wanted but 1998 makes more sense. The t is smaller because by increasing the values for each month in 1998 we smooth out the effect of the outlier in 2008. (Think about it this way: the outlier is "less of an" outlier since we increased the values of some of our data points.) This forces our linear model to have a smaller gradient by making it pass "underneath" the outlier. On the exam I also drew a diagram showing the new linear model and the old linear model, compared them mathematically and proved t had to decrease.

The values were added to 1998 because these values were much lower than the outlier in 2008. As the graph clearly had an upward trend, it was obvious that we want to increase the smaller data points (which occurred in 1998) rather than those in other years, which were generally higher. If we increased the values in other years, this would actually have worsened our model because it would not have smoothed out the effect of the outlier effectively.

You could also have argued that t decreased because the intercept clearly increased. Mathematically this makes sense: for a line fitted through the same data, a bigger y-axis intercept goes with a smaller gradient.

Also I think I know why you thought adding a 2008 dummy variable would be appropriate. In fact that would make sense too: we would normally remove that outlier, so we could introduce a dummy variable to remove it. But another way of eliminating the effect of the outlier is to leave it in and increase the values of other data points. Stupid, I agree, but a good question to test understanding on the exam.
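The effect of a dummy on the fitted line can be sketched with made-up numbers. This is not the exam data: it assumes a hypothetical 36-month series that follows y = 100 + 2t exactly, except that the first 12 months (standing in for 1998) sit 30 units lower, and it leaves the 2008 outlier out entirely. The helper names `solve` and `ols` are only for this sketch.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns]
           for ci in columns]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)

months = list(range(1, 37))                          # t = 1..36
dummy = [1.0 if t <= 12 else 0.0 for t in months]    # 1 in the first year only
y = [100 + 2 * t - 30 * d for t, d in zip(months, dummy)]
ones = [1.0] * len(months)

a_plain, b_plain = ols([ones, months], y)                  # trend only
a_dummy, b_dummy, c_dummy = ols([ones, months, dummy], y)  # trend + dummy

print("without dummy: intercept", round(a_plain, 2), "slope", round(b_plain, 2))
print("with dummy:    intercept", round(a_dummy, 2), "slope", round(b_dummy, 2))
```

In this toy series the fit with the first-year dummy has a flatter slope and a higher intercept than the fit without it, which is the direction described above.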

lynt.br · « **Reply #8 on:** November 09, 2010, 07:12:20 pm »

Ah I see. I didn't really understand what the question was saying when it said the skewness is no longer apparent. I was under the impression both the descriptive stats for White and non-white were pretty positively skewed, just white was more skewed.

mm yeah.. because if we increase just 1998... aren't we just distorting the data even more? I thought the idea of introducing a new dummy variable was to rule out the unusual or uncommon data (i.e. 2008) so we have a model that better represents normal conditions? I guess this is why I said I thought it should have been a 2008 dummy...

TrueTears · « **Reply #9 on:** November 10, 2010, 01:25:30 am »

The reason why we increase 1998 values is to decrease the effects of the outlier. Yeah I know, it's a pretty shit way of doing it, might as well just exclude the outlier, but I think they were trying to test our understanding...
