Hey, just a few questions from the exam.
I didn't really get the last part (I think) of question 1. The question gave the summary statistics for incomes of white and black South Africans, said the summary statistics indicate the data is not very skewed, and asked why this is not necessarily inconsistent with the previous question.
In the previous question I said the data would probably be positively skewed because there are likely to be a few very high income earners in the sample. The summary statistics, however, appeared to just reinforce this conclusion, with skewness values of 4 and 6 indicating both samples were very positively skewed. I don't really follow what the question was asking, because it appeared to me that the two questions were consistent...
Also, I'm assuming there was an error in the last question and they actually meant to introduce a dummy variable for the year 2008, not 1998? I still answered the question as though it meant 1998, which meant I had to say the last part made no sense at all (although I did add a note saying it would make sense if there was a 2008 dummy variable).
In short, for the first question: the summary statistics separated White and Non-white, whereas in the frequency distribution before we calculated the mean for ALL racial groups together. As we can see from the descriptive statistics, White households clearly earn more than Non-white households and have more extreme values (check the maximum and the range). So the skewness from earlier was probably due to extreme income earners among White households. The pooled distribution was very positively skewed, which means most households earn below the mean income but there are a few very high income earners, possibly the extreme values from WHITE. Once we separate the two racial groups, the skewness is less apparent because we have separated the effect of high-earning White households from the Non-white ones. So it doesn't contradict what we had before; it just illustrates the income distribution differently.
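If it helps, here is a quick Python sketch of how pooling two groups can create skewness that neither group has on its own. The numbers are made up purely for illustration (they are not the exam's data), and `skewness` is just a small helper written for this example using the usual moment-based definition.

```python
def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up incomes in arbitrary units, NOT the exam data.
# Each group is perfectly symmetric around its own mean.
lower_group = [8, 9, 10, 11, 12] * 20    # 100 households centred on 10
higher_group = [48, 49, 50, 51, 52] * 2  # 10 households centred on 50
pooled = lower_group + higher_group      # what a single "all groups" table sees

skew_lower = skewness(lower_group)
skew_higher = skewness(higher_group)
skew_pooled = skewness(pooled)

print(f"lower group skewness:  {skew_lower:.2f}")
print(f"higher group skewness: {skew_higher:.2f}")
print(f"pooled skewness:       {skew_pooled:.2f}")
```

In this toy setup each group has zero skewness by construction, while the pooled sample comes out strongly positively skewed, because the small high-income group forms a long right tail once everything is lumped together.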
For the last question, the dummy variable introduced for 1998 is correct. In fact they could have added it for 1999 if they wanted, but 1998 makes more sense. The coefficient on t is smaller because, by effectively increasing the values for each month in 1998, we smooth out the effect of the outlier in 2008. (Think about it this way: the outlier is "less of an" outlier once we have increased the values of some of our other data points.) This forces the linear model to have a smaller gradient by making it pass "underneath" the outlier. On the exam I also drew a diagram showing the new linear model and the old one, compared them mathematically, and proved that the coefficient on t had to decrease. The values were added to 1998 because those values were much lower than the outlier in 2008. The graph clearly had an upward trend, so it was obvious that we want to increase the smaller data points (which occurred in 1998) rather than those in other years, which were generally higher; increasing the values in other years would actually have worsened our model because it would not have smoothed out the effect of the outlier as effectively. You could also have argued that t decreased because the intercept clearly increased. Mathematically this makes sense: if a line has to fit much the same later points but starts from a bigger y-axis intercept, it must have a smaller gradient.
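To see the intercept/gradient trade-off in numbers, here is a rough Python sketch. Everything in it is an assumption for illustration: the series is invented (a straight upward trend with the 1998 months sitting below it), it is not the exam's data, and it leaves out the 2008 outlier so the arithmetic stays exact. The helpers `solve_linear_system` and `ols` are hypothetical names written just for this example; they fit least squares via the normal equations.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients of y on the given regressor columns."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(p * v for p, v in zip(columns[i], y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical monthly series: January 1998 (t = 1) to December 2008 (t = 132).
months = list(range(1, 133))
ones = [1.0] * len(months)
t = [float(m) for m in months]
d1998 = [1.0 if m <= 12 else 0.0 for m in months]  # dummy: 1 in 1998, else 0

# Made-up data: trend 50 + 0.5 t, with every 1998 month 20 units below it.
y = [50.0 + 0.5 * m - 20.0 * d for m, d in zip(months, d1998)]

intercept_plain, slope_plain = ols([ones, t], y)
intercept_dummy, slope_dummy, coef_1998 = ols([ones, t, d1998], y)

print(f"no dummy:   intercept={intercept_plain:.2f}, slope={slope_plain:.3f}")
print(f"with dummy: intercept={intercept_dummy:.2f}, slope={slope_dummy:.3f}, "
      f"1998 coefficient={coef_1998:.2f}")
```

In this invented series, the model without the dummy has to tilt upward to reach the low 1998 months, so it gets a lower intercept and a steeper slope. Once the dummy accounts for 1998, the intercept rises and the coefficient on t falls, which is the direction of change described above. Whether the exam's actual data behaves the same way depends on the real numbers.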
Also, I think I know why you thought adding a 2008 dummy variable would be appropriate. In fact that would make sense too: we would normally remove that outlier, so we could introduce a dummy variable to remove it. But another way of eliminating the effects of the outlier is to leave it in and increase the values of other data points. Stupid, I agree, but a good question to test understanding on the exam.