Monday, July 18, 2011

Tests and your projects

An important requirement for your project is that you write appropriate hypotheses and then carry out a statistical test. The tests to choose from are:

1. One population, probably testing the mean against some hypothesised mean. This is Chapter 9. For example, the Skytrain management claims that their trains are never more than 3 minutes behind schedule.

2. Two populations, testing whether there is any difference in the means of two populations. This is Chapter 10. For example, do cosmetics cost the same at London Drugs and Pharmasave?

3. Test of independence, or 'chi-squared'. This is Chapter 12. For example, do men and women differ in the type of drink they order in Starbucks? You would want to get your data into a structure looking like a contingency table.

4. Regression, testing for the relationship between a dependent variable and one or more independent variables. This is Chapter 13. For example, does the unemployment rate have any relationship to the sale of used cars?

Note that you have to do a statistical test of your hypotheses using one or more of the methods above. To do this you need data. So if you can't find any data, then you'll need to think again.

You can collect data yourself, or from a reliable source. If you need any help, you have only to ask me. If you collect data yourself, be aware of privacy issues. No names or personal details, please.

Saturday, July 16, 2011

1 - norm.dist

Link to Youtube

When to use just norm.dist and when to subtract from one? The key is remembering that Excel adds up from the LEFT of the distribution. So =norm.dist() will give you the probability of getting a value smaller than whatever value for X you have put into the function. If you wanted the probability of a value 'greater' than some X, then you must subtract from 1.

Example: the mean is 350, the standard deviation is 10. Find the probability of getting a value smaller than 335.

Solution: the question is asking us to find the probability of a value smaller than 335. So =norm.dist(335, 350, 10, true) = 0.067 (rounded).

Example: what is the probability of finding a value of 370 or more? Now we want the area to the RIGHT of 370 in the distribution. So first find =norm.dist(370, 350, 10, true) = 0.98. And now subtract from 1 to give 1 - 0.98 = 0.02.


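Note that with norm.dist we always use 'true' as the last argument, which asks for the cumulative probability. The functions I've been using are for Excel 2010. For older versions of Excel, you miss out the dot (so normdist rather than norm.dist).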
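
If you want to check these two answers outside Excel, here is a minimal sketch in Python. It assumes Python 3.8 or later, where the standard library has statistics.NormalDist; the variable names are my own and nothing here is an Excel function.

```python
from statistics import NormalDist

# Normal distribution from the examples: mean 350, standard deviation 10
dist = NormalDist(mu=350, sigma=10)

# Area to the LEFT of 335, the same idea as =norm.dist(335, 350, 10, true)
p_below_335 = dist.cdf(335)

# Area to the RIGHT of 370, so subtract the left-hand area from 1
p_above_370 = 1 - dist.cdf(370)

print(round(p_below_335, 3))  # 0.067
print(round(p_above_370, 3))  # 0.023
```

The second answer comes out as 0.023 rather than 0.02 only because the code does not round 0.98 before subtracting.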

Friday, July 15, 2011

Minimum sample size

Here's a question: the boss specifies that the margin of error should be 5 at 95%. The standard deviation is known at 10. What is the minimum sample size required to achieve that margin of error?

This question is important because sampling is expensive and we don't want to take more samples than we need to achieve some specific level of accuracy. There is no point. You are throwing away money.

Recall how we find the margin of error at 95% by hand:

M of E = 1.96 * SE. And the standard error is the standard deviation / square root of sample size.

Here we are given the M of Error (the boss says make it 5) and we know the standard deviation is 10. Put in what you know:

5 = 1.96 * 10/square root of n

Do the algebra and switch around the 5 and the square root of n

so square root of n = 1.96 * 10/5

square root of n = 3.92

We want n not the square root of n. So SQUARE both sides to get

n = 3.92 * 3.92 = 15.37 BUT now you MUST round UP to 16. It is not possible to take a fraction of an observation.
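
The same algebra can be sketched in a few lines of Python. This is only an illustration: the function name minimum_sample_size is made up, and it assumes a known standard deviation and z = 1.96 for 95%, as in the question.

```python
import math

def minimum_sample_size(margin_of_error, std_dev, z=1.96):
    """Smallest n for which z * std_dev / sqrt(n) is within the margin of error."""
    root_n = z * std_dev / margin_of_error  # square root of n, here 3.92
    return math.ceil(root_n ** 2)           # square it, then always round UP

print(minimum_sample_size(5, 10))  # 16
```

math.ceil does the rounding up for us, so 15.37 becomes 16 and never 15.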

How many samples?

Here's a question: how many samples of size 3 can you draw from a population of 5? Here n = 3 and N = 5.
The formula is N!/(n! * (N-n)!). The ! means 'factorial': you multiply the number by all the positive integers smaller than it. So 3! = 3 * 2 * 1 = 6. It's a shorthand.

To answer this question: the number of different samples that can be drawn is
5!/(3! * (5-3)!) = 120/(6 * 2) = 10

Note that there is a ! key on most calculators. You might want to practice a bit.
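
If you would rather check the factorials on a computer, here is a small Python sketch. The function name number_of_samples is my own; math.factorial and math.comb are in the standard library (math.comb needs Python 3.8 or later).

```python
import math

def number_of_samples(N, n):
    """Number of different samples of size n from a population of size N."""
    return math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

print(number_of_samples(5, 3))  # 10
print(math.comb(5, 3))          # 10, the built-in shortcut for the same formula
```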

Friday, July 8, 2011

Skew

The 'skew' of a dataset indicates how much the distribution of the values in the variable differs from a symmetric distribution. A symmetric distribution (such as the normal distribution) can be divided into two equal halves by a vertical line drawn through the mean, which is also the median and the mode. By contrast, skewed data has different values for the mean and the median. The histogram here is of gasoline prices and shows some skew to the right. Try this yourself with the "Gasoline" dataset.
We can see that there are some high gas prices which 'drag' the data to the right. So we would expect the mean to be larger than the median. Which it is: mean is 234.96, median is 234.25. The difference is small and also by eye we can see that the skew isn't much. As a result the figure for 'Skewness' given in the Descriptive Statistics is only 0.07993. Note that this is a positive number, indicating a skew to the right (where the tail is). A negative skew is the same thing, but in reverse.
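
To see where a skewness figure like this comes from, here is a rough Python sketch. It uses the adjusted sample skewness formula that Excel documents for its SKEW function; I am assuming the Descriptive Statistics tool reports the same quantity. The prices are made up for illustration and are NOT the "Gasoline" dataset.

```python
from statistics import mean, median, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(values)
    xbar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return n / ((n - 1) * (n - 2)) * sum(((x - xbar) / s) ** 3 for x in values)

# Hypothetical prices: one high value creates a tail on the right
prices = [228, 230, 231, 232, 233, 234, 235, 236, 238, 252]

print(mean(prices), median(prices))  # the mean is pulled above the median
print(sample_skewness(prices))       # positive, so a skew to the right
```

Flip the signs of the values and the skewness changes sign too, which is the 'same thing, but in reverse'.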

Tuesday, July 5, 2011

When to use the t distribution?

On pg 333, chapter 8, question 22, the sample size is less than 30, the SD is unknown, but in part (B) of the question it says "the population has a normal distribution".

So should you use the T distribution or normal distribution? Go for the T dist because the standard deviation is unknown. So, even if the population is normally distributed, if you don't know its standard deviation, use the T. In fact I always use the T distribution in applied (real) work. It is safer, giving you a wider margin of error and therefore a wider confidence interval. Therefore the chances of rejecting a true null (and therefore making a Type 1 error) are smaller. Nice question, thanks RB.

Note that you should be able to find confidence intervals using the normal distribution, and using the t distribution via Data Analysis > Descriptive Stats. There is a Youtube video here.

Point to note: don't mix up the Margin of Error and the Confidence Interval. We need to find the Margin of Error to create the Confidence Interval. The Confidence Interval is the sample mean (xbar) plus/minus the Margin of Error.
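
Here is a short Python sketch that keeps the two ideas apart. It is only an illustration: the function names and the numbers (sample mean 350, known standard deviation 10, n = 16) are hypothetical, and it uses the normal distribution because the Python standard library has no t distribution built in.

```python
import math
from statistics import NormalDist

def margin_of_error(std_dev, n, confidence=0.95):
    """Critical z value times the standard error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 at 95%
    return z * std_dev / math.sqrt(n)

def confidence_interval(xbar, std_dev, n, confidence=0.95):
    """Sample mean plus/minus the margin of error."""
    moe = margin_of_error(std_dev, n, confidence)
    return xbar - moe, xbar + moe

moe = margin_of_error(10, 16)
low, high = confidence_interval(350, 10, 16)

print(round(moe, 2))                  # 4.9
print(round(low, 2), round(high, 2))  # 345.1 354.9
```

The margin of error is a single number (4.9); the confidence interval is the pair of values you get by going that far either side of xbar.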

Hypothesis Test Worked Example

You find that you get to love stats so much you want to take more courses! To pay the tuition fees, you get a job in a restaurant as a chef. The boss says that he always budgets for 120gm of smoked salmon in the smoked salmon omelette dish, but you suspect him of lying. You secretly measure the amounts over 42 samples. You find that the mean amount is 117gm with a sample standard deviation of 15gm. This is important: if the boss is lying you could blackmail him and get rich! But the point is you don't know if he is putting in too much or too little!
a. Write the hypotheses for a two-tailed test
b. Test at 95%
c. Report your results
d. What are your conclusions?
Solutions....
a. H0: the population mean = 120gm. Ha: the population mean ≠ 120gm.
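b. This is a two-tailed test. n > 30 so we could make use of the central limit theorem and use the normal distribution. But for myself I prefer to use the t distribution. First find the test statistic: t = (117 - 120) / (15 / square root of 42) = -1.296. We use its absolute value, 1.296, with n - 1 = 41 degrees of freedom. So in Excel that's =t.dist.2t(1.296,41)
giving you a p value of 0.2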
c. We were asked to test at 95%, giving an alpha of 0.05. The p value that we calculated is larger than alpha (0.2 > 0.05) so following the rejection rule we fail to reject the null hypothesis. Shame about that!
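
For anyone who wants to reproduce the p value without Excel, here is a rough Python sketch using only the standard library. Because Python has no built-in t distribution, the helper functions t_pdf and two_tailed_p are my own: they write out the t density and integrate it numerically with Simpson's rule. Treat it as an illustration of the idea, not as a replacement for a statistics package.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log(1 + x * x / df))

def two_tailed_p(t_stat, df, steps=2000):
    """Two-tailed p value: 1 minus the area between -|t| and +|t|."""
    b = abs(t_stat)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_b = total * h / 3
    return 1 - 2 * area_0_to_b  # the density is symmetric about 0

# Numbers from the salmon example
n, xbar, s, mu0 = 42, 117, 15, 120
t_stat = (xbar - mu0) / (s / math.sqrt(n))
p_value = two_tailed_p(t_stat, n - 1)

print(round(t_stat, 3))   # -1.296
print(round(p_value, 2))  # 0.2
print(p_value > 0.05)     # True, so we fail to reject at 95%
```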