Tuesday, September 20, 2011

Cross-tabulation and scatter plots

Cross-tabulations and scatter plots can be used on the same data to give different insights. Using the Fortune dataset (available in the Datasets folder), I have made two YouTube videos:

Click on the links to launch:

Scatter plots

Cross-tabulation

Monday, September 19, 2011

Data Sources for your Project

You can either collect your own data or use data from the Internet. There is plenty of it! Just remember to include a citation. Below are some links to data sources to get you going. If you find more, please share them with me and I will add them to the list. The library is an excellent place to start, and you can ask the helpful librarians if you can't find what you are looking for.


Vancouver
World Bank
BC Data
Kwantlen Library…business stats

Monday, July 18, 2011

Tests and your projects

For your projects, an important requirement is that you write the hypotheses appropriate for your project and then carry out a statistical test. These tests are:

1. One population, probably testing the mean against some hypothesised mean. This is Chapter 9. For example, the Skytrain management claims that their trains are never more than 3 minutes behind schedule.

2. Two populations, testing whether there is any difference in the means of two populations. This is Chapter 10. For example, do cosmetics cost the same at London Drugs and Pharmasave?

3. Test of independence, or 'chi-squared'. This is Chapter 12. For example, do men and women choose different types of drink at Starbucks? You will want to arrange your data in a contingency table.

4. Regression, testing for the relationship between a dependent variable and one or more independent variables. This is Chapter 13. For example, does the unemployment rate have any relationship to the sale of used cars?
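
For test 3, a contingency table is just a grid of counts, with one variable on the rows and the other on the columns. The sketch below, in Python, shows the layout and how the chi-squared statistic is built from observed and expected counts. The counts are invented for illustration (they are not real Starbucks data) and the variable names are my own. In the course you would do this in Excel.

```python
# Hypothetical counts, invented for illustration only.
# Rows are the groups, columns are the drink types.
observed = {
    "men":   {"coffee": 30, "tea": 20},
    "women": {"coffee": 20, "tea": 30},
}

# Row totals, column totals and the grand total of the table.
row_totals = {row: sum(cells.values()) for row, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

# Expected count under independence = row total * column total / grand total.
chi_squared = 0.0
for row, cells in observed.items():
    for col, count in cells.items():
        expected = row_totals[row] * col_totals[col] / grand_total
        chi_squared += (count - expected) ** 2 / expected

degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
print(chi_squared, degrees_of_freedom)
```

With these invented counts every expected value is 25, so the statistic comes out at 4.0 with 1 degree of freedom. That is above the usual 5% critical value of 3.84, so for this made-up table we would reject the hypothesis that drink choice is independent of gender.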

Note that you have to carry out a statistical test of your hypotheses using one or more of the methods above. To do this you need data. So if you can't find any data, you'll need to think again.

You can collect data yourself, or take it from a reliable source. If you need any help, you have only to ask me. If you collect data yourself, be aware of privacy issues. No names or personal details, please.

Saturday, July 16, 2011

1 - norm.dist

Link to Youtube

When to use just norm.dist and when to subtract from one? The key is remembering that Excel adds up from the LEFT of the distribution. So =norm.dist() will give you the probability of getting a value smaller than whatever X you have put into the function. If you want the probability of a value 'greater' than some X, then you must subtract the result from 1.

Example: the mean is 350, the standard deviation is 10. Find the probability of getting a value smaller than 335.

Solution: the question is asking us to find the probability of 335 or less. So = norm.dist(335, 350, 10, true) = 0.067 (rounded).

Example: what is the probability of getting a value of 370 or more? Now we want the area to the RIGHT of 370 in the distribution. So first find =norm.dist(370, 350, 10, true) = 0.98 (rounded). And now subtract from 1 to give 1 - 0.98 = 0.02.


Note that for these cumulative probabilities we always set the last argument of norm.dist to 'true'. The functions I've been using are for Excel 2010. For older versions of Excel, you leave out the dot (normdist).
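
If you want to check the two examples outside Excel, here is a minimal sketch in Python. It assumes the standard library's `statistics.NormalDist`, whose `cdf` method adds up from the left in the same way as norm.dist with 'true'. The variable names are my own.

```python
from statistics import NormalDist

# Normal distribution with mean 350 and standard deviation 10.
dist = NormalDist(mu=350, sigma=10)

# Area to the LEFT of 335: same idea as =norm.dist(335, 350, 10, true).
p_below_335 = dist.cdf(335)

# Area to the RIGHT of 370: subtract the left-hand area from 1.
p_above_370 = 1 - dist.cdf(370)

print(round(p_below_335, 3))  # 0.067
print(round(p_above_370, 3))  # 0.023
```

The second answer is 0.023 to three decimal places. The 0.02 in the worked example comes from rounding 0.977 to 0.98 before subtracting.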

Friday, July 15, 2011

Minimum sample size

Here's a question: the boss specifies that the margin of error should be 5 at 95% confidence. The standard deviation is known to be 10. What is the minimum sample size required to achieve that margin of error?

This question is important because sampling is expensive and we don't want to take more samples than we need to achieve some specific level of accuracy. There is no point. You are throwing away money.

Recall how we find the margin of error at 95% by hand:

M of E = 1.96 * SE. And the standard error is the standard deviation / square root of sample size.

Here we are given the M of Error (the boss says make it 5) and we know the standard deviation is 10. Put in what you know:

5 = 1.96 * 10/square root of n

Do the algebra and switch around the 5 and the square root of n

so square root of n = 1.96 * 10/5

square root of n = 3.92

We want n not the square root of n. So SQUARE both sides to get

n = 3.92 * 3.92 = 15.37 (rounded) BUT now you MUST round UP to 16. You can't take a fraction of an observation, and rounding down to 15 would give a margin of error slightly bigger than 5.
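
The same steps can be sketched in Python as a check on the hand calculation. This is only an illustration using the standard library. The variable names are my own, and the z-value is looked up from the standard normal distribution rather than typed in as 1.96.

```python
import math
from statistics import NormalDist

margin_of_error = 5   # what the boss asked for
std_dev = 10          # known standard deviation
confidence = 0.95

# z-value that leaves (1 - confidence) / 2 in each tail; about 1.96 for 95%.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Rearranged formula: n = (z * standard deviation / margin of error) squared.
n_exact = (z * std_dev / margin_of_error) ** 2

# Always round UP to the next whole observation.
n_required = math.ceil(n_exact)

print(round(z, 2), round(n_exact, 2), n_required)  # 1.96 15.37 16
```

Using `math.ceil` rather than ordinary rounding is the point here: any value above 15 has to become 16.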

How many samples?

Here's a question: how many samples of size 3 can you draw from a population of 5? Here n = 3 and N = 5.
The formula is N!/(n! * (N-n)!). The ! means 'factorial'. You multiply the number by all the positive integers smaller than it. So 3! = 3 * 2 * 1 = 6. It's a shorthand.

To answer this question: the number of different samples that can be drawn is
5!/(3! * (5-3)!) = 120/(6 * 2) = 10

Note that there is a ! key on most calculators. You might want to practice a bit.
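
If you would like to see the 10 samples rather than just count them, here is a short Python sketch. The population labels A to E are made up for the illustration. It works out the count three ways: with the factorial formula, with the built-in `math.comb`, and by listing every possible sample.

```python
import math
from itertools import combinations

population = ["A", "B", "C", "D", "E"]  # hypothetical population, N = 5
sample_size = 3                         # n = 3

N = len(population)
n = sample_size

# The factorial formula: N! / (n! * (N - n)!)
count_by_formula = math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

# Python's built-in version of the same calculation.
count_builtin = math.comb(N, n)

# Every distinct sample of size 3 (order does not matter, no repeats).
all_samples = list(combinations(population, sample_size))

print(count_by_formula, count_builtin, len(all_samples))  # 10 10 10
for sample in all_samples:
    print(sample)
```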

Friday, July 8, 2011

Skew

The 'skew' of a dataset indicates how much the distribution of the values in the variable differs from a symmetric distribution. A symmetric distribution (such as the normal distribution) can be divided into two mirror-image halves by a vertical line drawn through the mean, which is also the median and the mode. By contrast, skewed data has different values for the mean and the median. The histogram here is of gasoline prices and shows some skew to the right. Try this yourself with the "Gasoline" dataset.
We can see that there are some high gas prices which 'drag' the data to the right. So we would expect the mean to be larger than the median. Which it is: the mean is 234.96 and the median is 234.25. The difference is small, and by eye we can also see that the skew isn't much. As a result the figure for 'Skewness' given in the Descriptive Statistics is only 0.07993. Note that this is a positive number, indicating a skew to the right (where the tail is). A negative number would indicate a skew to the left: the tail is on the left and we would expect the mean to be smaller than the median.
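
To see where a skewness figure comes from, here is a minimal Python sketch. It uses the adjusted sample skewness formula that Excel's SKEW function is documented to use. I am assuming the Descriptive Statistics tool reports the same quantity. The data are invented for illustration and are not the Gasoline dataset, and the function name is my own.

```python
from statistics import mean, median, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

# Hypothetical data with one high value dragging the tail to the right.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]

# Hypothetical data that is perfectly symmetric about 3.
symmetric = [1, 2, 3, 4, 5]

print(mean(right_skewed), median(right_skewed))  # 3.5 3.0
print(sample_skewness(right_skewed))             # positive: skew to the right
print(sample_skewness(symmetric))                # zero: no skew
```

In the first list the mean (3.5) is larger than the median (3.0) and the skewness is positive, which is the same pattern as in the gasoline prices.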