Friday, November 18, 2011

Chapter 7 Q22

Q. The annual cost of automobile insurance is $939. Assume that the population standard deviation is $245. Find the probability that a SRS of insurance policies will have a sample mean within $25 of the population mean for sample sizes 30, 50, 100, and 400.

A. The interval is between 939 – 25, and 939 + 25 or 914 to 964. We can do this in Excel. For the first sample, where n =30, find the standard error. This is the population standard deviation divided by the square root of the sample size. Where n =30, the standard error is 44.73.

First find the probability from the extreme left of the distribution to 964. In Excel that is =norm.dist(964,939,44.73,true) = 0.71. 

Now find the probability from the extreme left of the distribution to 914. In Excel this is =norm.dist(914,939,44.73,true) = 0.29. 

The final step is to subtract the smaller probability from the larger one. This gives the probability of the area between 914 and 964. This is 0.71 – 0.29 = 0.42.

Follow the same steps with the larger samples. You will see that the probability increases with the sample size. As the sample size increases, the standard error decreases and we are more confident of the location of the unknown population parameter µ. For example, with a sample size of n=400, the probability increases to 0.96.
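If you would like to check all four sample sizes at once outside Excel, here is a minimal sketch in Python (my choice of language, not something the textbook uses). It assumes `NormalDist` from Python's standard `statistics` module as a stand-in for =norm.dist; the helper name `prob_within` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

MU = 939      # population mean from the question
SIGMA = 245   # population standard deviation from the question


def prob_within(margin, n, mu=MU, sigma=SIGMA):
    """Probability that the mean of a sample of size n lands within +/- margin of mu."""
    se = sigma / sqrt(n)                 # standard error of the sample mean
    sampling_dist = NormalDist(mu, se)   # plays the role of Excel's norm.dist
    return sampling_dist.cdf(mu + margin) - sampling_dist.cdf(mu - margin)


for n in (30, 50, 100, 400):
    print(n, round(prob_within(25, n), 2))
```

The loop prints one probability per sample size, so you can watch it climb as n grows.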

In this question we know µ. But make the intellectual leap to see that we can use the same method to estimate the location of µ if we did not know it. 

Wednesday, November 16, 2011

Chapter 6 Q20

An alert student has noticed a mistake in the textbook answers. Here I've worked through Q20 from Chapter 6.


a. We want the area to the left of 50, because the question asks for fewer than 50 hours. Use =norm.dist(50,77,20,true) to get 0.089

b. We want more than 100, the area to the right of 100. Excel adds up from the left, so we need to subtract from 1. Use =1-norm.dist(100,77,20,true) to get 0.125

c. The question asks for the “upper 20%”. We are asked to find the value of the random variable that corresponds to this area. In other words, what is the value of the random variable X which separates the top 20% from the bottom 80%? We can use =norm.inv for this, but we will need to write =norm.inv(0.8,77,20) to get 93.83. If we had written =norm.inv(0.2,77,20) then the result would be the value of the random variable X that separates the bottom 20% from the top 80%. Draw a sketch and make sure you get this. This posting on our blog might help make this clearer.
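The three parts can also be checked with a short Python sketch. This is only an illustration under the assumption that `NormalDist.cdf` and `NormalDist.inv_cdf` from Python's standard library behave like =norm.dist(...,true) and =norm.inv; the variable names are mine.

```python
from statistics import NormalDist

hours = NormalDist(mu=77, sigma=20)   # mean and standard deviation used in the formulas above

p_fewer_than_50 = hours.cdf(50)        # like =norm.dist(50,77,20,true)
p_more_than_100 = 1 - hours.cdf(100)   # like =1-norm.dist(100,77,20,true)
cutoff_top_20 = hours.inv_cdf(0.8)     # like =norm.inv(0.8,77,20)

print(round(p_fewer_than_50, 3), round(p_more_than_100, 3), round(cutoff_top_20, 2))
```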

Tuesday, November 15, 2011

Q4 p333


Q: A 95% confidence interval for a population mean was reported to be 152 to 160. If σ = 15, what sample size was used in this study?

A: The mean must be right in the middle of the CI, so here the mean, xbar, is 156. Therefore the Margin of Error was 4. Recall how we find the MofE in the first place? It is MofE =1.96*sigma/root n.

So in this case 4=(1.96 *15)/root n.

Switch the terms around so that root n = (1.96*15)/4 = 7.35. That’s the square root of the sample size, so square it to get n = 54.02. ROUND UP to get n = 55.
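Here is the same working backwards from the interval as a small Python sketch. The variable names are my own; the numbers are the ones in the question.

```python
from math import ceil

lower, upper = 152, 160   # reported 95% confidence interval
sigma = 15                # known population standard deviation
z = 1.96                  # z value for 95% confidence

xbar = (lower + upper) / 2        # the sample mean sits in the middle of the interval
margin = (upper - lower) / 2      # margin of error is half the width
root_n = z * sigma / margin       # rearranged from margin = z * sigma / root n
n = ceil(root_n ** 2)             # square it, then round UP

print(xbar, margin, round(root_n, 2), n)
```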

Sunday, November 13, 2011

Chapter 7 Good Questions Answered

Chapter 7 Q20b

Q. Mean length of employment is 17.5 weeks and population standard deviation is 4 weeks. You take a sample of 50 unemployed individuals for a follow-up study. What is the probability that the sample will provide a sample mean within one week of the population mean?

A. This question is asking you to find the area (which represents a probability) one week either side of the population mean. In Excel, go:

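=norm.dist(18.5,17.5,4/sqrt(50),true) - norm.dist(16.5,17.5,4/sqrt(50),true)

Note the third argument. The question is about a sample mean, so we use the standard error, 4/root 50 = 0.566, not the population standard deviation of 4. The result is about 0.92.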

Chapter 7 Q44

Survey results give the standard error of the mean as 20. The population standard deviation is 500.

a.       How large was the sample used? Answer: the SE = standard deviation/root n. So root n = standard deviation/SE. Here root n = 500/20 = 25. So n = 25 squared = 625.

b.      What is the probability that the point estimate was within plus/minus 25 of the population mean?

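 Answer: we are looking for the area between the population mean plus 25 and the population mean minus 25. The population mean isn't given, but this doesn't matter. Use any number you like. Here I've used 100. The point estimate is a sample mean, so the spread to use is the standard error (20), not the population standard deviation (500). In Excel: =norm.dist(125,100,20,true)-norm.dist(75,100,20,true), which gives about 0.79.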
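Both parts of Q44 can be sketched in Python. Because the question is about the point estimate (a sample mean), the sketch uses the standard error of 20 as the spread of the sampling distribution. `NormalDist` is from Python's standard library; the variable names are assumptions of mine.

```python
from statistics import NormalDist

sigma = 500   # population standard deviation
se = 20       # standard error of the mean from the survey

n = (sigma / se) ** 2            # root n = sigma / SE, so n is that squared

mu = 100                         # any value works; the answer does not depend on it
xbar_dist = NormalDist(mu, se)   # sampling distribution of the sample mean
p_within_25 = xbar_dist.cdf(mu + 25) - xbar_dist.cdf(mu - 25)

print(n, round(p_within_25, 4))
```

Try changing `mu` to any other number: the probability stays the same, which is why the missing population mean doesn't matter.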

Friday, November 11, 2011

Difference between binomdist and norm.dist

We use binomdist when:

1. The random variables are "discrete". For example number of cockroaches found in a jar would be discrete, because it pretty much has to be an integer

2. There is the idea of "trials"

binomdist gives a probability. Now, watch for "false" and "true". Here is a Youtube on this.

We use norm.dist when the random variable is continuous. This posting (in my blog here) might help.
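To see what "false" and "true" do in binomdist, here is a small Python sketch. The two helper functions are my own illustration, and the numbers (10 trials, probability 0.5) are made up rather than taken from the textbook.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials (binomdist with FALSE)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n trials (binomdist with TRUE)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


# Made-up example: 10 trials, probability of success 0.5 on each trial.
print(binom_pmf(5, 10, 0.5))   # exactly 5 successes
print(binom_cdf(5, 10, 0.5))   # 5 or fewer successes
```

"False" gives the probability of one exact count; "true" adds up all the counts from zero to that number.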

Tuesday, September 20, 2011

Cross-tabulation and scatter plots

Cross-tabulations and scatter plots can be used on the same data to give different insights. Using the Fortune dataset (available in the Datasets folder) I have made two Youtubes:

Click on the links to launch:

Scatter plots

Cross-tabulation

Monday, September 19, 2011

Data Sources for your Project

You can either collect your own data or use data from the Internet. There is plenty of it! Just remember to include a citation. Below are some links to data sources to get you going. If you find more, please share with me and I will add to the list. The library is an excellent place to start. And ask the helpful librarians if you can't find what you are looking for.


Vancouver
World Bank
 BC Data
Kwantlen Library…business stats

Monday, July 18, 2011

Tests and your projects

For your projects, an important requirement is that you write the hypotheses appropriate for your project and then carry out a statistical test. These tests are:

1. One population, probably testing the mean against some hypothesised mean. This is Chapter 9. For example, the Skytrain management claims that their trains are never more than 3 minutes behind schedule.

2. Two populations, testing whether there is any difference in the means of two populations. This is Chapter 10. For example, do cosmetics cost the same at London Drugs and Pharmasave?

3. Test of independence, or 'chi-squared'. This is Chapter 12. For example, do men or women drink a certain type of drink in Starbucks? You would want to get your data into a structure looking like a contingency table.

4. Regression, testing for the relationship between a dependent variable and one or more independent variables. This is Chapter 13. For example, does the unemployment rate have any relationship to the sale of used cars?

Note that we have to do a statistical test of your hypotheses using one or more of the methods above. To do this we need data. So if you can't find any data, then you'll need to think again.

You can collect data yourself, or from a reliable source. If you need any help, you have only to ask me. If you collect data yourself, be aware of privacy issues. No names or personal details, please.

Saturday, July 16, 2011

1 - norm.dist

Link to Youtube

When to use just norm.dist and when to subtract from one ? The key is remembering that Excel adds up from the LEFT of the distribution. So =norm.dist() will give you the probability smaller than whatever value for X you have put into the data. If you wanted 'greater' than some X value, then you must subtract from 1.

Example: the mean is 350, the standard deviation is 10. Find the probability of getting a value smaller than 335.

Solution: the question is asking us to find the probability of 335 or less. So = norm.dist(335, 350, 10, true) = 0.067 (rounded).

Example: what is the probability of finding a value of 370 or more? Now we want the area to the RIGHT of 370 in the distribution. So first find =norm.dist(370, 350, 10, true) = 0.98. And now subtract from 1 to give 1 - 0.98 = 0.02.


Note that with norm.dist we always use 'true'. The functions I've been using are for Excel 2010. For older versions of Excel, you miss out the dot.
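The same left-versus-right idea in a Python sketch, assuming `NormalDist.cdf` from the standard library as the equivalent of =norm.dist(...,true). Like Excel, it adds up from the left.

```python
from statistics import NormalDist

x = NormalDist(mu=350, sigma=10)

p_below_335 = x.cdf(335)       # area to the LEFT of 335, like =norm.dist(335,350,10,true)
p_above_370 = 1 - x.cdf(370)   # area to the RIGHT of 370, so subtract from 1

print(round(p_below_335, 3), round(p_above_370, 3))
```

Without the intermediate rounding to 0.98, the right-hand area comes out nearer 0.023 than 0.02.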

Friday, July 15, 2011

Minimum sample size

Here's a question: the boss specifies that the margin of error should be 5 at 95%. The standard deviation is known at 10. What is the minimum sample size required to achieve that margin of error?

This question is important because sampling is expensive and we don't want to take more samples than we need to achieve some specific level of accuracy. There is no point. You are throwing away money.

Recall how we find the margin of error at 95% by hand:

M of E = 1.96 * SE. And the standard error is the standard deviation / square root of sample size.

Here we are given the M of Error (the boss says make it 5) and we know the standard deviation is 10. Put in what you know:

5 = 1.96 * 10/square root of n

Do the algebra and switch around the 5 and the square root of n

so square root of n = 1.96 * 10/5

square root of n = 3.92

We want n not the square root of n. So SQUARE both sides to get

n = 3.92 * 3.92 = 15.37 BUT now you MUST round UP to 16. Not possible to have an incomplete sample.
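The algebra above can be wrapped up in a tiny Python function. This is just a sketch; the name `min_sample_size` is made up, and it assumes the population standard deviation is known, as in the question.

```python
from math import ceil


def min_sample_size(margin_of_error, sigma, z=1.96):
    """Smallest n for which z * sigma / root n is no bigger than the margin of error."""
    root_n = z * sigma / margin_of_error
    return ceil(root_n ** 2)   # always round UP to a whole observation


print(min_sample_size(5, 10))
```

If the boss wants 99% instead of 95%, pass a different z value (2.576) as the third argument.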

How many samples?

Here's a question: how many samples of size 3 can you draw from a population of 5? Here n = 3 and N = 5.
The formula is N!/(n! * (N-n)!). The ! means 'factorial'. You multiply out the number followed by all the integers smaller than the number. So 3! = 3 * 2 * 1 = 6. It's a shorthand.

To answer this question: the number of different samples that can be drawn is
5!/(3! * (5-3)!) = 120/(6 * 2) = 10

Note that there is a ! key on most calculators. You might want to practice a bit.
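If you don't have a calculator handy, here is a Python sketch that counts the samples three ways: with the factorial formula, with the built-in `math.comb`, and by listing every sample. The labels A to E for the five population members are my own invention.

```python
from itertools import combinations
from math import comb, factorial

N, n = 5, 3

by_formula = factorial(N) // (factorial(n) * factorial(N - n))   # N! / (n! * (N-n)!)
by_library = comb(N, n)                                          # built-in shortcut

# Listing the samples themselves, using the labels A to E for the five population members.
samples = list(combinations("ABCDE", n))

print(by_formula, by_library, len(samples))
```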

Friday, July 8, 2011

Skew

The 'skew' of a dataset indicates how much the distribution of the values in the variable differs from a symmetric distribution. A symmetric distribution (such as the normal distribution) can be divided into two parts by a vertical line drawn through the mean...which is also the median and the mode. By contrast, skewed data has different values for the mean and the median. The histogram here is of gasoline prices and shows some skew to the right. Try this yourself with the "Gasoline" dataset.
We can see that there are some high gas prices which 'drag' the data to the right. So we would expect the mean to be larger than the median. Which it is: mean is 234.96, median is 234.25. The difference is small and also by eye we can see that the skew isn't much. As a result the figure for 'Skewness' given in the Descriptive Statistics is only 0.07993. Note that this is a positive number, indicating a skew to the right (where the tail is). A negative skew is the same thing, but in reverse.

Tuesday, July 5, 2011

When to use the t distribution?


On pg 333, chapter 8, question 22, the sample size is less than 30, the SD is unknown, but in part (B) of the question it says "the population has a normal distribution".

So should you use the T distribution or normal distribution? Go for the T dist because the standard deviation is unknown. So, even if the population is normally distributed, if you don’t know its standard deviation, use the T. In fact I always use the T distribution in applied (real) work. It is safer, giving you a wider margin of error and therefore a wider confidence interval. Therefore the chances of rejecting a true null (and therefore making a Type 1 error) are smaller. Nice question, thanks RB.

Note that you should be able to find confidence intervals using the normal distribution and for the t-distribution  using the Data Analysis> Descriptive Stats. There is a Youtube here

Point to note: don't mix up the Margin of Error and the Confidence Interval. We need to find the Margin of Error to create the Confidence Interval. The Confidence Interval is the sample mean (xbar) plus/minus the Margin of Error.

Hypothesis Test Worked Example

You find that you get to love stats so much you want to take more courses! To pay the tuition fees, you get a job in a restaurant as a chef. The boss says that he always budgets for 120gm of smoked salmon in the smoked salmon omelette dish, but you suspect him of lying. You secretly measure the amounts over 42 samples. You find that the mean amount is 117gm with a sample standard deviation of 15gm. This is important: if the boss is lying you could blackmail him and get rich! But the point is you don't know if he is putting in too much or too little!
a. Write the hypotheses for a two-tailed test
b. Test at 95%
c. Report your results
d. What are your conclusions?
Solutions....
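a. H0: µ = 120 and Ha: µ ≠ 120. You don't know whether he is putting in too much or too little, so the alternative covers both directions.
b. This is a two-tailed test. The test statistic is t = (117 - 120)/(15/root 42) = -1.296, with 42 - 1 = 41 degrees of freedom. n > 30 so we could make use of the central limit theorem and use the normal distribution. But for myself I prefer to use the t distribution. So in Excel that’s =t.dist.2t(1.296,41)
giving you a p value of 0.2
c. We were asked to test at 95%, giving an alpha of 0.05. The p value that we calculated is larger than alpha (0.2 > 0.05) so following the rejection rule we fail to reject H0. Shame about that!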
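For anyone curious how the numbers come about without Excel, here is a rough Python sketch. Python's standard library has no t distribution, so the p value is approximated by numerically integrating the t density with Simpson's rule. That is a technique I have swapped in for illustration, not what Excel does internally, and the function names are made up.

```python
from math import exp, lgamma, pi, sqrt

mu0 = 120    # the boss's claim (hypothesised mean)
xbar = 117   # sample mean
s = 15       # sample standard deviation
n = 42       # sample size

t_stat = (xbar - mu0) / (s / sqrt(n))   # test statistic
df = n - 1                              # degrees of freedom


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    const = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return const * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p value: 1 minus the area between -|t| and +|t| (Simpson's rule)."""
    b = abs(t)
    h = b / steps                        # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t           # the density is symmetric about zero


p_value = two_tailed_p(t_stat, df)
alpha = 0.05
print(round(t_stat, 3), round(p_value, 2), "reject H0" if p_value < alpha else "fail to reject H0")
```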

Tuesday, June 21, 2011

zzz....z-scores!

Link to Youtube: z score calculations in Excel


Introduction
The z score of any individual observation tells us how far that individual observation is away from the mean of all the observations. The larger the z score, the further away. If the z score was zero, then the observation would be the same as the mean. It is 'on' the mean, so no distance away. We use the z scores to detect outlying observations, and it is fundamentally important in inferential statistics. Before we start, make sure you're familiar with the concept of the standard deviation.

How we calculate  z scores
The unit of measurement is the standard deviation....think about this. A metre or an inch is a unit of measurement. Why not use a standard deviation? The z score is measured in numbers of standard deviations. So the z score for any particular observation is the number of standard deviations that observation is away from the mean. Take a look at the sketch below.


We find a z score by using this equation:

z_i = (x_i - mean) / standard deviation

So the z score for any individual observation (that is the meaning of the little i) is the difference between the observation and the mean, divided by the standard deviation. So for example, if the mean was 10, the observation was 15, and one standard deviation was 5, then the z score would be 1. 15 is one standard deviation above the mean.

A z score can be negative as well as positive. A negative z score tells us that the observation is smaller than the mean. It is to the left of the mean.

z scores and outliers
We define an outlier as being an observation that is more than three standard deviations away from the mean. Because a z score is in units of standard deviations that means we can use the z score directly. This is what you do:

1. If you want to find the z score for any particular observation you can do the calculations by hand. Example: the observation is 600. The mean is 500. The standard deviation is 50. So the z score is:

(600-500)/50 = 2. So 600 has a z score of 2. 600 is two standard deviations away from the mean. It is not an outlier.

2. If you have a number of observations, you can use Excel. Please see the Youtube which uses the Earnings per share dataset, which is available to you in the Datasets folder. Go on, practise!
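The z score calculation and the outlier rule are easy to sketch in Python. The function names are my own, and the cutoff of three standard deviations is the definition used in this post.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def is_outlier(x, mean, sd, cutoff=3):
    """Outlier rule used in this post: more than `cutoff` standard deviations from the mean."""
    return abs(z_score(x, mean, sd)) > cutoff


print(z_score(600, 500, 50), is_outlier(600, 500, 50))
print(z_score(15, 10, 5))
```

The `abs` matters: an observation far below the mean has a large negative z score and is just as much an outlier.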

Saturday, June 18, 2011

Standard errors and why we are here!

Link to Youtube

The whole point of statistics is to use information from a sample to estimate a parameter in a population. Intuitively, wouldn't you agree that the larger the sample, the more accurate the estimate of the parameter? If we have a sample where n>=30, we can make use of the Central Limit Theorem to state that the sampling distribution of our favourite statistic, xbar, follows an approximately normal distribution (the website I just linked to has a very nice simulation).

The sampling distribution of xbar has its own mean and standard deviation. To avoid mixing up the standard deviation of the population and the standard deviation of the sampling distribution of xbar, we call the latter one the standard error. We find it by dividing the population standard deviation by the square root of the sample size, like this:

SE = sigma / root n


This is an incredibly important little equation which we'll see lots of times. Can you see that as the sample size increases, the SE (standard error) decreases? 

All this has a highly practical use for estimating a population mean. Which is what we are trying to do in the first place. To start off, let's just imagine that we happen to know the population mean, and it is $51800 (this is from the EAI dataset). The population standard deviation is $4000. We draw a sample of size n = 30. The standard error is

SE = 4000 / root 30 = 730.3


xbar, or the sample mean from our sample of 30, is let's say $52300. What is the probability that the sample mean is within $500 of the population mean? Now we can use Excel for this, recall the function =normdist? 
We want within $500 of the population mean, so that is from 51800 - 500 = 51300 to 51800 + 500 = 52300. Draw a little sketch....showing that we are looking for the area of the bit shaded in red. This will be the probability that the sample mean is within 500 plus or minus of the population mean.


We need to do two calculations in Excel, and then subtract one from the other to find the area in between, which is also the probability (of course!). Go

=normdist(52300,51800,730.3,true) and =normdist(51300,51800,730.3,true). You should end up with 0.7532 - 0.2468 = 0.5064. The meaning of this result: there is a 50/50 chance that the sample mean is within $500 of the population mean. Not too hopeful is it?
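Here is the n = 30 calculation as a Python sketch, written so that the sample size is easy to change. `NormalDist` from the standard library stands in for =normdist; the function name `prob_xbar_within` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist

mu = 51800     # population mean (EAI dataset, as quoted above)
sigma = 4000   # population standard deviation


def prob_xbar_within(margin, n):
    """Return the standard error and the probability that xbar is within +/- margin of mu."""
    se = sigma / sqrt(n)
    xbar_dist = NormalDist(mu, se)
    return se, xbar_dist.cdf(mu + margin) - xbar_dist.cdf(mu - margin)


se_30, p_30 = prob_xbar_within(500, 30)
print(round(se_30, 1), round(p_30, 4))
```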

Now try the same thing all over again, BUT increase the sample size to n = 100. What happens? Think through what exactly is going on here. Here's my output...but please do it yourself! And take a look at the Youtube on this:


                 

                 

Thursday, June 16, 2011

Empirical Rule Worked Example



Q6. The vodka dispensing machine keeps malfunctioning. However, it can be regulated to give µ ounces (I don't know: what measures did they use?). If the ounces of fill are normally distributed with a standard deviation of 0.4 ounces, at what value should we set µ so that a 6 ounce vodka mug will overflow only 2.5% of the time?

Answer: We know from the Empirical Rule that 95% of the observations are within 2 standard deviations of the mean. So that means that 5% of the observations are in the tails to the left and right of 95%. The normal distribution is symmetric, so each of the two tails contains 2.5% of the observations. We want to set the machine so that 6 ounces marks the point where 2.5% of the drinks overflow (dreadful waste of vodka if you ask me!). Look at my beautiful sketch. Make sure you can understand why having 2.5% in one tail means that the random variable that corresponds to that area must be 2 standard deviations from the mean.

We’re given that the standard deviation is 0.4 ounces. Two standard deviations is 0.4 * 2 = 0.8. So µ must be 0.8 ounces from 6 ounces. So we should set µ = 6 – 0.8 = 5.2 ounces.
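A quick Python sketch of the same reasoning, with a check against the exact normal curve. The Empirical Rule's "2 standard deviations" is a rounded figure, so the exact overflow rate comes out a little under 2.5%. Variable names are mine.

```python
from statistics import NormalDist

mug = 6       # ounces the mug holds
sigma = 0.4   # standard deviation of the fill

mu = mug - 2 * sigma   # Empirical Rule: put the mug two standard deviations above the mean

# Check with the exact normal curve: share of fills that exceed the mug.
p_overflow = 1 - NormalDist(mu, sigma).cdf(mug)

print(round(mu, 1), round(p_overflow, 4))
```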

Norminv and Normdist


Your boss says that 27% of customers order wine that costs above a certain amount. What is that amount? The mean is 24 and the standard deviation is 4.


Notice the word ‘above’. So we want to find the value of the random variable ‘x’....here the price of the wine....that separates the top 27% of the customers from the bottom---and here’s the key---73%. Recall that Excel goes from left to right. We know the percentage, want to find the value that gives that percentage. In this case, norminv is the boy! So go
=norminv(0.73,24,4)=26.45.

Now, a contrasting problem and more complex. In a restaurant, the bills are normally distributed. The mean bill is 28, and the standard deviation is 6. If 12 of the day’s bills are over 43.06 how many customers did the restaurant have that day?
First, find the percentage of the total bills occupied by the area to the right of 43.06. Here we have the value of the random variable and we want to get the area (or probability, whichever way you want to look at it). Recall that Excel adds from left to right....and our area is clearly to the right. It says ‘above’. So go
=1-normdist(43.06,28,6,true) and get 0.006.


This means that 0.6% of the restaurant’s bills were over 43.06.
Now, we know that 0.6% of the total number of bills is 12 in number. Call the total number of bills ‘L’. We are trying to find L.
0.006*L = 12.
Divide both sides by 0.006 to find L
L = 2000.
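Both problems can be sketched in Python, assuming `inv_cdf` and `cdf` from the standard library's `NormalDist` as the equivalents of norminv and normdist. The sketch computes the restaurant total twice: once with the share rounded to 0.006 as above, and once unrounded, which gives a slightly smaller number of customers.

```python
from statistics import NormalDist

# Wine: find the price that separates the top 27% from the bottom 73%.
wine = NormalDist(mu=24, sigma=4)
price_cutoff = wine.inv_cdf(0.73)          # like =norminv(0.73,24,4)

# Restaurant: find the share of bills above 43.06, then back out the total.
bills = NormalDist(mu=28, sigma=6)
share_over = 1 - bills.cdf(43.06)          # like =1-normdist(43.06,28,6,true)
total_rounded = 12 / round(share_over, 3)  # using the rounded 0.006, as in the post
total_unrounded = 12 / share_over          # using the unrounded share

print(round(price_cutoff, 2), round(share_over, 3), round(total_rounded), round(total_unrounded))
```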

Correlation example

Here's an interesting (and beautifully written) example of correlation

http://www.economist.com/blogs/dailychart/2011/06/obesity-and-driving&fsrc=nwl

This would have made a neat project for someone!

Sunday, June 12, 2011

Independence explained

Here is a simple way to 'get' the meaning of independence in probability. First, recall the test for independence:

if P(A|B)=P(A) then A and B are independent. The probability of B doesn't affect the probability of A. Think of examples when this does and doesn't happen. Using examples can make these concepts easier to grasp.

Now, how do we get P(A|B) ? It is P(A|B) = P(AnB)/P(B) ...the little 'n' means the intersection. In words, p of a given b equals the intersection of a and b divided by the probability of b.

How do we get P(AnB)? If A and B are independent, by multiplying together P(A) * P(B). Think of a probability tree. So for independent events we can rewrite

P(A|B) as P(A) * P(B)/P(B). Now, probabilities are just numbers, so we can cancel out the P(B) on the right hand side, leaving just P(A). Do you get it now?

Saturday, June 11, 2011

Probability tree


This is question 11 from the old midterm in the Documentation folder

11. The ship has two entertainers, who aren't very good. One is a lady opera singer, and the other a dance instructor. They hate each other and act quite independently. The opera singer has a bad temper and the probability of her being in a good enough temper to sing is 0.4. The dance instructor is frequently drunk and unable to stand, let alone teach dancing. The probability of him being drunk is 0.6. Draw a probability tree and find:



a. (6) That both of them are able to perform on any one night?

b. (6) The probability that one and only one of them can perform on any one night? 

Solution: notice the word 'independently'. This means that you can multiply the probabilities. They are independent, so P(A|B) = P(A).

Draw a probability tree (see the Youtube, link at top of page). 

Answer to 'a' is Probability singer sings * Probability dance instructor not drunk = 0.4 * 0.4 = 0.16

For 'b' the event contains two sample points: singer sings * dancer drunk + singer doesn't sing * dancer not drunk = 0.4 * 0.6 + 0.6 * 0.4 = 0.48
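The branches of the tree can be multiplied out in a few lines of Python. This is only a sketch with variable names of my own choosing.

```python
p_sing = 0.4            # opera singer is in a good enough temper to sing
p_drunk = 0.6           # dance instructor is drunk
p_teach = 1 - p_drunk   # dance instructor is sober enough to teach

# Independent events, so multiply along each branch of the tree.
p_both = p_sing * p_teach
p_exactly_one = p_sing * p_drunk + (1 - p_sing) * p_teach

print(round(p_both, 2), round(p_exactly_one, 2))
```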



Keenies 4 Q1


                          Company Size
Stock Options    Large    Small-to-midsized    Total
Yes                 40                   43       83
No                 149                  137      286
Total              189                  180      369




  a. P(offered stock options) = 83/369 = 0.2249
  b. P(small-to-midsized and did not offer stock options) = 137/369 = 0.3713
  c. P(small-to-midsized or offered stock options) = (180 + 83 - 43)/369 = 0.5962
  d. The probability of “small-to-midsized or offered stock options” includes the probability of “small-to-midsized and offered stock options”, the probability of “small-to-midsized but did not offer stock options” and the probability of “large and offered stock options”.
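The three probabilities can be read straight off the cell counts. Here is a Python sketch that stores the four cells in a dictionary (my own layout, with "small" standing for small-to-midsized) and applies the addition law for the "or" question.

```python
# Counts from the table: (company size, offered stock options) -> number of companies
counts = {
    ("large", "yes"): 40,
    ("small", "yes"): 43,
    ("large", "no"): 149,
    ("small", "no"): 137,
}
total = sum(counts.values())

n_options = sum(v for (size, opt), v in counts.items() if opt == "yes")
n_small = sum(v for (size, opt), v in counts.items() if size == "small")

p_options = n_options / total
p_small_and_no = counts[("small", "no")] / total
# Addition law: add the two totals, then subtract the overlap counted twice.
p_small_or_options = (n_small + n_options - counts[("small", "yes")]) / total

print(round(p_options, 4), round(p_small_and_no, 4), round(p_small_or_options, 4))
```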

Friday, June 10, 2011

Probability and Addition Law Chapter 4 Q30 (adapted)

Link to Youtube

Suppose we have two events, A and B with P(A)=0.50 and P(B)=0.60 and P(AnB)=0.40. (Sorry...the n means intersection)

a. Find P(AUB). This is the union of A and B, or the probability that a sample point is in A or B or both. Use the Addition Law, to give P(AUB) = P(A)+P(B)-P(AnB) = 0.50+0.60-0.40=0.70.

b. Find P(A|B). This is the probability of A given B. The sample space shrinks to B. Use the formula so P(A|B) = P(AnB)/P(B)
= 0.40/0.60=0.67.

c. Are A and B independent (this is a good question!). If they were independent, then it would be true that
P(A|B)=P(A). In other words, whatever happens to B doesn't affect A. Let's check: is this true?

We know that P(A|B)=0.67. We know that P(A)=0.50. Are they the same? No! So A and B aren't independent. 

d. Now, draw a Venn diagram yourself. You'll have two circles, one for each event. Is there an intersection? What is the probability of that intersection? What is the complement of the union?
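Parts a to d can be checked with a few lines of Python. This is a minimal sketch with my own variable names; `isclose` is used for the independence check because decimal probabilities are stored only approximately.

```python
from math import isclose

p_a, p_b, p_a_and_b = 0.50, 0.60, 0.40

p_a_or_b = p_a + p_b - p_a_and_b          # Addition Law
p_a_given_b = p_a_and_b / p_b             # conditional probability
independent = isclose(p_a_given_b, p_a)   # independent only if P(A|B) equals P(A)
p_neither = 1 - p_a_or_b                  # complement of the union

print(round(p_a_or_b, 2), round(p_a_given_b, 2), independent, round(p_neither, 2))
```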

Chapter One Q4 and Q10

These are the solutions to two questions in the textbook from Chapter 1.

4.            a.           10 just count the number of ‘elements’…the column on the left side

                b.            All brands of minisystems manufactured. This is because we have a sample of 10 minisystems. It says sample in the question. For sure there are many more than 10 minisystems being manufactured. Here we have a sample of ten from that population

                c.             Average price = 3140/10 = $314

                d.            $314

10.          a.            Quantitative; ratio scale of measurement
Age is a quantitative ratio variable. There is a meaningful zero.

                b.            Categorical; nominal scale of measurement

                c.             Categorical; ordinal scale of measurement since the responses can be ordered from earliest (high school) to latest (retirement)

                d.            Quantitative; ratio scale of measurement

                e.            Categorical; nominal scale of measurement


Thursday, June 9, 2011

Variance and standard deviation

Variance is a measure of the average squared distance of the observations from their mean. You can think of it as representing how ‘spread out’ the data are. Very widely dispersed means higher variance. Example: you always drink one cup of coffee a day. The variance of your drinking is zero. All the observations are on the mean (which is 1). BUT if you sometimes drink three cups (like me) for breakfast, sometimes none at all, then the variance won’t be zero.

The standard deviation is just the square root of the variance. We use the standard deviation because the variance is in ‘squared units’.
You don’t need to work out the variance or the standard deviation by hand using the equation BUT you do need to know:

what the variance and standard deviation are
their relationship to each other
how to get them in Excel
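To see the relationship in action, here is a Python sketch using the coffee example. The daily counts are made up for illustration, and it assumes the sample versions of the formulas (what Excel calls VAR.S and STDEV.S).

```python
from statistics import mean, stdev, variance

# Made-up data: cups of coffee per day over one week.
steady = [1, 1, 1, 1, 1, 1, 1]
erratic = [3, 0, 1, 3, 0, 2, 1]

for cups in (steady, erratic):
    var = variance(cups)   # sample variance, like Excel's VAR.S
    sd = stdev(cups)       # sample standard deviation, like Excel's STDEV.S
    print(mean(cups), round(var, 2), round(sd, 2))
```

Square the standard deviation on the second row and you get the variance back.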

Q36 Chapter 5

An alert and hardworking student has noticed that the answer in the textbook for Q36 on binomials seems to be wrong. I'd agree. I worked it out by hand and by Excel. You don't need the by hand part...I was just using this to double-check my work. Thanks to that student (no names, you know who you are!)



36.   a.     p = ¼ = .25

                
                 
                           

                            f (4) = BINOMDIST(4,20,.25,FALSE) = .1897

         b.     P(x  ≥  2)  =  1 – f(0) – f(1)

                 P(x  ≥  2)  =  1 – BINOMDIST(1,20,.25,TRUE) = .9757

         c.     f (12) = BINOMDIST(12,20,.25,FALSE) = .0008

                 And, with f (13) = .0002, f (14) = .0000, and so on, the probability of finding that 12 or more investors have exchange-traded funds in their portfolio is so small that it is highly unlikely that p = .25. In such a case, we would doubt the accuracy of the results and conclude that p must be greater than .25.

         d.     µ  =  n p  =  20 (.25)  =  5
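Here is how I might double-check these answers in Python instead of by hand. The helper `f` is my own and simply applies the binomial formula with n = 20 and p = .25.

```python
from math import comb

n, p = 20, 0.25


def f(k):
    """Binomial probability of exactly k successes, like BINOMDIST(k,20,.25,FALSE)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


f4 = f(4)
p_at_least_2 = 1 - f(0) - f(1)                       # same as 1 - BINOMDIST(1,20,.25,TRUE)
f12 = f(12)
p_12_or_more = sum(f(k) for k in range(12, n + 1))   # adds f(12), f(13), ... f(20)
expected = n * p

print(round(f4, 4), round(p_at_least_2, 4), round(f12, 4), round(p_12_or_more, 4), expected)
```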

Friday, May 13, 2011

About your projects

I've had a couple of folks talk to me about their plans, which are interesting. If you have an idea you'd like to discuss, you can e-mail me or drop by the office hours (Monday and Wednesday 2-4 in the Learning Centre).

You work on your own, but you can of course discuss your project with anyone. You can collect the data yourself, or use publicly available data, for example from the Internet. Happy to help you find data.

Reg Peplow Award

My father passed away last year, and to remember him I have started an award at Kwantlen. The award is for the student in BUQU 1230 who shows the most improvement. This doesn't mean the student with the highest marks. It does mean that someone who (for example) doesn't do very well on the first mid-term but who then starts to really work hard has a good chance of winning. It is worth about $700. Given out at the Kwantlen School of Business award ceremony in March.

Nice bar chart video example

The Economist magazine usually has terrific graphics, and here is an example of a set of bar charts of days off taken in different countries:

http://www.economist.com/node/21256080

This shows how it is possible to make stats really interesting.

Sunday, May 8, 2011

Ordinal variable example

Here is a question asking about type of variable (or scale of measurement, same thing)

One of the questions in the survey  is as follows: When did you first start reading the WSJ? High school, college, early career, mid-career, late career, or retirement?

There is an order here: you go to high school, then college and so on. So we know that the variable is at the least ordinal. But is it interval? For it to be interval there would have to be some sort of consistent set of units in the intervals. Here we don't have that. We could describe high school as 1, college as 2 and so on.....but could not do any arithmetic operations on them.

So this variable is ordinal and so is categorical.