BUQU 1230: June 2011

Tuesday, June 21, 2011

zzz....z-scores!

Link to Youtube: z score calculations in Excel

Introduction
The z score of any individual observation tells us how far that observation is from the mean of all the observations. The larger the z score, the further away. If the z score is zero, then the observation is the same as the mean. It is 'on' the mean, so no distance away. We use z scores to detect outlying observations, and they are fundamentally important in inferential statistics. Before we start, make sure you're familiar with the concept of the standard deviation.

How we calculate z scores
The unit of measurement is the standard deviation....think about this. A metre or an inch is a unit of measurement. Why not use a standard deviation? The z score is measured in numbers of standard deviations. So the z score for any particular observation is the number of standard deviations that observation is away from the mean. Take a look at the sketch below.

We find a z score by using this equation:

$$z_i = \frac{x_i - \bar{x}}{s}$$

So the z score for any individual observation (that is the meaning of the little i) is the difference between the observation and the mean, divided by the standard deviation. So for example, if the mean was 10, the observation was 15, and one standard deviation was 5, then the z score would be 1. 15 is one standard deviation above the mean.

A z score can be negative as well as positive. A negative z score tells us that the observation is smaller than the mean. It is to the left of the mean.

z scores and outliers
We define an outlier as being an observation that is more than three standard deviations away from the mean. Because a z score is in units of standard deviations that means we can use the z score directly. This is what you do:

1. If you want to find the z score for any particular observation you can do the calculations by hand. Example: the observation is 600. The mean is 500. The standard deviation is 50. So the z score is:

(600-500)/50 = 2. So 600 has a z score of 2. 600 is two standard deviations away from the mean. It is not an outlier.

2. If you have a number of observations, you can use Excel. Please see the Youtube which uses the Earnings per share dataset, which is available to you in the Datasets folder. Go on, practise!
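If you'd like to check the hand calculation outside Excel, here is a minimal Python sketch. The function names `z_score` and `is_outlier` are my own, made up for illustration, and the three-standard-deviation cutoff is the outlier rule defined above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mean) / sd


def is_outlier(x, mean, sd, cutoff=3):
    """Outlier rule from this post: more than 3 standard deviations from the mean."""
    return abs(z_score(x, mean, sd)) > cutoff


# Worked example from the post: observation 600, mean 500, standard deviation 50.
z = z_score(600, 500, 50)
print(z)                          # 2.0
print(is_outlier(600, 500, 50))   # False
```

A negative result from `z_score` simply means the observation is below the mean.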

Saturday, June 18, 2011

Standard errors and why we are here!

Link to Youtube

The whole point of statistics is to use information from a sample to estimate a parameter in a population. Intuitively, wouldn't you agree that the larger the sample, the more accurate the estimate of the parameter? If we have a sample where n>=30, we can make use of the Central Limit Theorem to state that the sampling distribution of our favourite statistic, xbar, follows an approximately normal distribution (the website I just linked to has a very nice simulation).

The sampling distribution of xbar has its own mean and standard deviation. To avoid mixing up the standard deviation of the population and the standard deviation of the sampling distribution of xbar, we call the latter one the standard error. We find it by dividing the population standard deviation by the square root of the sample size, like this:

$$SE = \frac{\sigma}{\sqrt{n}}$$

This is an incredibly important little equation which we'll see lots of times. Can you see that as the sample size increases, the SE (standard error) decreases?

All this has a highly practical use for estimating a population mean, which is what we are trying to do in the first place. To start off, let's just imagine that we happen to know the population mean, and it is $51800 (this is from the EAI dataset). The population standard deviation is $4000. We draw a sample of size n = 30. The standard error is

$$SE = \frac{4000}{\sqrt{30}} \approx 730.3$$

xbar, or the sample mean from our sample of 30, is let's say $52300. What is the probability that the sample mean is within $500 of the population mean? We can use Excel for this: recall the function =normdist?

We want within $500 of the population mean, so that is from 51800 - 500 = 51300 to 51800 + 500 = 52300. Draw a little sketch....showing that we are looking for the area of the bit shaded in red. This will be the probability that the sample mean is within 500 plus or minus of the population mean.

We need to do two calculations in Excel, and then subtract one from the other to find the area in between, which is also the probability (of course!). Go

=normdist(52300,51800,730.3,true) and =normdist(51300,51800,730.3,true). You should end up with 0.7532 - 0.2468 = 0.5064. The meaning of this result: there is only about a 51% chance, roughly 50/50, that the sample mean is within $500 of the population mean. Not too hopeful, is it?
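The same two-step calculation can be sketched in Python if you want to check your Excel work. This is only a sketch: the function name `prob_within` is my own invention, and `NormalDist` from Python's standard library stands in for =normdist with the cumulative flag set to true.

```python
from math import sqrt
from statistics import NormalDist


def prob_within(margin, pop_mean, pop_sd, n):
    """Probability that the sample mean lands within +/- margin of the population mean."""
    se = pop_sd / sqrt(n)                        # standard error of xbar
    sampling_dist = NormalDist(mu=pop_mean, sigma=se)
    upper = sampling_dist.cdf(pop_mean + margin)  # area to the left of the upper limit
    lower = sampling_dist.cdf(pop_mean - margin)  # area to the left of the lower limit
    return upper - lower                          # area in between


se_30 = 4000 / sqrt(30)
p_30 = prob_within(500, 51800, 4000, 30)
print(round(se_30, 1))   # 730.3
print(round(p_30, 4))    # 0.5064
```

Changing the last argument of `prob_within` lets you rerun the whole thing for a different sample size.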

Now try the same thing all over again, BUT increase the sample size to n = 100. What happens? Think through what exactly is going on here. Here's my output...but please do it yourself! And take a look at the Youtube on this:

Thursday, June 16, 2011

Empirical Rule Worked Example

Q6. The vodka dispensing machine keeps malfunctioning. However, it can be regulated to give µ ounces (I don't know: what measures did they use?). If the ounces of fill are normally distributed with a standard deviation of 0.4 ounces, at what value should we set µ so that a 6 ounce vodka mug will overflow only 2.5% of the time?

Answer: We know from the Empirical Rule that 95% of the observations are within 2 standard deviations of the mean. So that means that 5% of the observations are in the tails to the left and right of 95%. The normal distribution is symmetric, so each of the two tails contains 2.5% of the observations. We want to set the machine so that 6 ounces marks the point where 2.5% of the drinks overflow (dreadful waste of vodka if you ask me!). Look at my beautiful sketch. Make sure you can understand why having 2.5% in one tail means that the random variable that corresponds to that area must be 2 standard deviations from the mean.

We’re given that the standard deviation is 0.4 ounces. Two standard deviations is 0.4 * 2 = 0.8. So µ must be 0.8 ounces from 6 ounces. So we should set µ = 6 – 0.8 = 5.2 ounces.
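Here is a small Python check of that answer. It is a sketch under my own variable names; the second half uses the exact normal quantile rather than the Empirical Rule's round figure of 2 standard deviations, so it gives a slightly different setting.

```python
from statistics import NormalDist

mug_size = 6.0   # ounces
sd = 0.4         # ounces

# Empirical Rule: 2.5% in the upper tail sits about 2 standard deviations above the mean.
mu_empirical = mug_size - 2 * sd
print(round(mu_empirical, 2))   # 5.2

# Exact normal quantile for comparison (about 1.96 standard deviations).
z_exact = NormalDist().inv_cdf(0.975)
mu_exact = mug_size - z_exact * sd
print(round(mu_exact, 3))

# Share of drinks that overflow when the machine is set to 5.2 ounces.
overflow = 1 - NormalDist(mu=mu_empirical, sigma=sd).cdf(mug_size)
print(round(overflow, 4))
```

The Empirical Rule answer of 5.2 is slightly conservative: the overflow share at that setting comes out a little below 2.5%.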

Norminv and Normdist

Your boss says that 27% of customers order wine that costs above a certain amount. What is that amount? The mean is 24 and the standard deviation is 4.

Notice the word ‘above’. So we want to find the value of the random variable ‘x’....here the price of the wine....that separates the top 27% of the customers from the bottom---and here’s the key---73%. Recall that Excel goes from left to right. We know the percentage, want to find the value that gives that percentage. In this case, norminv is the boy! So go
=norminv(0.73,24,4)=26.45.
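For anyone checking this outside Excel, a minimal Python sketch follows. `inv_cdf` from the standard library plays the role of =norminv here; the variable names are mine.

```python
from statistics import NormalDist

wine_prices = NormalDist(mu=24, sigma=4)

# 27% of customers are above the cutoff, so 73% are below it (left to right).
cutoff = wine_prices.inv_cdf(1 - 0.27)
print(round(cutoff, 2))   # 26.45
```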

Now, a contrasting problem and more complex. In a restaurant, the bills are normally distributed. The mean bill is 28, and the standard deviation is 6. If 12 of the day’s bills are over 43.06 how many customers did the restaurant have that day?
First, find the percentage of the total bills occupied by the area to the right of 43.06. Here we have the value of the random variable and we want to get the area (or probability, whichever way you want to look at it). Recall that Excel adds from left to right....and our area is clearly to the right. The question says ‘over’. So go
=1-normdist(43.06,28,6,true) and get 0.006.

This means that 0.6% of the restaurant’s bills were over 43.06.
Now, we know that 0.6% of the total number of bills is 12 in number. Call the total number of bills ‘L’. We are trying to find L.
0.006*L = 12.
Divide both sides by 0.006 to find L
L = 2000.
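The restaurant problem can be sketched in Python too. The variable names are my own, and `cdf` stands in for =normdist with the cumulative flag set to true. Note that the answer of 2000 depends on rounding the tail area to 0.006 first, as in the working above.

```python
from statistics import NormalDist

bills = NormalDist(mu=28, sigma=6)

# Area to the right of 43.06: one minus the area to the left.
p_over = 1 - bills.cdf(43.06)
p_rounded = round(p_over, 3)
print(p_rounded)   # 0.006

# 0.006 * L = 12, so L = 12 / 0.006
total_bills = 12 / p_rounded
print(round(total_bills))   # 2000

# Without rounding the tail area first, the answer comes out slightly lower.
total_unrounded = 12 / p_over
print(round(total_unrounded))
```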

Correlation example

Here's an interesting (and beautifully written) example of correlation

http://www.economist.com/blogs/dailychart/2011/06/obesity-and-driving&fsrc=nwl

This would have made a neat project for someone!

Sunday, June 12, 2011

Independence explained

Here is a simple way to 'get' the meaning of independence in probability. First, recall the test for independence:

if P(A|B)=P(A) then A and B are independent. The probability of B doesn't affect the probability of A. Think of examples when this does and doesn't happen. Using examples can make these concepts easier to grasp.

Now, how do we get P(A|B) ? It is P(A|B) = P(AnB)/P(B) ...the little 'n' means the intersection. In words, p of a given b equals the intersection of a and b divided by the probability of b.

How do we get P(AnB)? When A and B are independent, by multiplying together P(A) * P(B). Think of a probability tree. So we can rewrite

P(A|B) as P(A) * P(B)/P(B). Now, probabilities are just numbers, so we can cancel out the P(B) on the right hand side, leaving just P(A). Do you get it now?
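If a concrete case helps, here is a small Python sketch that checks the test for independence by brute force. The example is my own, not from the textbook: two tosses of a fair coin, with A = "first toss is heads" and B = "second toss is heads".

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: every outcome of two fair coin tosses, all equally likely.
sample_space = list(product("HT", repeat=2))


def prob(event):
    """Probability of an event = favourable outcomes / all outcomes."""
    favourable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favourable, len(sample_space))


def a(outcome):
    return outcome[0] == "H"   # A: first toss is heads


def b(outcome):
    return outcome[1] == "H"   # B: second toss is heads


p_a = prob(a)
p_b = prob(b)
p_a_and_b = prob(lambda outcome: a(outcome) and b(outcome))
p_a_given_b = p_a_and_b / p_b   # P(A|B) = P(AnB) / P(B)

print(p_a_given_b == p_a)        # True, so A and B are independent
print(p_a_and_b == p_a * p_b)    # True, the multiplication shortcut holds
```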

Saturday, June 11, 2011

Probability tree

Link to Youtube

This is question 11 from the old midterm in the Documentation folder

11. The ship has two entertainers, who aren't very good. One is a lady opera singer, and the other a dance instructor. They hate each other and act quite independently. The opera singer has a bad temper and the probability of her being in a good enough temper to sing is 0.4. The dance instructor is frequently drunk and unable to stand, let alone teach dancing. The probability of him being drunk is 0.6. Draw a probability tree and find:

a. (6) The probability that both of them are able to perform on any one night?

b. (6) The probability that one and only one of them can perform on any one night?

Solution: notice the word 'independently'. This means that you can multiply the probabilities. They are independent, so P(A|B) = P(A).

Draw a probability tree (see the Youtube, link at top of page).

Answer to 'a' is Probability singer sings * Probability dance instructor not drunk = 0.4 * 0.4 = 0.16

For 'b' the event contains two sample points: singer sings * dancer drunk + singer doesn't sing * dancer not drunk = 0.4 * 0.6 + 0.6 * 0.4 = 0.24 + 0.24 = 0.48
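The probability tree can also be written out as a small Python sketch, with one entry per branch. The dictionary layout and names are my own way of setting it out; multiplying along each branch relies on the two entertainers being independent, as the question states.

```python
p_sing = 0.4             # singer is in a good enough temper to sing
p_drunk = 0.6            # dance instructor is drunk
p_teach = 1 - p_drunk    # dance instructor is sober enough to teach

# Each branch of the tree: (singer performs?, dancer performs?) -> probability
tree = {
    (True, True): p_sing * p_teach,
    (True, False): p_sing * p_drunk,
    (False, True): (1 - p_sing) * p_teach,
    (False, False): (1 - p_sing) * p_drunk,
}

both = tree[(True, True)]
exactly_one = tree[(True, False)] + tree[(False, True)]

print(round(both, 2))                 # 0.16
print(round(exactly_one, 2))          # 0.48
print(round(sum(tree.values()), 2))   # 1.0, the branches cover every case
```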

Keenies 4 Q1

Link to Youtube.

                     Company Size
Stock Options    Large    Small-to-midsized    Total
Yes                 40                   43       83
No                 149                  137      286
Total              189                  180      369

a. P(offered stock options) = 83/369 = 0.2249

P(small-to-midsized and did not offer stock options) = 137/369 = 0.3713
P(small-to-midsized or offered stock options) = (180 + 83 - 43)/369 = 0.5962
The probability of “small-to-midsized or offered stock options” includes the probability of “small-to-midsized and offered stock options”, the probability of “small-to-midsized but did not offer stock options” and the probability of “large and offered stock options”.
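As a cross-check, here is a short Python sketch that works the three probabilities straight from the table counts. The dictionary keys are labels I made up for the four cells of the table.

```python
# Cell counts from the table: (offered stock options?, company size) -> count
counts = {
    ("yes", "large"): 40,
    ("yes", "small"): 43,
    ("no", "large"): 149,
    ("no", "small"): 137,
}
total = sum(counts.values())   # 369

offered = counts[("yes", "large")] + counts[("yes", "small")]   # 83
small = counts[("yes", "small")] + counts[("no", "small")]      # 180

p_offered = offered / total
p_small_and_not_offered = counts[("no", "small")] / total
# Addition Law: subtract the overlap so it is not counted twice.
p_small_or_offered = (small + offered - counts[("yes", "small")]) / total

print(round(p_offered, 4))                 # 0.2249
print(round(p_small_and_not_offered, 4))   # 0.3713
print(round(p_small_or_offered, 4))        # 0.5962
```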

Friday, June 10, 2011

Probability and Addition Law Chapter 4 Q30 (adapted)

Link to Youtube

Suppose we have two events, A and B with P(A)=0.50 and P(B)=0.60 and P(AnB)=0.40. (Sorry...the n means intersection)

a. Find P(AUB). This is the union of A and B, or the probability that a sample point is in A or B or both. Use the Addition Law to give P(AUB) = P(A)+P(B)-P(AnB) = 0.50+0.60-0.40 = 0.70.

b. Find P(A|B). This is the probability of A given that B has occurred, so the sample space shrinks to B. Use the formula, so P(A|B) = P(AnB)/P(B)
= 0.40/0.60=0.67.

c. Are A and B independent? (This is a good question!) If they were independent, then it would be true that
P(A|B)=P(A). In other words, whatever happens to B doesn't affect A. Let's check: is this true?

We know that P(A|B)=0.67. We know that P(A)=0.50. Are they the same? No! So A and B aren't independent.

d. Now, draw a Venn diagram yourself. You'll have two circles, one for each event. Is there an intersection? What is the probability of that intersection? What is the complement of the union?
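Parts a to c can be checked with a few lines of Python. This is just a sketch with my own variable names; `isclose` is used for the independence check because decimal probabilities don't always compare exactly on a computer.

```python
from math import isclose

p_a = 0.50
p_b = 0.60
p_a_and_b = 0.40

# a. Addition Law
p_a_or_b = p_a + p_b - p_a_and_b
print(round(p_a_or_b, 2))      # 0.7

# b. Conditional probability
p_a_given_b = p_a_and_b / p_b
print(round(p_a_given_b, 2))   # 0.67

# c. Independent only if P(A|B) equals P(A)
independent = isclose(p_a_given_b, p_a)
print(independent)             # False
```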

Chapter One Q4 and Q10

These are the solutions to two questions in the textbook from Chapter 1.

4. a. 10 just count the number of ‘elements’…the column on the left side

b. All brands of minisystems manufactured. This is because we have a sample of 10 minisystems. It says sample in the question. For sure there are many more than 10 minisystems being manufactured. Here we have a sample of ten from that population

c. Average price = 3140/10 = $314

d. $314

10. a. Quantitative; ratio scale of measurement

Age is a quantitative ratio variable. There is a meaningful zero.

b. Categorical; nominal scale of measurement

c. Categorical; ordinal scale of measurement since the responses can be ordered from earliest (high school) to latest (retirement)

d. Quantitative; ratio scale of measurement

e. Categorical; nominal scale of measurement

Thursday, June 9, 2011

Variance and standard deviation

Variance is a measure of the average squared distance of the observations from their mean. You can think of it as representing how ‘spread out’ the data are. Very widely dispersed means higher variance. Example: you always drink one cup of coffee a day. The variance of your drinking is zero. All the observations are on the mean (which is 1). BUT if you sometimes drink three cups (like me) for breakfast, sometimes none at all, then the variance won’t be zero.

The standard deviation is just the square root of the variance. We use the standard deviation because the variance is in ‘squared units’.

You don’t need to work out the variance or the standard deviation by hand using the equation BUT you do need to know:

what the variance and standard deviation are

their relationship to each other

how to get them in Excel
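To see the coffee example in numbers, here is a minimal Python sketch. The daily cup counts are invented for illustration, and I've assumed the population versions of the formulas (`pvariance` and `pstdev`), which divide by n rather than n - 1.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical data: cups of coffee per day over five days.
steady = [1, 1, 1, 1, 1]   # always one cup
varied = [3, 0, 1, 0, 1]   # sometimes three, sometimes none

print(mean(steady), pvariance(steady))   # mean 1, variance 0
print(mean(varied), pvariance(varied))   # same mean, variance above zero

# The standard deviation is the square root of the variance.
sd_varied = pstdev(varied)
print(round(sd_varied ** 2, 4) == round(pvariance(varied), 4))   # True
```

Both lists have the same mean, so the variance is the only thing telling them apart.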

Q36 Chapter 5

An alert and hardworking student has noticed that the answer in the textbook for Q36 on binomials seems to be wrong. I'd agree. I worked it out by hand and by Excel. You don't need the by hand part...I was just using this to double-check my work. Thanks to that student (no names, you know who you are!)

36. a. p = ¼ = .25

f (4) = BINOMDIST(4,20,.25,FALSE) = .1897

b. P(x ≥ 2) = 1 – f(0) – f(1)

P(x ≥ 2) = 1 – BINOMDIST(1,20,.25,TRUE) = .9757

c. f (12) = BINOMDIST(12,20,.25,FALSE) = .0008

And, with f (13) = .0002, f (14) = .0000, and so on, the probability of finding that 12 or more investors have exchange-traded funds in their portfolio is so small that it is highly unlikely that p = .25. In such a case, we would doubt the accuracy of the results and conclude that p must be greater than .25.

d. µ = n p = 20 (.25) = 5
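Here is the by-hand double-check written as a Python sketch. The helper names `binom_pmf` and `binom_cdf` are my own; they mirror BINOMDIST with the last argument set to FALSE and TRUE respectively.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials (BINOMDIST with FALSE)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Probability of k or fewer successes in n trials (BINOMDIST with TRUE)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n, p = 20, 0.25

f_4 = binom_pmf(4, n, p)
p_at_least_2 = 1 - binom_cdf(1, n, p)
f_12 = binom_pmf(12, n, p)
expected = n * p

print(round(f_4, 4))            # 0.1897
print(round(p_at_least_2, 4))   # 0.9757
print(round(f_12, 4))           # 0.0008
print(expected)                 # 5.0
```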