WEBVTT - autoGenerated
00:00:00.000 --> 00:00:06.000
From now on, it will be recorded.
00:00:06.000 --> 00:00:14.000
Let me perhaps briefly go through the beginning again, so that the recording is complete.
00:00:14.000 --> 00:00:22.000
So we want to prove the Cauchy-Schwarz inequality, which says if X and Y are any random variables,
00:00:22.000 --> 00:00:31.000
then the covariance squared of X and Y is always less than or equal to the product
00:00:31.000 --> 00:00:38.000
of the variances of X and Y, so the product of V of X times V of Y.
00:00:38.000 --> 00:00:40.000
And the proof goes as follows.
00:00:40.000 --> 00:00:47.000
We just take any real number, lambda, and compute the variance of a new random variable
00:00:47.000 --> 00:00:55.000
lambda X plus Y, so this is like defining a new random variable, where applying our
00:00:55.000 --> 00:01:04.000
rules for variances, it is easy to see that the variance of lambda X plus Y is equal to
00:01:04.000 --> 00:01:09.000
lambda squared times the variance of X plus 2 times lambda times the covariance of X and
00:01:09.000 --> 00:01:13.000
Y plus the variance of Y.
00:01:13.000 --> 00:01:17.000
And this needs to be greater than or equal to 0, because a variance is always greater
00:01:17.000 --> 00:01:26.000
than or equal to 0, where actually it is typically greater than 0, unless this random
00:01:26.000 --> 00:01:34.000
variable lambda X plus Y is a constant, that is, unless it is a degenerate random variable.
00:01:34.000 --> 00:01:36.000
And this is sort of the marginal case.
00:01:36.000 --> 00:01:42.000
The main case is that the variance is actually greater than 0, so that we have greater here.
00:01:42.000 --> 00:01:48.000
And this is the case which we consider first.
00:01:48.000 --> 00:01:55.000
So we are here at the non-degenerate case, where the variance of lambda X plus Y is greater
00:01:55.000 --> 00:01:58.000
than 0.
00:01:58.000 --> 00:01:59.000
What does this mean?
00:01:59.000 --> 00:02:09.000
This means that there is no lambda in R, which solves the quadratic equation lambda squared
00:02:09.000 --> 00:02:20.000
V of X plus 2 lambda covariance X and Y plus V of Y is equal to 0.
00:02:20.000 --> 00:02:22.000
This is this equation here.
00:02:22.000 --> 00:02:29.000
Suppose that we have equality here, then we have a quadratic equation in lambda.
00:02:29.000 --> 00:02:37.000
Lambda squared V of X plus 2 lambda covariance X and Y plus V of Y.
00:02:37.000 --> 00:02:45.000
Because it is equal to 0, well, then we would be able to solve this quadratic equation for
00:02:45.000 --> 00:02:58.000
lambda and find some lambda in R, which makes the sum lambda X plus Y a degenerate random
00:02:58.000 --> 00:03:01.000
variable.
00:03:02.000 --> 00:03:08.000
Our assumption going into this part of the proof was that there is no such lambda.
00:03:08.000 --> 00:03:15.000
For each lambda in R, we have the non-degenerate case so that the variance of lambda X plus
00:03:15.000 --> 00:03:24.000
Y for any lambda in R is positive, which means that we cannot find a lambda which solves
00:03:24.000 --> 00:03:31.000
this equation with an equality sign here, which would lead to lambda squared V of X plus 2 lambda
00:03:31.000 --> 00:03:36.000
covariance X and Y plus V of Y being equal to 0.
00:03:36.000 --> 00:03:40.000
There would be no real valued solution.
00:03:40.000 --> 00:03:42.000
This is this equation here.
00:03:42.000 --> 00:03:48.000
If it is true that this thing here is greater than 0 for all lambda in R, then we would
00:03:48.000 --> 00:03:55.000
not be able to solve this equation with a real valued solution for lambda.
00:03:55.000 --> 00:04:01.000
But you know from your math lectures that such a quadratic equation always has a solution
00:04:01.000 --> 00:04:04.000
in the space of complex numbers.
00:04:04.000 --> 00:04:10.000
So there is a solution and we can technically derive this solution.
00:04:10.000 --> 00:04:19.000
It is the well-known lambda 1-2 formula, which in this case would be lambda 1-2 is equal
00:04:19.000 --> 00:04:25.000
to minus the covariance of X and Y divided by variance of X plus minus the square root
00:04:25.000 --> 00:04:35.000
of this term here, covariance of X and Y divided by variance of X squared minus the ratio of
00:04:35.000 --> 00:04:38.000
the variances V of Y and V of X.
00:04:38.000 --> 00:04:46.000
You can easily derive this equation here using your high school math.
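As a small editorial aside not in the original lecture, this lambda-1-2 formula can be checked numerically. A short Python sketch with made-up values for V(X), V(Y) and the covariance, chosen so that the covariance squared is smaller than the product of the variances:

```python
# Sketch with made-up coefficients: when cov^2 < V(X)*V(Y), the term under
# the square root in the lambda_{1,2} formula is negative, so the quadratic
# lambda^2 V(X) + 2 lambda Cov(X,Y) + V(Y) = 0 has no real solution.
import cmath

vx, vy, cov = 2.0, 3.0, 1.0          # illustrative values, cov**2 < vx*vy

disc = (cov / vx) ** 2 - vy / vx     # term under the square root
lam1 = -cov / vx + cmath.sqrt(disc)  # the lambda_{1,2} formula
lam2 = -cov / vx - cmath.sqrt(disc)

assert disc < 0                      # hence only complex-valued solutions
for lam in (lam1, lam2):
    # both complex roots satisfy the quadratic equation
    assert abs(lam ** 2 * vx + 2 * lam * cov + vy) < 1e-12
```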
00:04:46.000 --> 00:04:54.000
And what we now know is that if this equation here does not have any solution in the space
00:04:54.000 --> 00:05:00.000
of real numbers, so if it doesn't have a real valued solution, then apparently the term
00:05:00.000 --> 00:05:04.000
under the square root here must be negative.
00:05:04.000 --> 00:05:12.000
So in the case where the variance of lambda X plus Y is greater than 0 for all lambda,
00:05:12.000 --> 00:05:18.000
then we know that this term here must be negative because only if the term under the square root
00:05:18.000 --> 00:05:24.000
is negative, only then would we have complex valued solutions for lambda.
00:05:24.000 --> 00:05:30.000
Otherwise, if this term were 0 or positive, then we would have real valued solutions for
00:05:30.000 --> 00:05:38.000
lambda, but our presupposition was that there is no real valued solution for lambda.
00:05:38.000 --> 00:05:42.000
So this term here must be negative.
00:05:42.000 --> 00:05:44.000
So this we can use now.
00:05:44.000 --> 00:05:49.000
We know that this term here must be negative in the case where the variance of lambda X
00:05:49.000 --> 00:05:51.000
plus Y is positive.
00:05:51.000 --> 00:05:58.000
So we have the covariance of X and Y divided by variance of X squared is smaller than the
00:05:58.000 --> 00:06:05.000
ratio of the two variances, V of Y and V of X, which simply implies by multiplying
00:06:05.000 --> 00:06:12.000
with V of X squared that the covariance of X and Y squared is smaller than the product
00:06:12.000 --> 00:06:16.000
of the variances, V of X and V of Y, and this was our claim.
00:06:16.000 --> 00:06:18.000
So this we have proved now.
00:06:18.000 --> 00:06:27.000
This was the Cauchy-Schwarz inequality for the case where lambda X plus Y has a positive
00:06:27.000 --> 00:06:32.000
variance throughout.
00:06:32.000 --> 00:06:34.000
First part of the proof.
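As an editorial illustration of this first part, here is a Python sketch on a made-up discrete joint distribution: the strict inequality holds, and the quadratic from the proof stays positive for every lambda tried.

```python
# Numeric sanity check of part one of the proof: on an illustrative
# (made-up) discrete joint pmf, Cov(X,Y)^2 < V(X)*V(Y), and
# V(lambda*X + Y) = lambda^2 V(X) + 2 lambda Cov(X,Y) + V(Y) > 0.

pmf = {  # (x, y): P(X = x, Y = y) -- numbers are made up
    (0, 0): 0.2, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.3,
    (2, 0): 0.1, (2, 1): 0.2,
}

ex  = sum(p * x for (x, y), p in pmf.items())
ey  = sum(p * y for (x, y), p in pmf.items())
vx  = sum(p * (x - ex) ** 2 for (x, y), p in pmf.items())
vy  = sum(p * (y - ey) ** 2 for (x, y), p in pmf.items())
cov = sum(p * (x - ex) * (y - ey) for (x, y), p in pmf.items())

assert cov ** 2 < vx * vy            # strict Cauchy-Schwarz here

for lam in (-2.0, -0.5, 0.0, 0.7, 3.0):
    # variance of lambda*X + Y is positive for every lambda tried
    assert lam ** 2 * vx + 2 * lam * cov + vy > 0
```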
00:06:34.000 --> 00:06:39.000
Second part of the proof is concerned about the degenerate case.
00:06:39.000 --> 00:06:47.000
Suppose there is some lambda such that lambda X plus Y is constant, equal to some C, so that the variance
00:06:47.000 --> 00:06:50.000
of lambda X plus Y is 0.
00:06:50.000 --> 00:06:58.000
So C is just a constant and if it is possible to find some value of lambda such that lambda
00:06:58.000 --> 00:07:06.000
X plus Y is a constant, then obviously the variance of lambda X plus Y would be 0.
00:07:06.000 --> 00:07:12.000
So first thing we do is supposing that we have such a lambda that we take the expectation
00:07:12.000 --> 00:07:14.000
of this equation here.
00:07:14.000 --> 00:07:19.000
The expectation of this equation here is just this equation, lambda times the expected value
00:07:19.000 --> 00:07:26.000
of X plus the expected value of Y would also be C, would also be equal to C.
00:07:26.000 --> 00:07:32.000
Now subtract the second equation here from the first equation in order to center the
00:07:32.000 --> 00:07:40.000
random variables X and Y, then we would have lambda times X minus its expected value plus
00:07:40.000 --> 00:07:47.000
Y minus its expected value is equal to 0 because subtracting these two equations here, C minus
00:07:47.000 --> 00:07:54.000
C is equal to 0 and therefore we can also write that Y minus mu Y is equal to minus
00:07:54.000 --> 00:08:01.000
lambda times X minus mu X for the specific value of lambda.
00:08:01.000 --> 00:08:10.000
So we have what we call an exact linear dependency between Y and X. Y, at least the centered
00:08:10.000 --> 00:08:18.000
variable Y, is just a linear transformation of the centered variable X. So there is an
00:08:18.000 --> 00:08:26.000
exact linear dependency between the two and the factor of linearity is this lambda here.
00:08:26.000 --> 00:08:30.000
It is then easy to calculate the covariance.
00:08:30.000 --> 00:08:37.000
The covariance of X and Y is, as you know, the covariance of X minus mu X and Y minus
00:08:37.000 --> 00:08:43.000
mu Y because computing the covariance, we always center the random variables and now
00:08:43.000 --> 00:08:53.000
we can replace Y minus mu Y by minus lambda times X minus mu X because we know that Y
00:08:53.000 --> 00:08:59.000
minus mu Y is just a linear function of X minus mu X.
00:08:59.000 --> 00:09:06.000
So it's the covariance of X minus mu X with minus lambda times X minus mu X, which implies
00:09:06.000 --> 00:09:12.000
that this is just the expectation of minus lambda times X minus mu X squared and the
00:09:12.000 --> 00:09:18.000
expectation of X minus mu X squared is, of course, the variance of X. So this is equal
00:09:18.000 --> 00:09:23.000
to minus lambda times the variance of X.
00:09:23.000 --> 00:09:30.000
If we square this equation here, then we get the squared covariance of X and Y is just
00:09:30.000 --> 00:09:38.000
equal to lambda squared times the variance of X squared.
00:09:38.000 --> 00:09:44.000
We are almost there then to complete the proof because we know that the variance of Y is
00:09:44.000 --> 00:09:51.000
the expectation of the squared centered random variable, Y minus mu Y squared, and that is by the linear dependency
00:09:51.000 --> 00:09:57.000
the expectation of lambda squared X minus mu X squared and therefore it is equal to
00:09:57.000 --> 00:10:01.000
lambda squared V of X.
00:10:01.000 --> 00:10:05.000
So watch out, this is lambda squared V of X here.
00:10:05.000 --> 00:10:09.000
Here we have lambda squared V of X squared.
00:10:09.000 --> 00:10:14.000
So there's just one factor of V of X more in this expression than in that expression.
00:10:14.000 --> 00:10:23.000
So we can substitute lambda squared V of X for V of Y in this expression here and therefore
00:10:23.000 --> 00:10:32.000
we get that V of Y times V of X is equal to lambda squared V of X squared, which is equal to the covariance
00:10:32.000 --> 00:10:36.000
of X and Y squared.
00:10:36.000 --> 00:10:38.000
And here we are, right?
00:10:38.000 --> 00:10:46.000
Equation three here is exactly what we claim now in the limiting case where the squared
00:10:46.000 --> 00:10:53.000
covariance in the Cauchy-Schwarz inequality is equal to the product of the variances of
00:10:53.000 --> 00:10:56.000
Y and X.
00:10:56.000 --> 00:11:02.000
So equation three along with equation one which we had established in the previous case
00:11:02.000 --> 00:11:07.000
where the variance of lambda X plus Y was greater than zero.
00:11:07.000 --> 00:11:13.000
The two of them together imply that the covariance of X and Y squared is less than or equal to
00:11:13.000 --> 00:11:18.000
the product of the variances V of X and V of Y, and that is the claim of the Cauchy-Schwarz
00:11:18.000 --> 00:11:23.000
inequality which we have now proved.
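The degenerate case can also be sketched in Python (the values of lambda, c and the pmf of X are made up): with Y set to an exact linear function of X, the covariance and variance identities from the proof hold and Cauchy-Schwarz becomes an equality.

```python
# Degenerate case: with made-up lam and c, set Y = -lam*X + c exactly, so
# lam*X + Y is the constant c. Then Cov(X,Y) = -lam*V(X), V(Y) = lam^2*V(X),
# and the Cauchy-Schwarz inequality holds with equality.

xs  = [0, 1, 2, 3]            # hypothetical values of X
ps  = [0.1, 0.4, 0.3, 0.2]    # their probabilities
lam, c = 1.5, 2.0             # illustrative constants
ys  = [-lam * x + c for x in xs]

ex  = sum(p * x for p, x in zip(ps, xs))
ey  = sum(p * y for p, y in zip(ps, ys))
vx  = sum(p * (x - ex) ** 2 for p, x in zip(ps, xs))
vy  = sum(p * (y - ey) ** 2 for p, y in zip(ps, ys))
cov = sum(p * (x - ex) * (y - ey) for p, x, y in zip(ps, xs, ys))

assert abs(cov - (-lam) * vx) < 1e-12     # Cov(X,Y) = -lam * V(X)
assert abs(vy - lam ** 2 * vx) < 1e-12    # V(Y) = lam^2 * V(X)
assert abs(cov ** 2 - vx * vy) < 1e-12    # equality in Cauchy-Schwarz
```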
00:11:23.000 --> 00:11:28.000
Are there any questions concerning this proof?
00:11:28.000 --> 00:11:33.000
No, I don't see any.
00:11:33.000 --> 00:11:37.000
Good, then let's proceed.
00:11:37.000 --> 00:11:46.000
As I have already said, a covariance as a measure of some type of dependency between
00:11:46.000 --> 00:11:54.000
two random variables is hard to interpret because the covariance can take in principle
00:11:54.000 --> 00:11:57.000
any value between minus infinity and plus infinity.
00:11:57.000 --> 00:12:03.000
By just changing units, the covariance can be increased or decreased.
00:12:03.000 --> 00:12:10.000
The only information we get from a covariance is whether the covariance is negative or positive
00:12:10.000 --> 00:12:15.000
which cannot be changed by a choice of units, obviously, but that is not a great deal of
00:12:15.000 --> 00:12:23.000
information and we would like to have some more knowledge about the strength of the dependency
00:12:23.000 --> 00:12:26.000
between two random variables.
00:12:26.000 --> 00:12:32.000
Therefore, we need some better measure than the covariance since it doesn't make sense
00:12:32.000 --> 00:12:38.000
to say the covariance is great or the covariance is small; as I have already explained, it depends
00:12:38.000 --> 00:12:44.000
on the units in which the two random variables are measured.
00:12:44.000 --> 00:12:50.000
We will therefore normalize the covariance and for this normalization, we will use the
00:12:50.000 --> 00:12:52.000
Cauchy-Schwarz inequality.
00:12:52.000 --> 00:12:56.000
That's why we bothered about this inequality.
00:12:56.000 --> 00:13:02.000
The way to normalize this is that we define the coefficient of correlation.
00:13:02.000 --> 00:13:07.000
The definition, as you probably still know from undergraduate statistics, goes as follows: for any
00:13:07.000 --> 00:13:16.000
pair of random variables x and y, the coefficient of correlation rho index x, y is defined
00:13:16.000 --> 00:13:23.000
as the covariance between x and y divided by the product of the standard deviations of x and y.
00:13:23.000 --> 00:13:30.000
A very easy concept actually just take the covariance and divide by the standard deviations.
00:13:30.000 --> 00:13:40.000
Then we already know that the covariance, in absolute value, is always less than or equal to the product of the standard deviations.
00:13:40.000 --> 00:13:48.000
We know that this expression here is less than or equal to one.
00:13:48.000 --> 00:13:56.000
That's implied by the Cauchy-Schwarz inequality, rho squared for random variables x and y is
00:13:56.000 --> 00:14:02.000
equal to covariance of x and y squared divided by the variance of x and the variance of y,
00:14:02.000 --> 00:14:06.000
and that is less than or equal to one.
00:14:06.000 --> 00:14:12.000
Therefore we know that the coefficient of correlation must always lie between minus
00:14:12.000 --> 00:14:16.000
one and one, including the two limiting cases.
00:14:16.000 --> 00:14:24.000
It is in the case of a perfect linear dependency between x and y possible that rho takes on
00:14:24.000 --> 00:14:31.000
the value minus one or the value one, so we may have coefficients of correlation
00:14:31.000 --> 00:14:37.000
which are actually equal to minus one or equal to one, but the coefficient of correlation
00:14:37.000 --> 00:14:44.000
cannot be any greater than one or any smaller than minus one.
00:14:44.000 --> 00:14:47.000
That follows from Cauchy-Schwarz.
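A quick Python sketch of the coefficient of correlation on a made-up joint distribution, checking both the Cauchy-Schwarz bound and the invariance to a change of units discussed above:

```python
# rho = Cov(X,Y) / (sd(X) * sd(Y)) on a made-up joint pmf; by Cauchy-Schwarz
# it lies in [-1, 1], and rescaling X (changing units) leaves it unchanged.
import math

pmf = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

ex  = sum(p * x for (x, y), p in pmf.items())
ey  = sum(p * y for (x, y), p in pmf.items())
vx  = sum(p * (x - ex) ** 2 for (x, y), p in pmf.items())
vy  = sum(p * (y - ey) ** 2 for (x, y), p in pmf.items())
cov = sum(p * (x - ex) * (y - ey) for (x, y), p in pmf.items())

rho = cov / (math.sqrt(vx) * math.sqrt(vy))
assert -1.0 <= rho <= 1.0

# measuring X in different units scales cov and sd(X) by the same factor
scale = 100.0
rho_scaled = (scale * cov) / (math.sqrt(scale ** 2 * vx) * math.sqrt(vy))
assert abs(rho - rho_scaled) < 1e-12
```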
00:14:47.000 --> 00:14:54.000
So now we have a normalized measure of a dependency between two random variables x
00:14:54.000 --> 00:15:02.000
and y, and we can clearly see whether this dependency is strong or weak.
00:15:02.000 --> 00:15:11.000
We would say that the dependency is strong if rho is close in absolute value to one.
00:15:11.000 --> 00:15:18.000
If there is an exact positive linear association between x and y, then of course we would have
00:15:18.000 --> 00:15:25.000
rho being equal to one, and such an exact positive linear association between x and
00:15:25.000 --> 00:15:31.000
y would be the same thing as having lambda being negative in equation two in the proof
00:15:31.000 --> 00:15:35.000
of the Cauchy-Schwarz inequality.
00:15:36.000 --> 00:15:41.000
Conversely, if there is an exact negative linear association between x and y, then rho would
00:15:41.000 --> 00:15:50.000
be equal to minus one, and that would correspond to positive lambda in equation two.
00:15:50.000 --> 00:15:55.000
Very interesting is always the case that there is no linear association between x and y at
00:15:55.000 --> 00:15:57.000
all.
00:15:57.000 --> 00:16:03.000
That would be expressed by the fact that rho is equal to zero, so that we would have no
00:16:03.000 --> 00:16:09.000
co-movement, apparently no dependency between x and y.
00:16:09.000 --> 00:16:15.000
You have to be careful with the wording here, because actually the fact that the coefficient
00:16:15.000 --> 00:16:21.000
of correlation is equal to zero does not yet imply that there is no dependency at all between
00:16:21.000 --> 00:16:22.000
x and y.
00:16:22.000 --> 00:16:27.000
So it does not imply that x and y are independent in the statistical sense; they may well not be
00:16:27.000 --> 00:16:34.000
stochastically independent, or at least not necessarily so, but we would not have any
00:16:34.000 --> 00:16:37.000
type of linear correlation.
00:16:37.000 --> 00:16:43.000
Or if you want to express it in terms of covariances, we would know that rho of x and y equal to
00:16:43.000 --> 00:16:51.000
zero implies that the covariance between x and y is equal to zero, because this is here
00:16:51.000 --> 00:16:54.000
in the numerator of this expression.
00:16:54.000 --> 00:16:59.000
So if rho is equal to zero, then obviously the covariance between x and y must be equal
00:16:59.000 --> 00:17:08.000
to zero, and loosely speaking, it means that there is at least no strong dependency between
00:17:08.000 --> 00:17:14.000
x and y.
00:17:14.000 --> 00:17:20.000
So we have a measure now, the coefficient of correlation, which gives us some standard
00:17:20.000 --> 00:17:31.000
by which we can judge what strength of linear association we have between two variables.
00:17:31.000 --> 00:17:36.000
And as I already said, if rho were to take a value close to zero, then we would think
00:17:36.000 --> 00:17:40.000
that this association is rather weak.
00:17:40.000 --> 00:17:48.000
As a rule of thumb in empirical work, you typically take correlations which are, say,
00:17:48.000 --> 00:17:56.000
less than 0.3 or even 0.2 in absolute value, as weak correlations.
00:17:56.000 --> 00:18:04.000
And correlations which are greater than, say, 0.5 are considerable linear associations between
00:18:04.000 --> 00:18:12.000
variables, which you should typically pay some attention to when you work empirically.
00:18:13.000 --> 00:18:23.000
Obviously, the closer the values of the coefficient of correlation are to plus or minus 1,
00:18:23.000 --> 00:18:29.000
the stronger the interdependency between variables x and y is.
00:18:29.000 --> 00:18:33.000
So you may have rather strong associations up to the point where you would say perhaps
00:18:33.000 --> 00:18:39.000
variable x doesn't give you much additional information beyond the information which
00:18:40.000 --> 00:18:47.000
variable y gives, because they are so closely correlated that actually it seems they basically
00:18:47.000 --> 00:18:54.000
indicate the same thing, and therefore the additional informational value of the second variable
00:18:54.000 --> 00:18:57.000
would not be very great anymore.
00:18:57.000 --> 00:19:05.000
Okay, this was correlation and its relationship to covariances.
00:19:05.000 --> 00:19:13.000
Let us now come to independence, which, as I already have mentioned, is a stronger concept
00:19:13.000 --> 00:19:21.000
than having a correlation of zero, that is, than having a coefficient of correlation of zero.
00:19:21.000 --> 00:19:27.000
So let us first formally define independence of two random variables, which I call x and
00:19:27.000 --> 00:19:31.000
y again, and the definition is rather easy.
00:19:31.000 --> 00:19:38.000
X and y are called independent, sometimes we say stochastically independent, if and
00:19:38.000 --> 00:19:45.000
only if for all values x and y, which the random variables capital X and capital Y may
00:19:45.000 --> 00:19:57.000
assume, it is true that the joint density f x y of value small x small y is equal to
00:19:57.000 --> 00:20:06.000
the product of the marginal densities, f x of x times f y of y.
00:20:06.000 --> 00:20:13.000
So if the joint density is equal to the product of the marginal densities, then we say that
00:20:13.000 --> 00:20:17.000
random variables x and y are independent.
00:20:17.000 --> 00:20:25.000
And this equation here, this equality here needs to hold for all values of x and y in
00:20:25.000 --> 00:20:32.000
the domain of the random variables x and y.
00:20:32.000 --> 00:20:40.000
Again the definition applies to both continuous and discrete random variables.
00:20:40.000 --> 00:20:46.000
If we just focus on discrete random variables, then we can rewrite the independence condition
00:20:46.000 --> 00:20:53.000
in terms of probabilities, which as you know, we cannot do in the case of continuous random
00:20:53.000 --> 00:20:58.000
variables, because there the probabilities of taking a single value small x or a single
00:20:58.000 --> 00:21:02.000
value small y are all equal to zero.
00:21:02.000 --> 00:21:08.000
But in the case of discrete random variables, we would have as an equivalent definition
00:21:08.000 --> 00:21:15.000
of independence that the probability, the joint probability, of random variable x
00:21:15.000 --> 00:21:23.000
taking on value small x and random variable y taking on value small y, the joint probability
00:21:23.000 --> 00:21:30.000
of this event here would have to be equal to the probability of capital X being equal
00:21:30.000 --> 00:21:36.000
to small x times the probability of capital Y being equal to small y.
00:21:36.000 --> 00:21:42.000
So to the product of the two marginal probabilities.
00:21:42.000 --> 00:21:50.000
I leave it as an exercise for you to give a proof for this particular case, that this condition
00:21:50.000 --> 00:22:00.000
is implied by and implies our definition of independence in the case of discrete random variables.
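This discrete independence condition is easy to check mechanically. A Python sketch on a hypothetical joint pmf: recover the marginals from the joint and test the factorization at every point.

```python
# Checking the discrete independence condition on a made-up joint pmf:
# compute the marginals from the joint and test P(X=x, Y=y) = P(X=x)*P(Y=y).

joint = {(0, 0): 0.10, (0, 1): 0.30,
         (1, 0): 0.15, (1, 1): 0.45}

px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p       # marginal probability of X = x
    py[y] = py.get(y, 0.0) + p       # marginal probability of Y = y

# this particular joint pmf factors, so X and Y are independent
assert all(abs(p - px[x] * py[y]) < 1e-12 for (x, y), p in joint.items())
```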
00:22:00.000 --> 00:22:08.000
Notationally we use this notation here for x and y being independent.
00:22:08.000 --> 00:22:15.000
So you read this just as x is independent of y, but the understanding is actually that
00:22:15.000 --> 00:22:18.000
this is a mutual thing.
00:22:18.000 --> 00:22:27.000
Y is also independent of x and the symbol we use is like a capital pi, the Greek capital
00:22:27.000 --> 00:22:34.000
pi, a letter turned on its head.
00:22:34.000 --> 00:22:44.000
So what kind of implications do we have from the independence definition that I just introduced?
00:22:44.000 --> 00:22:50.000
So the first implication is that independence implies zero covariance.
00:22:50.000 --> 00:22:57.000
So if x and y are stochastically independent, then it is implied that the covariance between
00:22:57.000 --> 00:22:59.000
x and y is zero.
00:22:59.000 --> 00:23:07.000
Obviously, this also implies that the correlation between x and y is zero, because recall in
00:23:07.000 --> 00:23:14.000
the numerator of the coefficient of correlation, we had the covariance.
00:23:14.000 --> 00:23:21.000
So if the covariance is equal to zero, then obviously the correlation is also equal to zero.
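The first implication can also be verified on a toy example (marginals are made up): building the joint pmf as a product of marginals, i.e. imposing independence, forces the covariance to come out as zero.

```python
# Independence implies zero covariance: build the joint pmf as the product
# of made-up marginals and verify that the covariance is zero.

px = {0: 0.3, 1: 0.7}
py = {-1: 0.5, 1: 0.5}
joint = {(x, y): px[x] * py[y] for x in px for y in py}

ex  = sum(p * x for (x, y), p in joint.items())
ey  = sum(p * y for (x, y), p in joint.items())
cov = sum(p * (x - ex) * (y - ey) for (x, y), p in joint.items())

assert abs(cov) < 1e-12   # hence the correlation is zero as well
```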
00:23:21.000 --> 00:23:28.000
Moreover, independence implies that the conditional mean equals the unconditional mean.
00:23:28.000 --> 00:23:36.000
So if x and y are independent, then this implies that the expectation of y given x
00:23:36.000 --> 00:23:40.000
is equal to the expectation of y.
00:23:40.000 --> 00:23:45.000
The same holds, of course, for the expectation of x given y, that would be equal to the expectation
00:23:45.000 --> 00:23:49.000
of x.
00:23:49.000 --> 00:23:56.000
What is important to note is that uncorrelated random variables are not necessarily independent.
00:23:56.000 --> 00:24:04.000
So if the coefficient of correlation between x and y is equal to zero, this does not yet
00:24:04.000 --> 00:24:10.000
necessarily imply that x and y are independent.
00:24:10.000 --> 00:24:14.000
Independence is a stronger concept than correlation.
00:24:14.000 --> 00:24:22.000
However, in the important case of normally distributed variables, a correlation of zero,
00:24:22.000 --> 00:24:28.000
so coefficient of correlation being equal to zero, does already imply independence of
00:24:28.000 --> 00:24:30.000
the two random variables.
00:24:30.000 --> 00:24:35.000
Because it is a general property of the normal distribution, of which we will speak
00:24:35.000 --> 00:24:41.000
a little later in chapter two, it is a general property of the normal distribution
00:24:41.000 --> 00:24:46.000
that it is completely characterized already by its first two moments.
00:24:46.000 --> 00:24:52.000
So I haven't introduced the concept of a moment yet, but we'll come to this soon.
00:24:52.000 --> 00:24:57.000
It is characterized by the expectation of the random variable; the normal distribution
00:24:57.000 --> 00:25:04.000
is fully characterized by its expectation and by the variances and
00:25:04.000 --> 00:25:10.000
covariances of the random variables which are jointly normally distributed.
00:25:10.000 --> 00:25:15.000
So these two numbers already give us all the information we need about the distribution
00:25:15.000 --> 00:25:21.000
of the variable in case we know that it is a normal distribution.
00:25:21.000 --> 00:25:27.000
So since it is already fully characterized by the variances and covariances, a correlation
00:25:27.000 --> 00:25:35.000
of zero between x and y in the case of normally distributed variables implies that x and y are independent.
00:25:35.000 --> 00:25:42.000
But that is not true necessarily for variables which are not normally distributed.
00:25:42.000 --> 00:25:50.000
So in the general case, x and y may not be independent even though their coefficient
00:25:50.000 --> 00:25:54.000
of correlation is zero.
00:25:54.000 --> 00:26:04.000
So if two variables are uncorrelated, it is still possible that they are dependent via
00:26:04.000 --> 00:26:11.000
some nonlinear relationship.
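A standard counterexample, not taken from the lecture's slides, makes this concrete in a few lines of Python: X uniform on {-1, 0, 1} and Y = X squared are uncorrelated although Y is a deterministic function of X.

```python
# Classic counterexample: X uniform on {-1, 0, 1} and Y = X^2 are
# uncorrelated, yet Y is completely determined by X.

xs = [-1, 0, 1]
p  = 1.0 / 3.0

ex  = sum(p * x for x in xs)                       # E[X] = 0
ey  = sum(p * x ** 2 for x in xs)                  # E[Y] = E[X^2]
cov = sum(p * (x - ex) * (x ** 2 - ey) for x in xs)
assert abs(cov) < 1e-12                            # so rho = 0

# ...but the joint pmf does not factor: P(X=0, Y=0) = 1/3,
# while P(X=0) * P(Y=0) = 1/9, so X and Y are dependent.
assert abs(p - p * p) > 1e-12
```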
00:26:11.000 --> 00:26:18.000
We come back to this in a minute, but first let us look at some properties in more detail.
00:26:18.000 --> 00:26:24.000
One important property which we will often make use of is the independence under continuous
00:26:24.000 --> 00:26:27.000
transformations.
00:26:27.000 --> 00:26:32.000
This property is very easy actually to state and I think it is also easy to understand.
00:26:32.000 --> 00:26:38.000
If x and y are independent random variables and we have two continuous functions, let's
00:26:38.000 --> 00:26:46.000
call them g and h, then we can define two new random variables as capital G and capital
00:26:46.000 --> 00:26:55.000
H, where capital G is the continuous function small g applied to the random variable x and
00:26:55.000 --> 00:27:00.000
h is the function small h of the random variable y.
00:27:00.000 --> 00:27:07.000
So these two new random variables g and h are also independent random variables.
00:27:07.000 --> 00:27:12.000
They inherit the property of being independent from x and y.
00:27:12.000 --> 00:27:17.000
This is some type of, you may call it grandfathering, something like that.
00:27:17.000 --> 00:27:23.000
x and y being independent imply that any kind of continuous transformation of the variables
00:27:23.000 --> 00:27:30.000
x and y and random variables based on these type of transformations, so newly defined
00:27:30.000 --> 00:27:37.000
as being these transformations, are also independent random variables.
00:27:38.000 --> 00:27:44.000
In fact, what I have stated here, for those of you who have had some more statistics already
00:27:44.000 --> 00:27:53.000
in previous lectures, what I've stated here is just the most important aspect of this
00:27:53.000 --> 00:27:58.000
independence under transformations because it is not only the case that we have this
00:27:58.000 --> 00:28:04.000
independence being preserved under continuous transformations, but rather the independence
00:28:04.000 --> 00:28:11.000
is preserved under any measurable functions g and h, for those of you who know what a measurable
00:28:11.000 --> 00:28:12.000
function is.
00:28:12.000 --> 00:28:18.000
I will not use the concept of measurability in this lecture.
00:28:18.000 --> 00:28:25.000
It's kind of a technical concept, and we will not spend time on introducing it.
00:28:25.000 --> 00:28:28.000
Actually, we don't really need it in this lecture.
00:28:28.000 --> 00:28:34.000
Just for those of you who have some more in-depth knowledge of statistics and have heard of
00:28:34.000 --> 00:28:40.000
the concept of measurability, I would like to note that this property here is actually
00:28:40.000 --> 00:28:45.000
more general than I have currently stated it.
00:28:45.000 --> 00:28:51.000
For all the others, it is completely sufficient to know that we have the independence under
00:28:51.000 --> 00:28:56.000
continuous transformations, which are the type of transformations that we will use over
00:28:56.000 --> 00:28:59.000
and over again in this lecture.
00:28:59.000 --> 00:29:05.000
Perhaps you just remember that continuous functions are a special case of measurable
00:29:05.000 --> 00:29:08.000
functions.
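The inheritance property can be sketched in Python on a discrete example (marginals and the transformations are made up): if the joint pmf of (X, Y) factors, the joint pmf of (g(X), h(Y)) factors as well.

```python
# Independence under transformations, discrete sketch: if the joint pmf of
# (X, Y) factors into its marginals, so does the joint pmf of (g(X), h(Y)).

def factors(joint, tol=1e-12):
    """True if the joint pmf equals the product of its marginals."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return all(abs(p - px[x] * py[y]) < tol for (x, y), p in joint.items())

mx = {0: 0.4, 1: 0.3, 2: 0.3}                  # made-up marginal of X
my = {0: 0.5, 1: 0.5}                          # made-up marginal of Y
joint = {(x, y): mx[x] * my[y] for x in mx for y in my}

g = lambda x: x % 2                            # a non-injective transformation
h = lambda y: 2 * y + 1

gh_joint = {}
for (x, y), p in joint.items():                # joint pmf of (g(X), h(Y))
    key = (g(x), h(y))
    gh_joint[key] = gh_joint.get(key, 0.0) + p

assert factors(joint) and factors(gh_joint)    # independence is inherited
```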
00:29:08.000 --> 00:29:21.000
Now, let us use these properties along with the rules for variance calculation that we
00:29:21.000 --> 00:29:25.000
have covered in the last lecture.
00:29:25.000 --> 00:29:31.000
You recall that the variance of any random variable x is the same thing as the expectation
00:29:31.000 --> 00:29:35.000
of the squared deviation, x minus the expected value of x, squared.
00:29:35.000 --> 00:29:40.000
We used this already at the beginning of this lecture when we proved the Cauchy-Schwarz
00:29:40.000 --> 00:29:42.000
inequality.
00:29:42.000 --> 00:29:47.000
Therefore, we now can derive a couple of properties.
00:29:47.000 --> 00:29:54.000
For instance, we may say that if x is a random variable and a and b are constants, such that
00:29:54.000 --> 00:30:04.000
z is equal to a plus b times x, then obviously the variance of z is equal to b squared times
00:30:04.000 --> 00:30:07.000
the variance of x.
00:30:07.000 --> 00:30:18.000
That's rather easy to see, since a is just a constant.
00:30:19.000 --> 00:30:28.000
Now if we look at two random variables x and y again, and we have the variance of a times
00:30:28.000 --> 00:30:35.000
x plus b times y, where a and b are just any constants, any real numbers, then we know
00:30:35.000 --> 00:30:42.000
that this is equal to a squared times the variance of x plus b squared times the variance
00:30:42.000 --> 00:30:50.000
of y, so just the quadratic terms of this and this term here, plus 2 times a times b
00:30:50.000 --> 00:31:02.000
times the covariance of x and y, which then implies that the variance of a times x plus b times y in the
00:31:02.000 --> 00:31:10.000
case where x and y are independent is just the sum a squared times the variance of x
00:31:10.000 --> 00:31:15.000
plus b squared times the variance of y, because the covariance of x and y in the case where
00:31:15.000 --> 00:31:19.000
x and y are independent is 0.
00:31:19.000 --> 00:31:24.000
So we can forget about this last term here in the case where x and y are independent
00:31:24.000 --> 00:31:30.000
of each other, and then we know that the variance of a times x plus b times y, given that x
00:31:30.000 --> 00:31:35.000
and y are independent, is equal to a squared times the variance of x plus b squared times
00:31:35.000 --> 00:31:42.000
the variance of y.
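Both variance rules can be checked on made-up discrete variables; the Python sketch below builds the joint pmf of independent X and Y as the product of their marginals.

```python
# The two variance rules just derived, on made-up discrete pmfs:
# V(a + b*X) = b^2 V(X); for independent X, Y: V(aX + bY) = a^2 V(X) + b^2 V(Y).

def mean(vals, probs):
    return sum(p * v for v, p in zip(vals, probs))

def var(vals, probs):
    m = mean(vals, probs)
    return sum(p * (v - m) ** 2 for v, p in zip(vals, probs))

xs, pxs = [0, 1, 2], [0.2, 0.5, 0.3]    # hypothetical pmf of X
ys, pys = [0, 1],    [0.6, 0.4]         # hypothetical pmf of Y
a, b = 3.0, -2.0                        # arbitrary constants

zs = [a + b * x for x in xs]            # Z = a + b*X: the constant a drops out
assert abs(var(zs, pxs) - b ** 2 * var(xs, pxs)) < 1e-9

svals, sprobs = [], []                  # a*X + b*Y under independence
for x, px in zip(xs, pxs):
    for y, py in zip(ys, pys):
        svals.append(a * x + b * y)
        sprobs.append(px * py)          # product pmf = independence

expected = a ** 2 * var(xs, pxs) + b ** 2 * var(ys, pys)
assert abs(var(svals, sprobs) - expected) < 1e-9
```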
00:31:42.000 --> 00:31:46.000
Are there any questions so far?
00:31:46.000 --> 00:31:50.000
Please wave if you want to pose a question.
00:31:50.000 --> 00:31:56.000
I don't see that any questions have come in.
00:31:56.000 --> 00:32:00.000
I don't see anybody raising hands.
00:32:00.000 --> 00:32:05.000
Okay, good.
00:32:05.000 --> 00:32:09.000
Now we move on to conditional expectations.
00:32:09.000 --> 00:32:18.000
I told you already when we introduced the concept of a conditional PDF last week, that
00:32:18.000 --> 00:32:26.000
conditionality is an important concept for economists who want to do causal analysis,
00:32:26.000 --> 00:32:34.000
because conditioning helps us to understand causal effects in the real world, since we
00:32:34.000 --> 00:32:44.000
may say that if there is a variable y which takes on a specific value given another random
00:32:44.000 --> 00:32:52.000
variable x, so given the value another random variable x takes on, then we may study situations
00:32:52.000 --> 00:32:59.000
in which the conditioning value x changes and nothing else changes, to know
00:32:59.000 --> 00:33:09.000
how y will change when the conditioning value x changes, and we may then infer that x is
00:33:09.000 --> 00:33:17.000
causal for y, because the change in x has induced a certain change in y, so conditional
00:33:17.000 --> 00:33:23.000
expectations are something very useful, very informative, very important in statistical
00:33:23.000 --> 00:33:30.000
analysis and in economic analysis to establish causality, and causality is what the economist
00:33:30.000 --> 00:33:33.000
typically is interested in.
00:33:33.000 --> 00:33:39.000
Here is the definition of a conditional expectation.
00:33:39.000 --> 00:33:46.000
The conditional expected value of a random variable y conditional on a given value small
00:33:46.000 --> 00:33:56.000
x of this random variable x which conditions variable y, so the conditional expected value
00:33:56.000 --> 00:34:05.000
of y given some point x at which random variable capital X has settled, is a weighted average
00:34:05.000 --> 00:34:11.000
of all possible y values where the weights are determined by the conditional probability
00:34:11.000 --> 00:34:19.000
density function of y given that x assumes value small x.
00:34:19.000 --> 00:34:27.000
That is a verbal definition which I have given here; it may not be so easy to digest and
00:34:27.000 --> 00:34:36.000
understand it completely, so it is certainly more helpful to look at it in a formal statement.
00:34:36.000 --> 00:34:41.000
Here I distinguish again between discrete and continuous random variables, let's start
00:34:41.000 --> 00:34:50.000
with the discrete case where y is discrete and x is discrete, so suppose y is discrete
00:34:50.000 --> 00:34:58.000
with k values, so y can take on either small y1 or small y2 and so on up to
00:34:58.000 --> 00:35:07.000
small y index k, then we say that the expectation of random variable y given that random variable
00:35:07.000 --> 00:35:20.000
x takes on a specific value small x is y1 times the conditional pdf of y given x at y1 given small
00:35:20.000 --> 00:35:30.000
x, plus y2, the second possible value
00:35:30.000 --> 00:35:42.000
for y, times the conditional pdf of y given x at y2 given small x, and so on; these
00:35:42.000 --> 00:35:50.000
types of terms you just have to
00:35:50.000 --> 00:35:57.000
add up over all possible values y1 up to yk which y can assume.
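This weighted sum can be sketched in code. The joint probabilities below are invented for illustration: dividing each joint probability by the marginal probability of X at small x gives the conditional pdf, and the weighted average of the y values is the conditional expectation.

```python
# Sketch: E[Y | X = x] = sum_j y_j * f_{Y|X}(y_j | x) for discrete Y.
# The joint pmf below is a made-up example, not from the lecture.

joint = {  # (x, y): P(X = x, Y = y)
    (0, 1): 0.10, (0, 2): 0.30,
    (1, 1): 0.40, (1, 2): 0.20,
}

def cond_expectation(joint, x):
    # Marginal probability P(X = x).
    px = sum(p for (xv, _), p in joint.items() if xv == x)
    # Weighted average of y values, weights = conditional pdf P(Y=y | X=x).
    return sum(y * (p / px) for (xv, y), p in joint.items() if xv == x)

print(cond_expectation(joint, 0))  # 1*(0.1/0.4) + 2*(0.3/0.4) = 1.75
print(cond_expectation(joint, 1))  # 1*(0.4/0.6) + 2*(0.2/0.6) = 4/3
```

Note that the result is a different real number for each conditioning value small x, which is exactly the point made below about the conditional expectation being a function of x.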
00:35:57.000 --> 00:36:07.000
So the expectation of y given that x takes on value small x is just the sum over yj times
00:36:07.000 --> 00:36:17.000
the pdf of y given x at the point yj given small x, that's the exact definition of a
00:36:17.000 --> 00:36:25.000
conditional expectation in the case where y is discrete and x has taken on a particular
00:36:25.000 --> 00:36:36.000
value x. Actually this definition would even be correct if random variable x were continuous,
00:36:36.000 --> 00:36:42.000
so you can combine the concept of discrete and continuous random variables here in point
00:36:42.000 --> 00:36:49.000
a of this definition it is only important that y is discrete, then we operate with
00:36:49.000 --> 00:36:58.000
sums here and with a discrete range of values which y can assume but for x it doesn't really
00:36:58.000 --> 00:37:06.000
depend on whether x is discrete or continuous. Now suppose that y is continuous then obviously
00:37:06.000 --> 00:37:12.000
we have the same idea as in the discrete case of what the conditional expected value is
00:37:12.000 --> 00:37:21.000
just we replace the sigma sign here the sum over k possible values which y can assume
00:37:21.000 --> 00:37:28.000
by an integral, otherwise it's exactly the same. The expectation of y given that capital x is equal
00:37:28.000 --> 00:37:39.000
to a specific value small x is the integral over y times the pdf of y given x at point
00:37:39.000 --> 00:37:46.000
small y given small x and this integral runs over all possible values of small y.
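As a sketch of the continuous case, one can approximate the integral numerically. The joint density used here is an assumption chosen only because it has a simple closed form, namely f(x, y) = x + y on the unit square; it is not an example from the lecture.

```python
# Sketch: E[Y | X = x] = integral of y * f_{Y|X}(y | x) dy, via a midpoint sum.
# Assumed illustrative density: f(x, y) = x + y on [0, 1] x [0, 1].

def f_joint(x, y):
    return x + y  # a valid joint density on the unit square (integrates to 1)

def cond_expectation(x, n=100_000):
    # Marginal: f_X(x) = integral_0^1 (x + y) dy = x + 1/2,
    # so the conditional density is f_{Y|X}(y | x) = (x + y) / (x + 1/2).
    fx = x + 0.5
    h = 1.0 / n
    # Midpoint rule for integral_0^1 y * f_{Y|X}(y | x) dy.
    return sum((i + 0.5) * h * f_joint(x, (i + 0.5) * h) / fx
               for i in range(n)) * h

x = 0.5
analytic = (x / 2 + 1 / 3) / (x + 0.5)  # closed form for this density
print(cond_expectation(x), analytic)     # both approximately 0.58333
```

The numerical sum and the closed form agree, which mirrors the statement that the integral is just a summation over a continuous domain.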
00:37:47.000 --> 00:37:57.000
So again read this integral sign here just as summation in a continuous domain and then
00:37:57.000 --> 00:38:04.000
you have basically the same idea here as you have expressed it in the discrete case so that's the
00:38:04.000 --> 00:38:11.000
conditional expectation. Now note the conditional expectation
00:38:12.000 --> 00:38:22.000
e of y given that x assumes a specific value small x is a real number like we had it for the
00:38:22.000 --> 00:38:28.000
unconditional expectation. The expectation of capital x was always a real number, the expectation
00:38:28.000 --> 00:38:35.000
of capital y was always a real number, so expectations are not random variables
00:38:35.000 --> 00:38:44.000
anymore and here the same thing is true given that variable x random variable x has assumed
00:38:44.000 --> 00:38:52.000
some specific unique number small x the expectation of y is a real number.
00:38:52.000 --> 00:39:00.000
Actually this real number which the expectation of y given that x is equal to small x takes on,
00:39:01.000 --> 00:39:07.000
this expectation is a function of small x so the expectation depends of course
00:39:08.000 --> 00:39:15.000
at least in general on the value small x. Therefore we may write the expectation of y
00:39:15.000 --> 00:39:24.000
given x is equal to small x as m index y of x so we can think of the conditional expectation of
00:39:24.000 --> 00:39:33.000
being a function of x it changes when small x changes so that's a function of x.
00:39:36.000 --> 00:39:42.000
Of course, the small x may be just any number in the range of the random variable capital x.
00:39:42.000 --> 00:39:50.000
You have to distinguish this concept of a conditional expectation for x taking on a specific
00:39:50.000 --> 00:40:01.000
value small x from a very closely related concept where y just depends on a random variable
00:40:01.000 --> 00:40:09.000
x without us specifying which value of small x this random variable x takes on.
00:40:10.000 --> 00:40:19.000
So we may rather than conditioning on a specific value of x sometimes want to emphasize that
00:40:19.000 --> 00:40:26.000
any specific x is the result of a random event because small x is of course also the result of
00:40:26.000 --> 00:40:33.000
a random event and therefore we may want to condition on a random variable rather than
00:40:33.000 --> 00:40:43.000
conditioning on a specific realization of this random variable. So this idea here suggests to
00:40:43.000 --> 00:40:53.000
condition the expectation of y in some cases on the random variable x rather than on a real number
00:40:53.000 --> 00:40:59.000
small x which means that symbolically we may think about defining something like the expectation
00:40:59.000 --> 00:41:12.000
of y given the random variable x, but not specifying where x has materialized, that is, what kind
00:41:12.000 --> 00:41:20.000
of realization capital x has taken on.
00:41:21.000 --> 00:41:27.000
So we sometimes will use this concept here which is a completely different concept from the concept
00:41:27.000 --> 00:41:34.000
up here because this conditional expectation where x has realized at some specific x
00:41:35.000 --> 00:41:41.000
this concept is a real number whereas this concept here would still be a random variable
00:41:42.000 --> 00:41:49.000
right the expectation of y given random variable x is a random variable
00:41:49.000 --> 00:41:59.000
since we do not know yet which realization x takes therefore the expectation of y depends
00:41:59.000 --> 00:42:07.000
on the realization of x and is therefore by itself still a random variable. We may actually use the
00:42:07.000 --> 00:42:16.000
same function m y as up here but now we would not have a real number small x as the function's
00:42:16.000 --> 00:42:22.000
argument but rather we would have the random variable capital x as the function's argument
00:42:22.000 --> 00:42:29.000
and in this case we would have a function of a random variable here which would itself be
00:42:29.000 --> 00:42:36.000
a random variable. In most cases this m y would be a continuous function so that we may use the
00:42:36.000 --> 00:42:43.000
properties of continuous functions transforming random variables
00:42:45.000 --> 00:42:51.000
and may derive properties for the expectation of y given random variable x.
00:42:52.000 --> 00:42:57.000
Now I think I saw comments coming in here in the chat let me have a look.
00:42:57.000 --> 00:43:15.000
I don't see the chat perhaps I have to get out of the full screen in order to view it.
00:43:15.000 --> 00:43:28.000
Sorry I don't see the chat.
00:43:37.000 --> 00:43:38.000
I just don't see it.
00:43:38.000 --> 00:43:43.000
Ah here's the chat.
00:43:53.000 --> 00:43:59.000
Somebody's asking about the exam. I mean please everything which I do in the lecture
00:43:59.000 --> 00:44:06.000
is the possible basis of exam questions. On the other hand any exam questions
00:44:06.000 --> 00:44:13.000
needs to be formulated in a way that you can solve it in an appropriate amount of time.
00:44:14.000 --> 00:44:23.000
So please, I cannot promise you that certain parts of the lecture are not relevant for the exam
00:44:24.000 --> 00:44:28.000
but I can assure you that I cannot ask you everything which I cover in the lecture.
00:44:28.000 --> 00:44:38.000
What I expect is that you understand what I have shown you in this lecture
00:44:39.000 --> 00:44:46.000
and that in the case of easy problems you can apply this in the exam. There is not
00:44:46.000 --> 00:44:52.000
more that I can tell you about this.
00:44:52.000 --> 00:44:59.000
Is it possible to have examples with these formulas in the lecture?
00:45:00.000 --> 00:45:06.000
I will give you some examples in the lectures but I have too many formulas to give you an example
00:45:06.000 --> 00:45:14.000
on everything. Actually where I'm currently at is still a review of what you should already know
00:45:14.000 --> 00:45:23.000
from your undergraduate education so I will be rather scarce in examples here. Later when I
00:45:23.000 --> 00:45:31.000
finish the review chapters I will provide more examples than I currently do. If you have trouble
00:45:31.000 --> 00:45:40.000
understanding the formula that I give to you then please consult a textbook on these things
00:45:41.000 --> 00:45:49.000
or take some of the questions you may have to the tutorial. Wunderdusken will be happy to discuss
00:45:49.000 --> 00:45:59.000
with you examples as much as he has time to do it which is of course also limited. So do not expect
00:45:59.000 --> 00:46:07.000
that we can provide examples on everything I cover in the lecture. You are now in a master's
00:46:07.000 --> 00:46:15.000
class and we expect you to also be able to work on your own and think of examples. I have given
00:46:15.000 --> 00:46:23.000
to you in the last lecture the coin tossing example. I will come back to this example later
00:46:23.000 --> 00:46:31.000
on in this lecture and use it to visualize some of the concepts that I have given to you here
00:46:31.000 --> 00:46:41.000
but not in every respect. So I ask you to think about the coin tossing example yourself and
00:46:41.000 --> 00:46:48.000
perhaps try to get more intuition of what I present to you if you lack intuition currently
00:46:49.000 --> 00:46:56.000
by looking at very easy examples like a discrete random variable of three tosses of a coin and ask
00:46:56.000 --> 00:47:02.000
yourself then well what is a conditional expectation and what are independent random
00:47:02.000 --> 00:47:10.000
variables and these kinds of questions to illustrate what I've done here if it is not
00:47:10.000 --> 00:47:20.000
clear to you by merely having the formulas. Yes another question on are we also supposed to do
00:47:20.000 --> 00:47:28.000
the proofs in the exam. Please do not ask me this question again; once and for all:
00:47:28.000 --> 00:47:36.000
Yes I expect you to understand the proofs. That is not to say that I will ask you difficult proofs
00:47:36.000 --> 00:47:43.000
in the exam but I may ask you easy proofs or parts of easy proofs in the exam. It is very
00:47:43.000 --> 00:47:50.000
possible that I take a proof which I have presented in the lecture and ask you to do
00:47:50.000 --> 00:47:58.000
just a part of it in an exam. Not more than what you can handle in terms of time in an exam but it
00:47:58.000 --> 00:48:03.000
is not the case certainly not that I allow you to just skip the proofs saying well these are just
00:48:03.000 --> 00:48:10.000
proofs this is not part of a scientific education. Part of a scientific education is that you
00:48:10.000 --> 00:48:17.000
understand why things are true and not learn just by heart why things are true. So last time
00:48:17.000 --> 00:48:24.000
I comment on this type of question: I expect you not only to know the proofs but also to understand
00:48:24.000 --> 00:48:34.000
the proofs and to be able to reproduce the proofs at least in as much as is possible to ask you that
00:48:34.000 --> 00:48:37.000
in a final exam with limited time.
00:48:41.000 --> 00:48:48.000
Okay then we can go back to the slides.
00:48:56.000 --> 00:49:04.000
Let me put this aside. Good. As I noted there are two different concepts of conditional expectations.
00:49:04.000 --> 00:49:09.000
Either conditional expectations as real numbers or conditional expectations
00:49:09.000 --> 00:49:16.000
as random variables depending on whether you specify that the variable x the random variable
00:49:16.000 --> 00:49:22.000
x has already taken on a specific realization. In this case it's a real number or the random
00:49:22.000 --> 00:49:30.000
variable x having not yet taken a specific realization then the conditional expectation
00:49:30.000 --> 00:49:33.000
of y given x is itself a random variable.
00:49:38.000 --> 00:49:49.000
Okay if we look at the concept that we have the conditional expectation of y given random variable
00:49:49.000 --> 00:49:57.000
x as some function m y of x please note that this is just an intuition of what the conditional
00:49:57.000 --> 00:50:05.000
expectation of y given a random variable x is. It is not a definition it's an intuition.
00:50:06.000 --> 00:50:15.000
It is not possible to simply replace in our definition of a conditional expectation
00:50:15.000 --> 00:50:25.000
given that x has taken on a specific value small x, by replacing the specific value small x
00:50:26.000 --> 00:50:33.000
with a capital x. You may think perhaps it's just analogous. Unfortunately it is not because the
00:50:33.000 --> 00:50:41.000
definition of the conditional expectation for x having a specific realization small x
00:50:42.000 --> 00:50:53.000
involves the conditional density f of y given x at a particular point y given small x. It is not
00:50:53.000 --> 00:51:00.000
possible to replace this expression here which we have used in the definition by this expression
00:51:00.000 --> 00:51:09.000
here f of y given x for y given a random variable x because this here is actually not a meaningful
00:51:09.000 --> 00:51:17.000
expression at all. This type of expression does not exist, and therefore it is not so easy
00:51:17.000 --> 00:51:27.000
to formally define the expectation of y given x. This is only possible to do with probability
00:51:28.000 --> 00:51:34.000
theory elements which we have not covered in this lecture which are actually clearly beyond the
00:51:34.000 --> 00:51:41.000
scope of this lecture. So I will not give this definition and obviously I would not ask you for it
00:51:41.000 --> 00:51:48.000
in the exam but you should know that there is of course an exact definition for this concept of an
00:51:48.000 --> 00:52:00.000
expected value y given x much as we have had the easier definition of y given x taking on a
00:52:00.000 --> 00:52:13.000
specific value small x so that there is this exact definition but for our purposes it is
00:52:13.000 --> 00:52:23.000
sufficient that we just provide some type of intuition on what this expected value is namely
00:52:23.000 --> 00:52:29.000
a transformation the function of some random variable capital x. So if you understand that
00:52:29.000 --> 00:52:39.000
this will be enough for this lecture. There are a number of rules which you should know
00:52:40.000 --> 00:52:46.000
which you should be able to reproduce and which you should be able to apply in cases. They will be
00:52:46.000 --> 00:52:54.000
used throughout the lecture, so the best thing is you just learn them by heart. First rule: if x and y
00:52:54.000 --> 00:53:00.000
are independent random variables then the conditional expectation of y given random
00:53:00.000 --> 00:53:07.000
variable x so the second concept of conditional expectation of y is just the expectation of y.
00:53:09.000 --> 00:53:19.000
This I think is actually quite an intuitive property if x and y are independent then they
00:53:19.000 --> 00:53:28.000
have no relation to each other so having some random variable x as given doesn't really change
00:53:28.000 --> 00:53:37.000
our expectation of y because x and y don't have any relation to each other so the expected value
00:53:37.000 --> 00:53:44.000
of y given x should be equal to, and actually is equal to, the expectation of y, the
00:53:44.000 --> 00:53:57.000
unconditional expectation of y. Second, for any function g, the expectation of g
00:53:57.000 --> 00:54:09.000
of x given x is just g of x right so if we have the expectation of a transformation of random
00:54:09.000 --> 00:54:20.000
variable x given random variable x then this is just this transformation of x because we suppose
00:54:20.000 --> 00:54:30.000
that random variable x is given. Further, property three: if we have the expectation
00:54:30.000 --> 00:54:43.000
of g of x times random variable y plus h of x given x then given random variable x h of x is by
00:54:44.000 --> 00:54:53.000
rule two or the expectation of h of x given x is just h of x and the expectation of g of x
00:54:53.000 --> 00:55:03.000
given x is just g of x so the expectation of g of x times y plus h of x given x
00:55:04.000 --> 00:55:10.000
is a term where we can sort of pull out of the expectation operator everything which is related
00:55:10.000 --> 00:55:20.000
to x so it is g of x times the expectation of y given x plus h of x or to put it differently
00:55:21.000 --> 00:55:29.000
you may just pull the expectation operator in this expression here knowing that expectation
00:55:29.000 --> 00:55:39.000
of g of x given x is just g of x times then the expectation of y given x which is this thing here
00:55:40.000 --> 00:55:46.000
and then the expectation of h of x given x is from property two just h of x.
00:55:50.000 --> 00:55:56.000
These are actually special cases of what we call the law of iterated expectations and I will
00:55:56.000 --> 00:56:05.000
abbreviate this LIE. The law of iterated expectation says if we take the expectation
00:56:05.000 --> 00:56:12.000
of a conditional expectation and the conditional expectation is conditional on a random variable
00:56:12.000 --> 00:56:18.000
not on a specific value of x so if we take the expectation of a conditional expectation which
00:56:18.000 --> 00:56:24.000
is itself a random variable then this is just the unconditional expectation of y.
00:56:27.000 --> 00:56:37.000
So we have here a succession of expected values. First we take the expectation of y given x
00:56:38.000 --> 00:56:49.000
and we know that the expectation of y given x is just some m y of x, just some transformation
00:56:49.000 --> 00:57:01.000
of the random variable x. Taking the expectation of this m y of x gives us then the unconditional
00:57:01.000 --> 00:57:09.000
expectation of y just as if we never had had any knowledge on random variable x conditioning random
00:57:09.000 --> 00:57:15.000
variable y. This law of iterated expectations is a very important law which we will use
00:57:16.000 --> 00:57:25.000
very often in the lecture to compute properties of estimators. This is very helpful actually to break
00:57:25.000 --> 00:57:34.000
up an expected value into several steps at least two steps typically and then first solve one
00:57:34.000 --> 00:57:42.000
expectation and then solve the second expectation. So you should know this law of iterated expectations well.
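The two-step structure of the law of iterated expectations can be sketched as follows, again on a made-up discrete joint pmf: first compute the inner conditional expectation as a function m y of x, then average it over the distribution of X, and compare with the unconditional expectation of Y.

```python
# Sketch: verify the law of iterated expectations, E[ E[Y|X] ] = E[Y],
# on a made-up discrete joint pmf (illustrative values only).

joint = {(0, 1): 0.10, (0, 2): 0.30, (1, 1): 0.40, (1, 2): 0.20}

xs = sorted({x for x, _ in joint})

def p_x(x):
    # Marginal probability P(X = x).
    return sum(p for (xv, _), p in joint.items() if xv == x)

def m_y(x):
    # m_y(x) = E[Y | X = x], the conditional expectation as a function of x.
    return sum(y * p / p_x(x) for (xv, y), p in joint.items() if xv == x)

# Outer expectation: average the inner conditional expectation over X.
lie = sum(m_y(x) * p_x(x) for x in xs)
# Unconditional expectation of Y, computed directly from the joint pmf.
ey = sum(y * p for (_, y), p in joint.items())
assert abs(lie - ey) < 1e-12
```

Breaking the expectation into these two steps is exactly the manoeuvre used later to compute properties of estimators.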
00:57:46.000 --> 00:57:54.000
I will give you here the proof of this law of iterated expectations for the discrete case
00:57:54.000 --> 00:58:01.000
and that will actually be the last thing which I do in this lecture because we wanted to have some
00:58:01.000 --> 00:58:09.000
Q&A in the last minutes so let me just go through the proof for the discrete case.
00:58:11.000 --> 00:58:15.000
The proof for the continuous case I will not give but it is basically analogous.
00:58:16.000 --> 00:58:23.000
Suppose that we have two random variables x and y which are both discrete and let's suppose that
00:58:23.000 --> 00:58:34.000
x takes on values x1 up to xl and y takes on values y1 up to yk respectively
00:58:34.000 --> 00:58:45.000
so very standard. Now we may think about what is the expectation of the expectation of y given x.
00:58:46.000 --> 00:58:58.000
Well, the first thing we do is look at the values which variable x can take on.
00:58:58.000 --> 00:59:07.000
So this outer expectation here is the expectation of this inner expectation.
00:59:07.000 --> 00:59:18.000
The inner expectation is the expectation of y given that x takes on a certain value xj
00:59:19.000 --> 00:59:29.000
times the probability of x taking on this value xj and we have to compute this product here
00:59:29.000 --> 00:59:37.000
the conditional expectation of y given that x takes on value xj multiplied with the respective
00:59:37.000 --> 00:59:46.000
probability of x taking on value xj we have to compute this product here for all possible values
00:59:46.000 --> 00:59:54.000
that x can take on. So x can take on l different values, right, up here, l values x1 to xl,
00:59:55.000 --> 01:00:01.000
so this is the sum from j is equal to 1 up to l over these product terms here.
01:00:04.000 --> 01:00:09.000
Now let's look at the expectation of y given that x is equal to xj
01:00:11.000 --> 01:00:20.000
so this is what we have here in this bracket this is the expectation of y given that x is equal to
01:00:20.000 --> 01:00:29.000
xj well this is again a sum now going over all the k values that y may take on y may take on
01:00:30.000 --> 01:00:38.000
k different values and i index them here with i so this would be yi times the probability
01:00:38.000 --> 01:00:49.000
that y capital y the random variable y is equal to yi given that x is equal to xj
01:00:50.000 --> 01:01:01.000
because here we have x fixed at the value small xj or we have it in here so we have this product
01:01:01.000 --> 01:01:07.000
here of the value being taken on by the random variable times the associated probability
01:01:09.000 --> 01:01:17.000
okay now if we take this together here then we have the probability of y being equal to yi
01:01:17.000 --> 01:01:23.000
given x is equal to xj times the probability of x being equal to xj well this is just the
01:01:23.000 --> 01:01:32.000
probability of the joint event y being equal to yi and x being equal to xj and this of course
01:01:32.000 --> 01:01:44.000
is just the expected value E of y, as you can see now, because we can take this joint
01:01:45.000 --> 01:01:52.000
probability here and rewrite it in a sort of crosswise way
01:01:54.000 --> 01:02:00.000
here we had probability of y given x times probability of x is equal to the joint
01:02:00.000 --> 01:02:07.000
probability here that's the same thing as the probability of x given y times the probability
01:02:07.000 --> 01:02:17.000
of y taking on a particular value yi so we can also write the joint probability as the conditional
01:02:17.000 --> 01:02:24.000
probability times the marginal probability with x and y having changed position here
01:02:26.000 --> 01:02:35.000
well and what is this we can put the p of y equal to yi in the first position here because
01:02:35.000 --> 01:02:46.000
this is just multiplication so this thing here does not depend on j so basically we switch the
01:02:46.000 --> 01:02:53.000
order of the summation signs now we had the outer sign running over j and the inner sign running
01:02:53.000 --> 01:02:59.000
over i now we switch the order of the signs which we can always do if there are finitely many terms
01:02:59.000 --> 01:03:08.000
to be summed so this is the sum over the i's and now collecting all the terms which do not depend
01:03:08.000 --> 01:03:19.000
on j yi times p of y is equal to small yi so this yi here times p of y equal to small yi
01:03:20.000 --> 01:03:29.000
and then the sum of the probabilities which depend on j well we see this is just this term here the
01:03:29.000 --> 01:03:39.000
conditional probabilities of x given y being summed over all j given some yi; well, this sum here is equal to 1
01:03:39.000 --> 01:03:47.000
because we just sum all the probabilities for one particular case so we just have the sum over the yi's
01:03:47.000 --> 01:03:55.000
times probability of y being equal to yi and that is the expected value of y and this completes the
01:03:55.000 --> 01:04:07.000
proof. I am aware that this proof may seem technical and difficult, but I encourage you to
01:04:07.000 --> 01:04:15.000
look at it in detail at home and really reproduce each single step in the proof given the properties
01:04:15.000 --> 01:04:23.000
that we have established previously in this lecture then it should not be difficult to follow
01:04:23.000 --> 01:04:30.000
the proof and understand the proof of the law of iterated expectations.
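For reference, the chain of steps just walked through can be written compactly in standard notation (same symbols as in the lecture: X takes values x1 to xl, Y takes values y1 to yk):

```latex
\begin{aligned}
E\bigl[E[Y\mid X]\bigr]
  &= \sum_{j=1}^{l} E[Y \mid X = x_j]\, P(X = x_j) \\
  &= \sum_{j=1}^{l} \sum_{i=1}^{k} y_i\, P(Y = y_i \mid X = x_j)\, P(X = x_j) \\
  &= \sum_{i=1}^{k} \sum_{j=1}^{l} y_i\, P(Y = y_i,\, X = x_j) \\
  &= \sum_{i=1}^{k} y_i\, P(Y = y_i) \underbrace{\sum_{j=1}^{l} P(X = x_j \mid Y = y_i)}_{=\,1} \\
  &= \sum_{i=1}^{k} y_i\, P(Y = y_i) = E[Y].
\end{aligned}
```

The third line uses the definition of conditional probability to merge the two factors into a joint probability, and the fourth line factors it the other way around before the conditional probabilities sum to one.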
01:04:31.000 --> 01:04:38.000
As I told you, we will make use of this law very often, so it is important that you understand
01:04:39.000 --> 01:04:47.000
it well that you understand its meaning well and that you are able to apply it correctly in settings
01:04:47.000 --> 01:04:54.000
which require much attention sometimes in terms of formal representation in order not to mix the
01:04:54.000 --> 01:05:02.000
y's and x's, but it is a very helpful property and therefore, as I said, it is often used in proofs,
01:05:02.000 --> 01:05:09.000
say, of bias or unbiasedness of estimators, so you'll see it several times in this lecture.
01:05:09.000 --> 01:05:20.000
I will stop here, and now we have some minutes left over for