WEBVTT - autoGenerated
00:00:00.000 --> 00:00:07.000
So the recording should be on now.
00:00:07.000 --> 00:00:14.000
And where is my slide here?
00:00:14.000 --> 00:00:16.000
Okay, thanks a lot.
00:00:16.000 --> 00:00:17.000
Good.
00:00:17.000 --> 00:00:39.000
So, the estimate of the price effect of a garbage incinerator being built in North Andover would be the difference between the two coefficients that we obtain from simple regressions. Let me say it again for the recording, which wasn't on at the beginning.
00:00:39.000 --> 00:01:00.000
We would take this estimate as the estimate which expresses by how much houses were less expensive in 1981 if they were close to the incinerator and subtract from this estimate that estimate which expresses by how much houses were less expensive than other
00:01:00.000 --> 00:01:20.000
houses in 1978, if they were close to the site where it would be revealed a little later that an incinerator was to be built, and this difference is about $11,000 or $12,000 as we have seen. So it is the difference between estimates, and we call this difference
00:01:20.000 --> 00:01:35.000
delta 1 hat, so we now have an estimate of the effect of the incinerator having been built in North Andover on houses close to the incinerator site.
00:01:35.000 --> 00:01:46.000
We don't know whether this estimate is significant, unless we can come out with a theory on what the standard error of this estimate would be.
00:01:46.000 --> 00:02:11.000
And this we dealt with by combining the two estimates in a system of equations. We did this last time, and I showed you that we can actually estimate this delta 1 we are interested in via a somewhat reparameterized regression, where we have made changes to the matrix of regressors and to the
00:02:11.000 --> 00:02:35.000
vector of coefficients to obtain the coefficient of interest here: the delta 1, the difference between the effect on house prices of being close to the incinerator site in 1981 and the corresponding effect in 1978.
00:02:35.000 --> 00:03:00.000
This is this thing, and this just required us to do some reshuffling of the regressor matrix here, so that using this regression now in which we stack both the 78 prices and the 81 prices as dependent variables, and then use this regressor matrix with four coefficients,
00:03:00.000 --> 00:03:19.000
we would get as the fourth coefficient our delta 1 and we would get the usual standard error there. So this was how far we came and I told you that this estimator is called the difference in differences or DID estimator.
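To make the stacked regression concrete, here is a minimal numerical sketch in Python; the prices are made up and the variable names (p78, p81, near) are mine, following the lecture's notation:

```python
# Stacked DID regression as in the lecture's equation (7), on made-up data.
import numpy as np

# Hypothetical prices (thousands of $); the first two houses in each year
# are near the incinerator site.
p78 = np.array([60.0, 62.0, 80.0, 82.0])
p81 = np.array([70.0, 72.0, 100.0, 102.0])
near = np.array([1.0, 1.0, 0.0, 0.0])   # location dummy within one year

p   = np.concatenate([p78, p81])        # stacked dependent variable
d81 = np.concatenate([np.zeros(4), np.ones(4)])   # time dummy
nr  = np.concatenate([near, near])                # location dummy, stacked
X   = np.column_stack([np.ones(8), nr, d81, nr * d81])  # four regressors

beta_hat, *_ = np.linalg.lstsq(X, p, rcond=None)
delta1_hat = beta_hat[3]                # the fourth coefficient: delta 1 hat

# The same number as the difference of the two simple-regression coefficients:
did_by_hand = (p81[near == 1].mean() - p81[near == 0].mean()) \
            - (p78[near == 1].mean() - p78[near == 0].mean())
print(delta1_hat, did_by_hand)          # both about -10 here
```

Running the regression and taking the difference of the two cross-sectional coefficients give exactly the same number, which is the point of the reparameterization.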
00:03:19.000 --> 00:03:40.000
Now what I already pointed out last time was then that the regressor which is associated with this coefficient here is the last column in this matrix which has zero in the upper block or has lots of zeros in the
00:03:40.000 --> 00:04:02.000
1978 block of prices and has the dummy variable for the closeness to the incinerator site in the 1981 block of housing prices, so that it turns out that this regressor is actually the same thing as multiplying the components of this vector here
00:04:02.000 --> 00:04:14.000
with the components of that vector, because then we would multiply the dummy variable here with lots of zeros.
00:04:14.000 --> 00:04:31.000
So this becomes a zero, and we would multiply the dummy variable n81 here with lots of ones, which makes this again n81. So this is the dummy variable which then tells us whether the house was close to the incinerator or not.
00:04:32.000 --> 00:04:59.000
And therefore, what happens is actually that the DID regression takes on form number seven here, we take the dependent variable which is the stacked vector of dependent observations from both points in time, in our case, prices of 1978 and prices of 1981 and regress this on a constant, regress this on a
00:04:59.000 --> 00:05:16.000
dummy variable which indicates the treatment, regress this on a dummy variable which indicates the time period for which we do the regression so 1978 or 1981.
00:05:16.000 --> 00:05:29.000
So this would be one in the case of 1981 and would be zero in the case of 1978, so it's a simple time dummy, and then regress it on the product of the two regressors before.
00:05:29.000 --> 00:05:35.000
So the product of the time dummy with the treatment dummy.
00:05:35.000 --> 00:05:58.000
And this would give us the estimate for delta one, for the difference in differences. Difference in differences because we look at the difference between house prices close to the incinerator and house prices far off the incinerator and compare how this difference has changed over time,
00:05:58.000 --> 00:06:09.000
so we look at the difference over time of the differences in location. That's why it's the difference-in-differences estimator.
00:06:09.000 --> 00:06:24.000
Now as an exercise, please check that the representation we have, or the representation I just gave you in equation seven, corresponds exactly to the vector representation in equation six which I showed you before.
00:06:24.000 --> 00:06:48.000
So, make it clear to yourselves that what I have written here as the regression equation for the difference-in-differences estimator is the same thing as this regression six here, once we have understood the different notations well.
00:06:48.000 --> 00:07:14.000
We will now look at the difference-in-differences estimator a little bit more systematically by discussing its properties, or actually just by showing that the difference-in-differences estimator is very easily computed as an estimator relying just on simple averages of the observations.
00:07:14.000 --> 00:07:35.000
And we need to understand which averages these are. So the first thing in understanding why these are just averages over observations, and nothing really complicated, is that you know the regressor matrix which we have established in equation six.
00:07:35.000 --> 00:07:44.000
And also in the other equations I've been discussing, we had regressor matrices which consisted just of zeros and ones.
00:07:44.000 --> 00:07:58.000
Right, if I go back to this representation here, you see the first vector is just a vector of ones. The second vector is just a vector of zeros and ones, because this is a dummy variable indicating closeness to the incinerator.
00:07:58.000 --> 00:08:09.000
Here we have zeros and ones, and here we have zeros and then again zeros and ones. So the regressor matrix is just a matrix consisting of zeros and ones.
00:08:09.000 --> 00:08:33.000
Now when we multiply the regressor matrix with the dependent variable, the zeros kick out certain observations and the ones sum the remaining observations, and the sum over certain observations is easily transformed into an average.
00:08:33.000 --> 00:08:48.000
To see this systematically, recall that when we have a usual regression equation like y equals X beta plus u, the regressor matrix X is always orthogonal to the estimated residuals u hat.
00:08:48.000 --> 00:08:58.000
So we know that X prime u hat is always equal to zero, which we derived; I reminded you of this property in the review of basic econometrics.
00:08:58.000 --> 00:09:14.000
For u hat we can also write this projector expression here, (I minus X X plus) times y, so this thing is equal to zero, and therefore we know that each column of X is orthogonal to u hat.
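This orthogonality is easy to check numerically; a small sketch with arbitrary made-up data:

```python
# Sketch: verify that X' u_hat = 0 for an OLS fit, the property used below.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20),
                     rng.integers(0, 2, 20).astype(float)])  # constant + dummy
y = rng.normal(size=20)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
u_hat = y - X @ beta_hat          # estimated residuals
print(X.T @ u_hat)                # numerically zero for every column of X
```

Every column of X, the constant as well as the dummy, is orthogonal to the residual vector up to floating-point precision.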
00:09:14.000 --> 00:09:30.000
Not only the constant is orthogonal to u hat, but also each other column is orthogonal to u hat, and therefore also the columns in our regressor matrix, which consist partially of dummy variables.
00:09:30.000 --> 00:09:53.000
Now let us apply this to the results of the 1978 regression only. So this was the regression, the sort of naive regression, where we regressed the prices of 1978 on a constant and on a dummy variable which indicated the location of the houses,
00:09:53.000 --> 00:10:01.000
so whether they stood where later on the incinerator would be built.
00:10:01.000 --> 00:10:12.000
The regression result is given here; I have written it in terms of the estimated coefficients, so these are the gamma hats, and here are the estimated residuals u hat.
00:10:12.000 --> 00:10:31.000
Now when I just sum all the prices here by pre-multiplying with iota, the vector of ones, so iota prime p 78 just sums all the prices of 1978, then apparently I pre-multiply this expression here just by iota prime.
00:10:31.000 --> 00:10:46.000
So I have iota prime times the regressor matrix, which is (iota, n78), times the estimated coefficients, plus zero, because iota prime is orthogonal to u hat.
00:10:46.000 --> 00:10:58.000
Okay, so iota prime iota is of course just the number of observations, so this just counts how many observations we have, so this is equal to n.
00:10:58.000 --> 00:11:13.000
And iota prime times n78 counts in just the same way all the houses which are close to the incinerator site. So this is the number of houses which are near the incinerator.
00:11:13.000 --> 00:11:26.000
This is why I use the index nr for near here. So we have n and nr times well the two estimated coefficients here.
00:11:27.000 --> 00:11:42.000
So this is now a one-by-two matrix, just one row and two columns, times one column with two rows. So this gives a scalar: we have n times gamma hat
00:11:43.000 --> 00:11:53.000
zero of 78 plus n nr times gamma hat one of 78. That's the same thing as iota prime p 78.
00:11:53.000 --> 00:12:12.000
If we pre-multiply equation three by n78 prime, so by the dummy variable which just consists of zeros and ones, then we can proceed in just the same way.
00:12:12.000 --> 00:12:24.000
So here I have n78 prime, which is the dummy variable which now sums all the prices which refer to houses close to the incinerator site.
00:12:24.000 --> 00:12:43.000
That would be the same thing as n78 prime times the (iota, n78) regressor matrix times the estimated coefficients, plus again zero, because also the regressor n78 is orthogonal to the residuals u hat.
00:12:44.000 --> 00:13:09.000
So then n78 prime iota is just the number of houses close to the incinerator site, and n78 prime n78 is the same thing. So both of them are just the number of houses close to the incinerator site, and we have to add up the two regression coefficients here.
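The two pre-multiplication steps just described can be summarized compactly; the numbering (8) and (9) follows the equation numbers the lecture refers to, and the iota notation for the vector of ones is my rendering of the lecture's symbol:

```latex
\iota' p_{78} \;=\; n\,\hat\gamma_{0}^{78} \;+\; n_{nr}\,\hat\gamma_{1}^{78}
\qquad\text{(8)}
\\[4pt]
n_{78}'\,p_{78} \;=\; n_{nr}\left(\hat\gamma_{0}^{78} + \hat\gamma_{1}^{78}\right)
\qquad\text{(9)}
```

Equation (8) sums all 1978 prices; equation (9) sums only the prices of the houses near the incinerator site.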
00:13:09.000 --> 00:13:26.000
So now we can subtract equation nine from equation eight. And if we do this, do it first on the left-hand side, then you get iota prime p78 minus n78 prime p78, so it's (iota minus n78) prime p78.
00:13:26.000 --> 00:13:52.000
And here we get n minus n nr times gamma hat zero of 78, because the gamma one hat of 78 cancels: in both equations it is multiplied by the number of houses close to the incinerator site. And n minus n nr is n fr, so it's the number of houses far from the incinerator site.
00:13:52.000 --> 00:14:07.000
Iota minus n78 also counts the houses far off the incinerator site, because this vector here is one precisely when houses are far from the incinerator site,
00:14:07.000 --> 00:14:17.000
and zero otherwise, so we would sum here the prices of houses which are far off the incinerator site.
00:14:17.000 --> 00:14:28.000
This means that gamma hat zero of 78 is just the average of the 1978 house prices for houses which are far from the incinerator site.
00:14:28.000 --> 00:14:39.000
So very easily you get: gamma zero hat of 78 is one over the number of houses which are far from the incinerator site, times the sum of their prices.
00:14:39.000 --> 00:14:47.000
And I denote this by p bar 78 far.
00:14:47.000 --> 00:15:11.000
But by the same token, equation nine then implies that the sum of those two estimated coefficients is the average of the 1978 house prices for houses near the incinerator site, because we see that the sum here is nothing else but summing the prices of all the houses which are near the incinerator
00:15:11.000 --> 00:15:25.000
site and dividing by the number of such houses. So this is the average of the house prices in 78 for houses close to the incinerator.
00:15:25.000 --> 00:15:42.000
Subtracting eleven from twelve, we get that gamma hat one of 78 is the same thing as the difference between those two averages that we had just before.
00:15:42.000 --> 00:16:02.000
And the same thing can be done for the 1981 regression, where we get that gamma hat one of 81 is the difference between the average house price in 1981 of houses close to the incinerator and the average house price in 1981 of houses far from the
00:16:02.000 --> 00:16:14.000
incinerator. And we know that the difference-in-differences estimator is the difference between gamma hat one of 81, which we have just computed here,
00:16:14.000 --> 00:16:29.000
and gamma hat one of 78, which we had computed on the previous slide. So we see it is the difference between differences of average prices. That's exactly the DID estimator.
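In symbols, the chain of results just derived reads as follows; the bar notation for group averages follows the lecture, the layout is mine:

```latex
\hat\gamma_{0}^{78} = \bar p_{78,fr}, \qquad
\hat\gamma_{0}^{78} + \hat\gamma_{1}^{78} = \bar p_{78,nr}, \qquad
\hat\gamma_{1}^{78} = \bar p_{78,nr} - \bar p_{78,fr},
\\[4pt]
\hat\gamma_{1}^{81} = \bar p_{81,nr} - \bar p_{81,fr},
\qquad
\hat\delta_{1} = \left(\bar p_{81,nr} - \bar p_{81,fr}\right)
               - \left(\bar p_{78,nr} - \bar p_{78,fr}\right).
```

The last line is the difference-in-differences estimator written purely in terms of four group averages.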
00:16:30.000 --> 00:16:44.000
So actually you do not even need to run a regression, you do not even need to invert any kind of matrices; all you have to do is compute the correct averages of house prices as given in this formula.
00:16:44.000 --> 00:17:00.000
And then you take the difference in 1981 of house prices near and far from the incinerator and the same difference in 1978 of house prices near and far from the incinerator and subtract these two differences from each other.
00:17:00.000 --> 00:17:04.000
So you have a difference in differences.
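The recipe just described can be carried out with four averages and three subtractions; the group averages below are hypothetical, chosen only for illustration:

```python
# DID from four group averages; no regression, no matrix inversion needed.
# All numbers are made up (thousands of $), purely for illustration.
p78_near, p78_far = 63.7, 82.5
p81_near, p81_far = 70.6, 101.3

delta1_hat = (p81_near - p81_far) - (p78_near - p78_far)
print(delta1_hat)   # about -11.9 with these made-up averages
```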
00:17:04.000 --> 00:17:21.000
So that's exactly what happens in the regression numerically. This delta one hat, expressed here as a difference of differences, is the same delta one hat I would have obtained had I estimated equation six.
00:17:22.000 --> 00:17:32.000
However, this representation here coincides with the regression results only if we do not have any other covariates in the regression.
00:17:32.000 --> 00:17:47.000
So this is only valid if we have a regressor matrix which really has the form of equation six: the four regressors, which consist just of the constant, the dummy variables, and the product of the two dummies.
00:17:47.000 --> 00:17:53.000
Formula fifteen does not apply to DID estimates with additional covariates.
00:17:53.000 --> 00:18:02.000
But of course we can include further explanatory variables, and we will do so in a few minutes.
00:18:02.000 --> 00:18:21.000
We will learn from including additional covariates that this is actually a good idea and that this may change results quite substantially, and quite reasonably actually, such that we must be careful with the interpretation we have had so far for the difference-in-differences
00:18:21.000 --> 00:18:34.000
estimator, that this amount of almost $12,000 is the effect of the incinerator on house prices close to the incinerator.
00:18:34.000 --> 00:18:47.000
So, the DID framework offers some flexibility there we may include further explanatory variables, and actually we can also extend the framework to multiple time periods.
00:18:47.000 --> 00:19:00.000
Now the reasons why we should include further explanatory variables, if these are available, are similar to the reasons why we should include covariates in other forms of causal analysis.
00:19:00.000 --> 00:19:20.000
One main reason is that usually good covariates will help reduce the error variance and therefore the standard errors of the estimates, which means that the coefficient of interest, the delta one hat, will be estimated more precisely.
00:19:20.000 --> 00:19:41.000
Secondly, when we do include covariates, then the interpretation of what we actually have estimated is easier, because we can distinguish different effects, causal effects or other types of effects, some type of effect heterogeneity for instance,
00:19:41.000 --> 00:19:53.000
from the causal effect we are looking at, namely the effect the construction of the incinerator had on house prices close to the incinerator.
00:19:53.000 --> 00:20:19.000
So, when we return to our example here, we know that equation seven, now written in terms of the incinerator example, would mean that we regress the vector of all prices, 1978 and 1981, on a constant, then on a vector which contains the dummy variable which indicates
00:20:19.000 --> 00:20:34.000
the location so close or far from the incinerator, then the dummy variable which indicates the time period so either 1978 or 1981 and then on the product of the two former regressors.
00:20:34.000 --> 00:20:38.000
So time dummy multiplied by location dummy.
00:20:38.000 --> 00:20:42.000
And the last term is of course the error term.
00:20:42.000 --> 00:20:49.000
So suppose first that this dummy variable is zero, and this dummy variable is also zero.
00:20:49.000 --> 00:20:58.000
Then of course this product here is also zero. So we would just explain p i by beta nought.
00:20:58.000 --> 00:21:21.000
This means that beta nought is the average house price in 1978 for houses which are far from the incinerator site, because this would be the estimate which we get when this dummy variable here is zero, and this is zero when houses are far from the incinerator site.
00:21:21.000 --> 00:21:31.000
And it would be the estimate when the time dummy variable is zero, and well, we give the time variable a zero when we are in the year 1978.
00:21:31.000 --> 00:21:41.000
So this is why beta zero is the measure of the average house prices in 1978 for houses far off the incinerator site.
00:21:41.000 --> 00:21:55.000
What is delta zero? Well, delta zero gives us the time effect. So it captures the average change in all housing values from 1978 to 1981.
00:21:55.000 --> 00:22:10.000
Suppose you have any housing value given by these regressors here; then delta nought describes to you how those house prices change from 1978 to 1981.
00:22:10.000 --> 00:22:17.000
So that's the average change in all housing values from 1978 to 1981.
00:22:17.000 --> 00:22:29.000
Beta one indicates the mere location effect; mere location effect meaning just the location, but not yet the effect of the incinerator.
00:22:29.000 --> 00:22:47.000
So suppose you have any price being explained by a constant, by time, and possibly by the interaction term here; then beta one just changes the value of the houses when the near-incinerator dummy is one as opposed to zero.
00:22:47.000 --> 00:22:58.000
So beta one just gives you the effect of a house being close to the incinerator rather than being far off the incinerator.
00:22:58.000 --> 00:23:11.000
And delta one then measures the decline in housing values due to the new incinerator, because this would be the effect for a house which is close to the incinerator and
00:23:11.000 --> 00:23:31.000
priced in 1981. So it is sort of the additional effect of time on the house price when moving from 1978 to 1981. So that's the difference-in-differences effect.
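The four coefficient interpretations just given amount to the following table of expected prices; this is a standard way to read the DID regression, the cell layout is mine:

```latex
\begin{aligned}
E[p \mid \text{far},\,1978]  &= \beta_0 \\
E[p \mid \text{near},\,1978] &= \beta_0 + \beta_1 \\
E[p \mid \text{far},\,1981]  &= \beta_0 + \delta_0 \\
E[p \mid \text{near},\,1981] &= \beta_0 + \beta_1 + \delta_0 + \delta_1
\end{aligned}
```

Taking the near-minus-far difference in each year and then differencing across years leaves exactly delta one.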
00:23:31.000 --> 00:23:49.000
The assumption which we have implicitly made when interpreting the regression this way is that average house prices in 1978 and 1981 are not different for any other reason than the passage of time and the construction of the incinerator.
00:23:49.000 --> 00:24:02.000
So this equation only gives us reasonable estimates when all the influences which change prices have been captured by the regressors of this equation.
00:24:02.000 --> 00:24:14.000
And well this equation just captures the influences of time and closeness to the incinerator and the construction of the incinerator.
00:24:14.000 --> 00:24:40.000
But it is obvious that the averages can only be compared. So the averages of prices in 1978 and averages of prices in 1981 can only be compared if the houses sold in 1978 are by all other characteristics, just sort of the same as the houses which we sampled in 1981.
00:24:40.000 --> 00:24:43.000
Now this is rather unlikely.
00:24:43.000 --> 00:24:54.000
For instance, already because houses sold in 1978 may have had different age than houses sold in 1981.
00:24:54.000 --> 00:25:13.000
And just by the passage of time, of course, houses in 1981 are probably older than houses in 1978, unless houses which existed in 1978 have been replaced by newly built houses in the three years between 1978 and 1981.
00:25:13.000 --> 00:25:26.000
And in general, it would be quite a coincidence if the ages of the houses sampled in 1978 were the same as the ages of the houses sampled in 1981.
00:25:26.000 --> 00:25:35.000
Moreover, there's a tendency in real estate that houses become larger, more comfortable and better equipped.
00:25:35.000 --> 00:26:03.000
And since it is very likely that such effects interfere with our causal inference, it is reasonable to include covariates which capture important characteristics of houses.
00:26:03.000 --> 00:26:23.000
What we do now is augment the basic DID regression by including covariates: controls which measure the age of the houses, the distance to the next interstate, the land area which goes along with the house, the house area, so the area you can actually live on,
00:26:23.000 --> 00:26:30.000
the number of rooms and the number of baths. This is the type of information we have available.
00:26:30.000 --> 00:26:55.000
Obviously, all these variables have some impact on house prices. And therefore, we can interpret our causal effect only if it happened to be the case that all the houses in 1978 came from the same kind of distribution of these
00:26:55.000 --> 00:27:09.000
characteristics here as the houses sold in 1981, so that the two samples sort of match in the distribution of these characteristics.
00:27:09.000 --> 00:27:24.000
In the regression which we now run, we will have all those covariates just enter the regression linearly with the exception of age, where we provide for a quadratic regressor too.
00:27:24.000 --> 00:27:44.000
Since it is not so reasonable that houses depreciate linearly in value over time. Actually, after some time it may be that the loss in value is rather small, or that a really old house is actually cherished by purchasers and higher prices are being paid for it.
00:27:44.000 --> 00:27:49.000
So we should allow for some nonlinearity here.
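The quadratic-in-age specification just motivated looks as follows; the coefficient names alpha one and alpha two are mine, not the lecture's:

```latex
p_i = \dots + \alpha_1\,\mathit{age}_i + \alpha_2\,\mathit{age}_i^2 + \dots
```

With alpha one negative and alpha two positive, depreciation slows down as the house ages and eventually reverses once age exceeds the turning point at minus alpha one over two alpha two.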
00:27:49.000 --> 00:28:09.000
In the table that is to follow, nearinc means that the house is near the incinerator, so it's close to the incinerator. So nearinc is the regressor which replaces our variables n78 and n81.
00:28:09.000 --> 00:28:20.000
And y81 is just the time dummy for 1981, so this corresponds to the dummy D81 which I have used in the previous notation.
00:28:20.000 --> 00:28:39.000
So here are now three regressions with different sets of covariates in the three columns here: regression one, regression two, and regression three. All of them are DID regressions where the stacked vector of prices is the dependent variable.
00:28:39.000 --> 00:28:56.000
And then we see certain regression results for the constant, for the year dummy, for the location dummy, and then for the interaction dummy between year and location, which gives us the delta one coefficient we are interested in.
00:28:56.000 --> 00:29:01.000
So that's the causal effect of the incinerator if we do everything well.
00:29:02.000 --> 00:29:09.000
And then there are other controls, and I do not show you all the results for the other controls. They are perhaps not that interesting.
00:29:09.000 --> 00:29:21.000
At least not for our purposes. In regression one, we have no other controls. In regression two, we just control for age, so we have two additional regressors, namely age and age squared.
00:29:22.000 --> 00:29:39.000
And in regression three, we include the full set of controls. So regression one is the regression we already know. The estimate of delta one is approximately minus $12,000.
00:29:39.000 --> 00:29:53.000
So this is exactly the estimate we have already computed. And we see now that this estimate is actually not significant, because the standard error is 7,400.
00:29:53.000 --> 00:30:18.000
So the 11,800 or 11,900 which we have here is less than two times the standard error. So this estimate is not significant. So regression one does not tell us that the incinerator has had any negative impact on house prices, at least not a significant negative impact.
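The rule of thumb used here, comparing the estimate with two standard errors, in numbers (figures as quoted in the lecture, in thousands of dollars):

```python
# Significance check for regression one: point estimate vs. standard error.
estimate, std_err = -11.9, 7.4    # thousands of $, as quoted in the lecture
t_stat = estimate / std_err
print(round(t_stat, 2))           # about -1.61, well inside +/- 2
assert abs(t_stat) < 2            # not significant at the usual 5% level
```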
00:30:18.000 --> 00:30:29.000
The authors of the study had used somewhat more than 300 observations, and the R squared for this first regression without covariates was 0.17.
00:30:29.000 --> 00:30:41.000
Now in the second regression, we use two more controls, and the R squared then jumps considerably, actually to 41%, same number of observations.
00:30:41.000 --> 00:30:58.000
Let's look at what happens to our delta one. Suddenly, we estimate a much bigger negative price effect of the incinerator on house prices, namely approximately minus $22,000.
00:30:58.000 --> 00:31:11.000
So it's almost twice as much as we have estimated in regression one. And this estimate here is quite highly significant: the t statistic is well beyond three.
00:31:11.000 --> 00:31:19.000
So here we really see that the incinerator has made a difference in house prices.
00:31:19.000 --> 00:31:35.000
In the third regression, we include also all the other covariates, and we see that again our inference changes quite strongly, because now the estimate of delta one is much smaller in magnitude than it was when we just controlled for age.
00:31:35.000 --> 00:31:52.000
So now we estimate a price difference of $14,000, so damage to house owners close to the incinerator site of about $14,000 rather than $22,000. So we lose about a third of this estimate here.
00:31:52.000 --> 00:32:05.000
And we see that even though it is smaller, the estimate is still highly significant, because the standard error is approximately 5,000.
00:32:05.000 --> 00:32:15.000
And this means that we have a t statistic here of about three, somewhat less than three, so very clearly significant.
00:32:16.000 --> 00:32:20.000
A few things to say about that.
00:32:20.000 --> 00:32:39.000
And I actually have prepared a number of questions for you. But I think I will pose the questions now and answer them myself. Actually, I have provided for the answer to be in the slides because perhaps it's not so easy when you're inexperienced in these things to answer the questions freely.
00:32:39.000 --> 00:32:56.000
But you may make the test at home that you first look again at the regression results, then look at the questions, try to answer them on your own and see whether you can remember what I explained about it and can reproduce the answers or even check them critically,
00:32:56.000 --> 00:33:00.000
and then compare with the answers that I have provided.
00:33:00.000 --> 00:33:13.000
So one thing, for instance, which you should mention is the following: regardless of whether we look at the time dummy or the location dummy or the interaction term between time and location,
00:33:13.000 --> 00:33:23.000
We see that the standard errors which are estimated for the coefficients decrease when we have a higher number of controls.
00:33:23.000 --> 00:33:36.000
So here the standard error was 4,000, here 3,400, here 2,800. So this estimate is much more precise than this estimate here.
00:33:36.000 --> 00:33:59.000
Similarly, here, even though the fall in the standard error is not quite as drastic: we have 4,900 here, then essentially 4,800 here, and essentially 4,500 here. So again, the standard error does decrease and the estimate becomes more precise.
00:33:59.000 --> 00:34:14.000
Also, the same thing is true for the product term for the interaction term where the standard error decreases from 7,400 to 6,400 to approximately 5,000.
00:34:14.000 --> 00:34:35.000
So again, this is the most precise estimate we have. So quite clearly, the standard errors decrease when we increase the set of regressors, increase the number of covariates; but there is one big exception, which you may have spotted already, and this is about the constant.
00:34:35.000 --> 00:34:50.000
For the constant, indeed, the standard error does decrease from regression one to regression two, but look what happens when we go to regression three here. Suddenly, the standard error is much, much bigger.
00:34:50.000 --> 00:34:54.000
Here it's 2,400, here it's 11,000.
00:34:55.000 --> 00:35:12.000
And let's also check what happens to the coefficient of the constant. Well, the constant here is 82,500. Here it's 89,000, so approximately the same range; well, not quite, given the standard error, it's actually significantly higher here.
00:35:12.000 --> 00:35:19.000
But here it's just 14,000. So there's a huge drop in the constant here.
00:35:19.000 --> 00:35:34.000
And one question, which I will raise now, is: what does this mean? Why, in this regression, which seems to be pretty good with respect to these regressors here, do we have such a low constant?
00:35:34.000 --> 00:35:44.000
Why is it not significant anymore? And why is it so low?
00:35:44.000 --> 00:36:02.000
Before I come to the answer of that, there's another thing which is quite interesting, and this is that the sign which we estimate for the location dummy changes from regression one to regression two and stays positive in regression three.
00:36:03.000 --> 00:36:17.000
So recall, the sign of this regressor here told us by how much the location, without the incinerator, impacts the prices of houses close to this site.
00:36:17.000 --> 00:36:36.000
And the interpretation we had in regression one was that there is a negative impact. So we argued houses close to the incinerator site were cheaper in 1978 already, before the incinerator was built.
00:36:36.000 --> 00:36:41.000
And they were cheaper by on average $19,000.
00:36:42.000 --> 00:36:58.000
Now, regressions two and three do not support this conclusion anymore, they actually reverse it: here houses close to the incinerator would be more expensive on average,
00:36:58.000 --> 00:37:03.000
and here also, more expensive than houses which are far off the incinerator.
00:37:03.000 --> 00:37:12.000
Well, why is that? Let's look first at regression two here. Clearly in equation two we control for age, right?
00:37:12.000 --> 00:37:28.000
So apparently it is the case that houses which are close to the incinerator were probably older than houses which were far off the incinerator site.
00:37:28.000 --> 00:37:43.000
And this age effect was responsible for the negative estimate which we had here where we just compared all houses close to the incinerator with all houses far off the incinerator.
00:37:43.000 --> 00:37:56.000
We did not control for the fact that the houses close to the incinerator were, for whatever reason, older than the houses that were far off the incinerator and older houses are probably less valuable.
00:37:56.000 --> 00:38:09.000
Right? So this is why we had a negative effect here. But once we control for age, all the negative effect goes into the age regressors here; we would actually see that if we had the coefficients available.
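The age story can be illustrated with a stylized example: suppose near-incinerator houses are systematically older, age lowers value, and the true location premium is positive; all numbers below are made up:

```python
# Stylized omitted-variable illustration: age confounds the location effect.
import numpy as np

near  = np.array([1.0, 1.0, 0.0, 0.0])
age   = np.array([50.0, 70.0, 5.0, 15.0])   # near houses are older
price = 100.0 + 5.0 * near - 0.5 * age      # true premium +5 (thousands of $)

# Naive comparison of averages: looks like a negative "location effect".
naive = price[near == 1].mean() - price[near == 0].mean()

# Controlling for age recovers the positive location premium.
X = np.column_stack([np.ones(4), near, age])
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)
print(naive, beta_hat[1])   # naive is negative, age-controlled is +5
```

The raw near-minus-far difference is negative purely because of the age gap; once age enters the regression, the location coefficient turns positive, mirroring the sign flip between regressions one and two.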
00:38:09.000 --> 00:38:21.000
And after having controlled for age, apparently the location close to the incinerator was more valuable than locations far off the incinerator.
00:38:21.000 --> 00:38:30.000
One reason may, for instance, be that the incinerator was built somewhere where there was a good traffic infrastructure.
00:38:30.000 --> 00:38:42.000
Of course, obviously the town of North Andover might have thought: we need to build the incinerator where the garbage trucks can easily go to and come from.
00:38:42.000 --> 00:38:48.000
So we need good roads there. And good roads typically raise prices of houses.
00:38:48.000 --> 00:39:04.000
So apart from the fact that the incinerator is a nuisance, the houses which were close to the incinerator had probably better traffic infrastructure than the houses far off the incinerator, and therefore, after controlling for other differences like, for instance,
00:39:04.000 --> 00:39:11.000
age, it turned out that those houses here were actually more valuable.
00:39:11.000 --> 00:39:28.000
Now this effect diminishes a little bit when we also take other indicators into account, which also capture certain determinants of the prices, like the area of the house, the number of rooms, the number of baths and all these kinds of things.
00:39:28.000 --> 00:39:44.000
Obviously, houses which are close to the incinerator had different characteristics on average from houses far off the incinerator. So these other controls here again changed the estimates of the prices.
00:39:44.000 --> 00:39:56.000
And after having controlled for everything we can control for, it still turns out that houses close to the incinerator were actually more valuable than houses far from the incinerator.
00:39:56.000 --> 00:40:10.000
Sorry, this positive sign says that houses close to the incinerator were more valuable, prior to the incinerator being built, than houses which were far from this location.
00:40:10.000 --> 00:40:19.000
But the amount by which these houses were more expensive is not as big anymore as it is in regression two.
00:40:19.000 --> 00:40:32.000
Now comes the question, why is the constant so low here? And the answer to that is actually very easy.
00:40:32.000 --> 00:40:41.000
The question may also be framed as: what does this constant tell us, what does that constant tell us, and what, finally, does this third constant give us?
00:40:41.000 --> 00:40:56.000
Let's start with this first constant. This constant here is the average price of a house in 1978 and far from the incinerator site, as we have discussed, right?
00:40:56.000 --> 00:41:11.000
So this is the average price of a house which is not sold in 1981 but sold in 1978, and which is not close to the incinerator site, so that all these three effects here don't play a role.
00:41:11.000 --> 00:41:19.000
So it's the average price of a house in 1978 far from the incinerator site.
00:41:19.000 --> 00:41:38.000
This constant here is the average price of a house sold in 1978 and far from the incinerator site, when it was newly built.
00:41:38.000 --> 00:41:48.000
So basically the same thing as here, but also including the fact that the house was newly built, because age is being controlled for here.
00:41:48.000 --> 00:41:58.000
So this value here is the value which obtains when age is zero, right? So age being zero means the house is newly built.
00:41:58.000 --> 00:42:17.000
So in this case here it is the price of a house which has been just constructed, and this is of course more expensive than the average house which was far from the incinerator site and sold in 1978, which is the same property which we have here.
00:42:17.000 --> 00:42:23.000
But here in addition we have the property that the house has just been constructed.
00:42:23.000 --> 00:42:27.000
Now let's think about this thing here.
00:42:27.000 --> 00:42:38.000
By the same token we may say it is the price of a house sold in 1978, far from the incinerator site, and newly built.
00:42:38.000 --> 00:42:41.000
So that would not explain why the price is so low.
00:42:41.000 --> 00:42:46.000
But now recall that we have the full set of controls here.
00:42:47.000 --> 00:43:03.000
So it is the house, as I have just said, sold in 1978, far from the incinerator, newly constructed, with no baths, no rooms, no space to live in, no yard, nothing.
00:43:03.000 --> 00:43:11.000
So all the controls actually set to a level equal to zero, right?
00:43:11.000 --> 00:43:26.000
This is why the constant here is so low because it implicitly assumes that this is the price of a house which is completely unattractive, which one can actually not live in because there is no bath, no room, no space, nothing.
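The point can be seen mechanically: in any linear regression, the intercept is the fitted value when every regressor equals zero. A minimal sketch with made-up toy data (not the lecture's housing data set; all numbers and variable names here are hypothetical):

```python
import numpy as np

# Toy illustration: the intercept of a fitted regression is the predicted
# value when all regressors equal zero. With many controls (age, rooms,
# baths, area, ...), "all zero" describes a house nobody could live in,
# so a very low, even insignificant, intercept is unsurprising.
rng = np.random.default_rng(42)
n = 300
age = rng.uniform(0, 80, n)            # house age in years
rooms = rng.integers(2, 10, n)         # number of rooms
price = 5.0 + 1.2 * rooms - 0.3 * age + rng.normal(0, 5.0, n)

X = np.column_stack([np.ones(n), age, rooms])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

x0 = np.array([1.0, 0.0, 0.0])         # newly built house with zero rooms
print(beta[0], x0 @ beta)              # intercept equals the fitted value at zero
```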
00:43:26.000 --> 00:43:31.000
Very clearly the appropriate estimate here would be zero.
00:43:31.000 --> 00:43:39.000
But watch the standard error. The standard error is 11 and the estimate is 14 so this estimate is not significant.
00:43:40.000 --> 00:43:49.000
In effect, we have estimated zero here, and this is completely reasonable, because this should be the price of a house which has a value of zero.
00:43:53.000 --> 00:44:03.000
I think I have covered most of the questions already which I have prepared here. What happens to the standard errors of the estimates as the number of controls increases? That I have covered.
00:44:04.000 --> 00:44:16.000
How do we explain the results for the constant in column three? Those were just my last remarks. How do we explain the difference between the estimated constants in columns one and two? That I have also explained: the newly built house in column two.
00:44:16.000 --> 00:44:24.000
What does the sign change for nearinc with an increasing number of controls mean? That I have covered.
00:44:24.000 --> 00:44:29.000
What do we learn about the price effect of the incinerator and how reliable is our knowledge?
00:44:29.000 --> 00:44:44.000
Perhaps I haven't said this so explicitly. Well, I mentioned it, the standard error here is much lower than it was here. So this estimate is more reliable than the estimates we have had previously.
00:44:44.000 --> 00:44:58.000
And how well do we explain the observed data? Oh yes, I think I didn't mention here that the R squared is 66%. So again, up quite a bit from the 41% here, and much higher than the 17% here.
00:44:58.000 --> 00:45:02.000
So we explain quite a bit of the variance, two thirds of it.
00:45:02.000 --> 00:45:10.000
All right, here are the answers which you may look at again at home when you study this example.
00:45:10.000 --> 00:45:22.000
Now, the case of this incinerator is an example of what is sometimes called a natural experiment or quasi experiment.
00:45:22.000 --> 00:45:37.000
It is not an experiment which has been designed by somebody who intentionally wanted to carry out an experiment; rather, what we mean by a natural experiment is that there is some exogenous event.
00:45:37.000 --> 00:45:52.000
So for instance, a change in government policy exogenous to the people affected by the event, which changes the environment in which the agents live or work.
00:45:52.000 --> 00:46:03.000
So if we have such an exogenous event, then for a quasi experiment it is important that we have some people who are affected by the exogenous event and others who are not.
00:46:03.000 --> 00:46:19.000
So that we have a treatment group, and then we have a control group. And if this is so, we may say this is almost as good as an intentionally designed experiment.
00:46:19.000 --> 00:46:48.000
And we can apply the techniques which we have learned for randomized experiments analogously now to natural experiments or quasi experiments, and this is precisely what the difference-in-differences estimator does: evaluate experiments which come from policy changes or other exogenous events.
00:46:48.000 --> 00:47:07.000
Yeah, I have some discussion which is perhaps not giving lots of additional insight here, just perhaps to illustrate what other natural experiments there may be.
00:47:07.000 --> 00:47:18.000
Suppose that the government cuts the level of unemployment benefits, but only for some group of people, group A, the treatment group.
00:47:18.000 --> 00:47:35.000
And group A normally has longer unemployment durations than group B, the control group. Then you can of course look, after the cut in the unemployment benefits, at the unemployment durations of group A and see whether unemployment durations have been reduced,
00:47:35.000 --> 00:47:57.000
which would then support the hypothesis that unemployment benefits have disincentive effects for people to look for new work, and obviously you can apply a difference in differences estimator to such data sets.
00:47:57.000 --> 00:48:13.000
What we always need is the assumption that treatment and control group are not affected by anything other than the change in policy.
00:48:13.000 --> 00:48:34.000
So in particular, since typically things tend to change over time, if there are trend influences which change over time, then what we need is that this change which happens over time affects both the treatment group and the control group in the same way.
00:48:34.000 --> 00:48:38.000
This is sometimes called the common trend assumption.
00:48:38.000 --> 00:48:55.000
So we would need the assumption that in the absence of treatment, treatment and control group would develop in exactly the same way, and that only due to the treatment, so only due to the policy experiment,
00:48:55.000 --> 00:49:04.000
is there a difference between what happens in the treatment group and what happens in the control group.
00:49:04.000 --> 00:49:16.000
This assumption is a necessary assumption, but again it is an assumption which cannot be tested directly, because it is an assumption which concerns the counterfactual development.
00:49:16.000 --> 00:49:34.000
Here's again a table which explains the principle of difference-in-differences estimation. I think we have basically covered this, but look at this table at home; it interprets again the coefficients which are being estimated and then takes the difference.
00:49:34.000 --> 00:49:56.000
So this is treatment minus control, and this is after minus before, and if you take the difference between treatment and control after, and subtract from this the difference between treatment and control before, then we just arrive at the delta one, which is the coefficient of interest.
00:49:56.000 --> 00:50:04.000
Another example here, to perhaps conclude on difference-in-differences estimation problems.
00:50:04.000 --> 00:50:23.000
There was a research project on workers' compensation, workers' compensation laws, and the number of weeks a worker was out of work due to some injury he has suffered during his work.
00:50:23.000 --> 00:50:45.000
So in general, injured workers are entitled to receive benefits, and the state of Kentucky in 1980 raised the maximum benefit for injured workers from $131 per week to $217 per week, so quite a substantial increase in the maximum benefit.
00:50:45.000 --> 00:50:57.000
But no worker was allowed to receive more than 66% of his regular pay per week so there was a cap on the benefit.
00:50:57.000 --> 00:51:09.000
And now this cap was not changed, so it was still in force that no worker was allowed to receive more than 66% of his regular pay.
00:51:09.000 --> 00:51:34.000
This means that low income workers did not really benefit from this increase in benefits, because, at least for the lowest income group, $131 was already more than 66% of their regular wage.
00:51:34.000 --> 00:51:58.000
So, increasing the benefit here did not do any good for the really low income workers, but workers with somewhat higher income were benefiting from this increased absolute payment per week, because the $131 here did not yet reach 66% of their regular pay per week.
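The benefit rule just described can be sketched in a few lines. The replacement rate and the caps follow the lecture's numbers ($131 before, $217 after, at most 66% of the regular weekly wage); the two wage values are hypothetical:

```python
# Weekly benefit rule: 66% of the regular wage, but never more than the cap.
def weekly_benefit(wage, cap):
    return min(0.66 * wage, cap)

low_wage, high_wage = 150.0, 500.0

# Low earner: 0.66 * 150 = 99, below both caps, so the cap increase changes nothing.
print(weekly_benefit(low_wage, 131), weekly_benefit(low_wage, 217))

# High earner: 0.66 * 500 = 330, so the binding cap rises from 131 to 217.
print(weekly_benefit(high_wage, 131), weekly_benefit(high_wage, 217))
```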
00:51:58.000 --> 00:52:15.000
This is not as unjust as it may seem, because it is just to say that a worker who has been injured while working should receive a certain percentage of his regular wage during the time in which he cannot work.
00:52:15.000 --> 00:52:34.000
And you may even say that the initial situation was not just, where perhaps the really low paid workers received 66% of their pay but the higher paid workers received only 40% or even 33% of their pay while they were out of work.
00:52:34.000 --> 00:52:47.000
Whatever, we are not here to talk about equity aspects of government policies, but rather what we want to study is natural experiments.
00:52:47.000 --> 00:53:08.000
And here is a natural experiment in the sense that before the increase, it was true that the weekly benefit amount had a certain minimum and then was increasing, at 66% of earnings, up to $131.
00:53:08.000 --> 00:53:26.000
So this was the maximum benefit which was being paid out. So everybody who had higher wages than this, so down here on the x axis there's earnings, right, they just received $131.
00:53:26.000 --> 00:53:35.000
Whereas, after the increase, the maximum benefit was $217.
00:53:35.000 --> 00:53:46.000
So the cap was only coming into effect at earnings E3 rather than at earnings E2.
00:53:46.000 --> 00:53:59.000
So what we have now here is a low earnings group between E1 and E2, which did not benefit from the increase in the maximum benefit.
00:53:59.000 --> 00:54:15.000
And then we have a high earnings group from E3 up to, I don't know, infinity, in which each and every worker did benefit by this amount here from the increase in the maximum benefits.
00:54:15.000 --> 00:54:31.000
Now, in the publication by the authors of this study, which was published in the American Economic Review in 1995, the middle income group between E2 and E3 was left out, because for them the effect is mixed.
00:54:31.000 --> 00:54:44.000
But what Meyer, Viscusi and Durbin, these are the authors, wanted to check was whether higher benefits decrease workers' incentives to avoid injuries.
00:54:44.000 --> 00:54:54.000
So perhaps one may say: if the benefit is higher, people are less careful, because they think: I'm also quite well paid when I'm out of work.
00:54:54.000 --> 00:55:01.000
So perhaps the duration of workers being out of work increases due to higher benefits.
00:55:01.000 --> 00:55:07.000
Or higher benefits may increase the incentives to fight for compensation at all.
00:55:07.000 --> 00:55:13.000
So for the low benefits, perhaps some people didn't even bother to file for compensation.
00:55:13.000 --> 00:55:17.000
If it's lots of paperwork, perhaps it makes sense just not to do it.
00:55:18.000 --> 00:55:26.000
And also higher benefits may foster more claims for non-work injuries, so cheating may actually increase.
00:55:26.000 --> 00:55:34.000
So these two aspects here also would result in longer weeks of workers out of work.
00:55:34.000 --> 00:55:46.000
And finally, higher benefits may make extending the duration of a claim more attractive, so staying out of work longer just because you receive higher benefits.
00:55:46.000 --> 00:55:51.000
So that could be checked in a difference in differences estimate.
00:55:51.000 --> 00:55:59.000
And we might find then that there is an increase in the average duration of injured workers being out of work.
00:55:59.000 --> 00:56:23.000
But we would not be able to identify what the cause for the increase in the average duration is: whether it is a decrease in incentives to avoid injuries, or an increase in incentives to fight for compensation at all, or more claims for non-work injuries, or cheating, or whether extending the duration of a claim is so very attractive.
00:56:23.000 --> 00:56:38.000
Then we would not be able to distinguish. We would just be able to find out whether the higher benefits were causal for an increase in the average duration of injured workers.
00:56:38.000 --> 00:56:54.000
All right. And here is what Meyer, Viscusi and Durbin have found out. So for the high earnings group, the log duration before the increase was 1.38.
00:56:54.000 --> 00:57:03.000
So the log duration for being out of work was 1.38. And after the increase, it was 1.58, so it has increased.
00:57:03.000 --> 00:57:20.000
Whereas for the low earnings group, the log time out of work before the increase was 1.13, and it was essentially unchanged after the increase.
00:57:20.000 --> 00:57:33.000
So what we may do is that we just compute the differences of those averages, which we find in this table here, 2 minus 1, right, 2 minus 1 is 0.2.
00:57:33.000 --> 00:57:41.000
And 4 minus 3 is, well, here it's within rounding error at 0.01, but essentially zero.
00:57:42.000 --> 00:57:54.000
And then we have to take the difference between column five and column six. So the differences, the difference in differences, this would be 0.2 minus 0.01 is 0.19.
00:57:54.000 --> 00:58:08.000
And that's the difference in differences estimator which Meyer, Viscusi and Durbin had computed. It is actually highly significant, with a standard error of 0.07.
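The 2x2 table arithmetic just described can be written out directly. The group means are those quoted from Meyer, Viscusi and Durbin (1995); the post-change mean for the low earnings group (1.14) is implied by the 0.01 difference mentioned in the lecture:

```python
# Group-mean log durations out of work, before and after the 1980 benefit increase.
high_before, high_after = 1.38, 1.58   # high earners: affected by the cap increase
low_before, low_after = 1.13, 1.14     # low earners: the control group

diff_treatment = high_after - high_before   # about 0.20
diff_control = low_after - low_before       # about 0.01

did = diff_treatment - diff_control         # the difference in differences
print(round(did, 2))
```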
00:58:08.000 --> 00:58:20.000
If I do a DID regression on the same data in the way in which I have explained it to you, I estimate precisely the same coefficient. It's 0.19, and the standard error is 0.07.
00:58:20.000 --> 00:58:33.000
The data I have uploaded to steam so you can easily reproduce this for a great number of workers, right, 5,600.
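The same estimate comes out of a regression with a group dummy, a period dummy, and their interaction, whose coefficient is the DID estimator. A minimal sketch with simulated stand-in data, since the uploaded data set is not reproduced here; the variable names (after, highearn) and the simulated coefficients are illustrative only, with the true interaction effect set to 0.19:

```python
import numpy as np

# DID regression: y = b0 + b1*after + b2*highearn + delta*(after*highearn) + u.
rng = np.random.default_rng(0)
n = 5600
after = rng.integers(0, 2, n)       # observation from after the benefit increase?
highearn = rng.integers(0, 2, n)    # high-earnings (treatment) group?
y = 1.13 + 0.01 * after + 0.25 * highearn + 0.19 * after * highearn \
    + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), after, highearn, after * highearn])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[3])   # the estimate of delta, which should come out near 0.19
```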
00:58:33.000 --> 00:58:50.000
Okay, this I have said, and yeah, note the low r squared, please, it's just 2%. So it is possible to get highly significant estimates if r squared is low.
00:58:50.000 --> 00:59:06.000
Now I have also prepared a number of slides for DID estimation with multiple time periods, but due to time reasons, I won't cover this. That is a straightforward extension. So we just skip this here
00:59:06.000 --> 00:59:21.000
and go to the common trend assumption, which I already mentioned that in the absence of the treatment, the treatment group should have followed the same trend as the control group.
00:59:21.000 --> 00:59:42.000
So if we distinguish the potential outcomes here by A and B for after and before, we would want that for the no-treatment outcomes, so indexed zero here,
00:59:42.000 --> 00:59:59.000
the difference between after and before for the group of the treated would be the same as the difference between after and before for the group of those which have not been treated.
00:59:59.000 --> 01:00:15.000
That is the common trends assumption, and without this assumption holding, the DID estimator and the inference based on it would be distorted.
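In potential-outcomes notation, writing $Y^0$ for the no-treatment outcome, $A$ and $B$ for after and before, and $D$ for the treatment indicator, the condition just stated can be written as follows (a sketch of the notation, which may differ slightly from the slide):

```latex
% Common trend: absent treatment, both groups would have moved in parallel.
\mathbb{E}\!\left[ Y_A^0 - Y_B^0 \mid D = 1 \right]
  = \mathbb{E}\!\left[ Y_A^0 - Y_B^0 \mid D = 0 \right]
```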
01:00:15.000 --> 01:00:33.000
Again, this assumption cannot be tested, unfortunately; it is just an identification assumption. But sometimes one can make this assumption plausible, or one can argue that it is not plausible, and to the setting where this assumption is not plausible we now turn.
01:00:33.000 --> 01:00:43.000
Okay, here's again your research quiz question, which you may want to think about. Can you think of a way to apply a difference-in-differences strategy for your case?
01:00:43.000 --> 01:00:56.000
What would then be the treatment and what is the control group? And please describe the common trend assumption for your case.
01:00:56.000 --> 01:01:13.000
Good. There's another application by Markus Ziedler, which I skip because I want to come to the case where the common trend assumption is violated and this is now 9.3 synthetic control groups.
01:01:13.000 --> 01:01:26.000
And here again, I just present you the basic idea by means of an application. I should perhaps first mention what a synthetic control group is.
01:01:26.000 --> 01:01:45.000
The idea is that if the actual control groups we have in our sample seem to follow a different trend from the one the treatment group is following, then we may try to construct synthetically, from the actual controls,
01:01:45.000 --> 01:01:58.000
a control which is artificial, synthetic, but which has the same trending behavior as the treated unit.
01:01:58.000 --> 01:02:14.000
So we construct a counterfactual using a weighted average of the natural control units in such a way that we think it likely that the common trends assumption holds.
01:02:14.000 --> 01:02:33.000
The classical study for that is, well, I would really say it's already a classic even though it's just 10 years old: Abadie, Diamond and Hainmueller's study on the effects of California's tobacco control program, which I give you here as a reference.
01:02:33.000 --> 01:02:40.000
Let me take the question which comes in.
01:02:40.000 --> 01:02:55.000
Now I really would like to know whether the material I skipped is relevant for the exam. No, it is not. For the exam is relevant only what I have covered in the lecture and nothing beyond what I've covered, but everything I've covered is relevant.
01:02:55.000 --> 01:02:59.000
All right.
01:02:59.000 --> 01:03:12.000
Now, what happened here was that the state of California in the United States enacted a tobacco control program in 1988.
01:03:12.000 --> 01:03:28.000
So this was called Proposition 99. This was the name of the program, and it was implemented in 1988 with the aim of reducing cigarette smoking, or more generally, tobacco consumption.
01:03:28.000 --> 01:03:50.000
So we have just one treated unit here, which is California. And then we have all the other American states, or many of them, which did not change their tobacco control programs, if they had any; so they didn't change their status in terms of tobacco control measures.
01:03:50.000 --> 01:04:01.000
This graph shows you per capita cigarette sales in packs from 1970 to the year 2000.
01:04:01.000 --> 01:04:12.000
And the black line here, the solid line, is California, whereas this one here is the rest of the United States.
01:04:12.000 --> 01:04:21.000
It is, as I say, per capita, so we can easily compare California with the rest of the United States despite different populations.
01:04:21.000 --> 01:04:30.000
What we see is that in California and the United States, initially, consumption was rising until about 1975.
01:04:30.000 --> 01:04:46.000
In California, it was decreasing quite strongly between 1975 and 1988. Here the dotted line is 1988 when the program was initiated, was implemented.
01:04:46.000 --> 01:05:01.000
We also have a declining trend in the rest of the United States, but it sets in somewhat later, and the decline is perhaps not as steep as it is in California, already prior to the tobacco control program.
01:05:01.000 --> 01:05:19.000
So quite clearly the common trends assumption would be a very courageous assumption here, because we observe already in the time period in which there was not yet any tobacco control program implemented in California that trends are different.
01:05:19.000 --> 01:05:31.000
So it would be very, very implausible that the trends would be the same, or that the potential outcomes would be the same, in the time after the measure was implemented.
01:05:31.000 --> 01:05:49.000
So, there is no sense in somehow constructing an estimate between the dotted line here and the solid line here as the main effect of the tobacco control program.
01:05:49.000 --> 01:05:58.000
The solution for this problem by Abadie, Diamond and Hainmueller was to create an artificial California.
01:05:58.000 --> 01:06:19.000
And from all the other 49 states, they took some states and constructed a linear combination, such that the artificial California, the synthetic control group, was as similar to California pretreatment as possible.
01:06:19.000 --> 01:06:42.000
So they tried to find a linear combination of the other states in the United States, such that cigarette consumption per capita in packs prior to 1988 was almost the same in the synthetic California which they constructed, as in the real world California which we see in this graph here.
01:06:42.000 --> 01:06:55.000
And this is the result of what they have done. Solid is again the true California, and the dotted line is now the synthetic California which Abadie, Diamond and Hainmueller have constructed.
01:06:55.000 --> 01:07:16.000
And then they use the same weights, applied to the same control states in the United States, to compute what the counterfactual California would have had in terms of cigarette consumption had it not implemented Proposition 99.
01:07:16.000 --> 01:07:42.000
And compare this to the actual experience of California under Proposition 99. Very obviously, the difference between these two is then taken to be the causal effect of Proposition 99, so the degree by which California has succeeded at reducing smoking by Proposition 99.
01:07:42.000 --> 01:08:05.000
How do Abadie, Diamond and Hainmueller construct this? Well, actually they use a number of variables, for instance the average retail price of cigarettes, and the per capita state personal income, and the percentage of the population in the age between 15 and 24.
01:08:05.000 --> 01:08:15.000
And the per capita beer consumption, and three years of lagged cigarette consumption, so three lags of the outcome variable.
01:08:15.000 --> 01:08:27.000
So all these variables are somehow related, somehow correlated, Abadie, Diamond and Hainmueller hope, to cigarette consumption.
01:08:27.000 --> 01:08:48.000
And what they do now is that they look for a combination of other states of the United States, such that the average retail price in this linear combination, the synthetic California, is close to the average retail price of cigarettes in California.
01:08:48.000 --> 01:08:59.000
And the per capita state personal income is close to the one in California. And the same thing for all these so-called predictors here.
01:08:59.000 --> 01:09:14.000
So they minimize some distance between these variables in California and a linear combination of these variables from other states.
01:09:14.000 --> 01:09:27.000
Minimize means using those weights for constructing the linear combination which minimize the distance.
01:09:27.000 --> 01:09:44.000
These are, as I have written on the slide here, chosen in order to minimize some distance measure between the treated unit and the synthetic controls. It's a somewhat complicated distance measure, so I won't go into the details here.
01:09:44.000 --> 01:09:51.000
Basically, they minimize something like a mean squared error for the pretreatment outcomes.
01:09:51.000 --> 01:09:58.000
The weights they use are constrained to be non-negative and to sum up to one.
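The idea of choosing simplex weights to match the pre-treatment path can be sketched with numpy alone. The donor series here are hypothetical toy numbers, and the coarse grid search stands in for the more elaborate nested optimization that Abadie, Diamond and Hainmueller actually use (which also matches on the predictors, not only on the outcome path):

```python
import numpy as np
from itertools import product

# Pre-treatment outcome path of the treated unit (e.g. packs per capita).
treated_pre = np.array([120.0, 115.0, 108.0, 100.0])
# Pre-treatment paths of three hypothetical donor states.
donors_pre = np.array([
    [130.0, 128.0, 125.0, 124.0],
    [110.0, 104.0,  96.0,  87.0],
    [125.0, 121.0, 114.0, 106.0],
])

# Search weights on the simplex: non-negative, summing to one.
grid = np.linspace(0.0, 1.0, 101)
best_w, best_mse = None, np.inf
for w1, w2 in product(grid, grid):
    if w1 + w2 > 1.0:
        continue
    w = np.array([w1, w2, 1.0 - w1 - w2])
    synth = w @ donors_pre                      # synthetic control path
    mse = np.mean((treated_pre - synth) ** 2)   # pre-treatment fit
    if mse < best_mse:
        best_w, best_mse = w, mse

print(best_w, best_mse)
```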
01:09:58.000 --> 01:10:17.000
These are two restrictions which are not necessary to impose. One can also do without them. There are certain risks associated with doing without them, so Abadie, Diamond and Hainmueller have not lifted these restrictions, but in principle one can also do without them.
01:10:17.000 --> 01:10:33.000
So the risk, essentially, perhaps to make it intuitive, is that when the weights are negative, then we may actually include states which behave just opposite to what happens in California.
01:10:33.000 --> 01:10:40.000
And it's not really clear when this would give a good contribution to the synthetic control.
01:10:40.000 --> 01:11:03.000
And when we allow for weights which no longer sum up to one, or which perhaps are even larger than one, then we allow for extrapolation. We would not ensure anymore that the synthetic control is within the convex hull of really existing states; it can be outside of this convex hull.
01:11:03.000 --> 01:11:24.000
This also increases the possibility that grave errors occur when comparing California to the synthetic control. So Abadie, Diamond and Hainmueller have imposed these two restrictions here, and actually the result they obtained was quite interesting and possibly very informative.
01:11:24.000 --> 01:11:29.000
So people were generally satisfied with what they have done.
01:11:30.000 --> 01:11:40.000
In this table, you see the state weights in the synthetic California, and you see that most of the states actually receive zero.
01:11:40.000 --> 01:11:53.000
You see some of the states were actually not included in the so-called donor pool of possible control regions, because there were certain data issues, data not available or something like this.
01:11:53.000 --> 01:12:09.000
So wherever you see this little bar here, these have not been included at all. But all the others, I think there were 38 of them, have been included and offered as a possible contributor to the synthetic control.
01:12:09.000 --> 01:12:25.000
But the minimization procedure has decided to discard them. So the weight for many of these states is just zero, and some states have been picked to contribute to an artificial California with a certain weight.
01:12:25.000 --> 01:12:48.000
So Colorado, for instance, which if I'm not mistaken is a neighboring state, 16%; and Connecticut, which is certainly not a neighboring state, just 7%; Montana, 20%; Nevada, which is a neighboring state, 23%; and Utah, which I think is also a neighboring state, 33%.
01:12:49.000 --> 01:13:13.000
So these states seem to have some similarity with California, and taking a weighted average of them yields a per capita cigarette consumption which is, over time, very, very similar for the synthetic California to the actual California.
01:13:13.000 --> 01:13:42.000
Now the question of course is, is the causal effect which you have just briefly seen, is that significant? Or is this just a mere coincidence? So the question is, how would we get a sense of whether our inference here is actually more than just perhaps a coincidental deviation between the artificial California and the true California?
01:13:42.000 --> 01:14:03.000
So what can be done in order to assess this question is that certain so-called placebo permutation tests are applied. So basically what one does is, after having constructed the synthetic California,
01:14:04.000 --> 01:14:10.000
Abadie, Diamond and Hainmueller also construct synthetic regions for each other
01:14:12.000 --> 01:14:24.000
state in the control group. So synthetic Montana, synthetic Connecticut, synthetic Texas, synthetic Florida and so forth, using just the same method.
01:14:25.000 --> 01:14:43.000
And then again comparing for each of these states, for all of the 38 states which have sufficient data, the actual outcome and the synthetic outcome and see how far they are off from each other.
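The placebo logic just described is often summarized by comparing each unit's post-treatment to pre-treatment prediction error. A minimal sketch with hypothetical gap series (gap = actual outcome minus synthetic outcome), where unit 0 plays the role of the treated state and is given a large post-treatment gap:

```python
import numpy as np

# Hypothetical placebo gaps: 39 units, 18 pre- and 12 post-treatment years.
rng = np.random.default_rng(1)
n_states, n_pre, n_post = 39, 18, 12
gaps_pre = rng.normal(0, 2.0, (n_states, n_pre))
gaps_post = rng.normal(0, 2.0, (n_states, n_post))
gaps_post[0] -= 20.0   # unit 0 is "California": large post-treatment gap

# Root mean squared prediction error, before and after treatment.
rmspe_pre = np.sqrt(np.mean(gaps_pre ** 2, axis=1))
rmspe_post = np.sqrt(np.mean(gaps_post ** 2, axis=1))
ratio = rmspe_post / rmspe_pre

# Rank of the treated unit's ratio among all units: a rank of 1 out of 39
# would correspond to a permutation "p-value" of 1/39, roughly 0.026.
rank = int(np.sum(ratio >= ratio[0]))
print(rank, round(rank / n_states, 3))
```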
01:14:44.000 --> 01:14:49.000
And what comes out then is that the
01:14:51.000 --> 01:14:57.000
gap in per capita consumption for the control states is
01:14:58.000 --> 01:15:01.000
depicted by all those gray lines here.
01:15:02.000 --> 01:15:09.000
And very clearly California is very much at the lower range of all the predicted
01:15:10.000 --> 01:15:25.000
gaps in consumption. So if you think of this as something covering like 95% of the outcomes, then it seems to be the case that California is beyond the 95%. So it is significant at the 5% level.
01:15:26.000 --> 01:15:43.000
Right. So that is at least a hint, if not yet completely satisfactory statistical evidence, for Proposition 99 to really have reduced cigarette consumption in California.
01:15:44.000 --> 01:16:03.000
There was also a second attempt by Abadie, Diamond and Hainmueller to do this a little bit more rigorously, because there are some states here whose synthetic controls are really way off, right, and are clearly obviously bad
01:16:04.000 --> 01:16:08.000
attempts to construct a synthetic control. So they have also
01:16:09.000 --> 01:16:18.000
produced this graph here by just looking at states which had a reasonable mean squared prediction error.
01:16:19.000 --> 01:16:30.000
And by reasonable, they meant that the mean squared prediction error was smaller than two times the pre Proposition 99 mean squared prediction error for California.
01:16:30.000 --> 01:16:47.000
And in this case, it becomes even clearer that California is way below what happens in other states. So very clearly Proposition 99 has had an effect in reducing cigarette consumption in California.
01:16:47.000 --> 01:17:03.000
There is actually code available both in Stata and in R to do these kind of synthetic control methods on other questions.
01:17:03.000 --> 01:17:25.000
There are also lots of papers which have been written in recent years on how to improve on the method by Abadie, Diamond and Hainmueller, or how to, say, cover multiple treated units, or how to deal with issues of missing data along the way, which is always a problem.
01:17:25.000 --> 01:17:49.000
So lots of research is taking place here. But for you, currently, I think it suffices to get an idea of what the basic mechanism of the so-called synthetic control approach is, embodied in programs simply called synth, by Abadie, Diamond, and Hainmueller.
01:17:49.000 --> 01:18:12.000
As I say, there is code available in MATLAB, in R, and in Stata. I think I forgot to mention MATLAB before. I myself have written code in Gauss for that. So it's quite an active area of research, and many questions can be fruitfully addressed when these methods are employed.
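The core of those synth programs is the choice of donor weights. As a minimal sketch, the snippet below fits a synthetic control for one treated unit from two hypothetical donor units by a simple grid search over a single weight; the real synth packages solve a constrained quadratic program over many donors (weights non-negative and summing to one), and all data here are invented:

```python
# Minimal synthetic-control sketch with two donor units and made-up data.
# Pre-treatment outcomes for the treated unit and the two donors.
treated_pre = [10.0, 11.0, 12.0]
donor_a_pre = [8.0, 9.0, 10.0]
donor_b_pre = [14.0, 15.0, 16.0]

def mspe(w):
    """Pre-period mean squared prediction error when donor A gets
    weight w and donor B gets weight 1 - w."""
    errs = [(t - (w * a + (1 - w) * b)) ** 2
            for t, a, b in zip(treated_pre, donor_a_pre, donor_b_pre)]
    return sum(errs) / len(errs)

# Grid search over w in [0, 1]; real synth code uses quadratic programming
# over the whole simplex of donor weights instead.
best_w = min((i / 1000 for i in range(1001)), key=mspe)

# The treatment-effect estimate is the post-period gap between the treated
# outcome and its synthetic counterfactual.
treated_post = 9.0
donor_a_post = 11.0
donor_b_post = 17.0
synthetic_post = best_w * donor_a_post + (1 - best_w) * donor_b_post
gap = treated_post - synthetic_post  # negative gap: consumption fell
```

In this toy setup the pre-period fit is exact at a weight of two-thirds on donor A, and the post-period gap comes out negative, which is the pattern the California graph shows.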
01:18:13.000 --> 01:18:32.000
I conclude the lecture here. The last examples I will not cover and do not have the time to cover, but I think you have had an impression of how difference-in-differences works and how synthetic control methods work.
01:18:33.000 --> 01:18:50.000
If there are any remaining questions, we can cover them in interactive mode or here in the chat. So please, if you have further questions, possibly also relating to the exam, even though I think I have given all the information I could in recent days,
01:18:51.000 --> 01:19:07.000
please feel free to raise the questions now. And if you would like to go to interactive mode, then please also tell me, or just raise your hand. I will wait for questions to come in via chat. And here, I think, is one already.
01:19:09.000 --> 01:19:17.000
That's a long question. Okay. Regarding your email on the exams: you offered a digital oral exam in March for special cases.
01:19:17.000 --> 01:19:29.000
For those who are interested in that option, could you please provide further information on the procedure of such an oral exam, as many of us are not familiar with that? For example, will we get time to solve the exercises and present them afterwards?
01:19:30.000 --> 01:19:43.000
What would it be like? Also, what would be the duration, and would it take place on March 2? Or is another date within March possible? Thank you for your answers. Yes, that's a very legitimate question.
01:19:43.000 --> 01:20:07.000
I haven't really fixed a date for the oral exam yet, but I tend to offer it in the later part of March, so rather at the end of March. Actually, I'm flexible on when exactly to do which type of exam, because these will be single-candidate exams.
01:20:07.000 --> 01:20:29.000
And probably what I will do is inform you of a certain exam date. Let's say it's March 23, the originally scheduled date for the second written exam. I think that would be my prime candidate for holding the oral exams.
01:20:29.000 --> 01:20:50.000
And then you would have time to register for that. And if somebody comes up and says, well, we can't for some reason do it on March 23, there should be some other way, then I think we would find a solution, even though I would not be willing to, you know, hold oral exams on every single day or something like that.
01:20:50.000 --> 01:21:05.000
So I would like to have them concentrated, but in case you have good reasons, I'm also flexible about offering other exam dates. The exam itself takes about 20 minutes to half an hour.
01:21:05.000 --> 01:21:23.000
You would not get questions beforehand which you could prepare for; rather, I would present you with certain questions in the oral exam itself. It may, for instance, be that I present you with certain formulas we have covered in class.
01:21:23.000 --> 01:21:34.000
And I would ask you to explain them, explain their significance, and discuss problems which are connected with, say, certain estimators.
01:21:34.000 --> 01:21:52.000
It may be that I present you with regression output and tell you: look at this regression output, and you get two or three minutes or so to study it. Perhaps I will have highlighted something in the output to focus your attention on what is actually important in it or not.
01:21:52.000 --> 01:22:00.000
That depends on the precise question I have in mind. And then I pose my question and you'll have to give your answer.
01:22:00.000 --> 01:22:18.000
The oral exam is different from a written exam because it is more flexible and this has advantages and disadvantages for you depending on your personal taste, actually.
01:22:18.000 --> 01:22:31.000
The advantage is that if I pose a question which you have trouble understanding, you can just ask me back, right? And I can clarify.
01:22:31.000 --> 01:22:45.000
Or if the question was posed well, but you don't know the answer, then typically it's not the case that I just move on to the next question, but usually I will try to help you.
01:22:45.000 --> 01:23:07.000
And I will give you additional hints, or pose the question in a different way, or narrow the question to some extent to allow you a first grip on the answer, and then I will expand on the problem and see whether you, with some help, are able to solve the question that I have posed.
01:23:07.000 --> 01:23:22.000
Obviously, when I have to help you a lot, that somehow impacts your grade, but you would not be completely lost; rather, typically you would sooner or later find some way to solve the problem.
01:23:22.000 --> 01:23:34.000
So that, I think, is an advantage, because obviously it sometimes happens in written exams that there is a problem and you just don't really know what to do about it.
01:23:34.000 --> 01:23:40.000
Perhaps because you haven't prepared, or perhaps because you misunderstood the question, even though I think I pose questions in such a way that you can't really misunderstand them.
01:23:41.000 --> 01:24:02.000
The disadvantage, if you consider it a disadvantage, is of course that in an oral exam, if you give me an answer and I'm not really sure that you understood what you said, or understood the nature of the problem, then I will ask you again.
01:24:03.000 --> 01:24:13.000
And I can be quite persistent in pressing you for the correct answer if I'm dissatisfied with, let's say, a rather vague answer.
01:24:13.000 --> 01:24:29.000
In a written exam, sometimes people get by with vague answers, because when I grade the exam and read an answer that is not exactly the answer I intended you to give, I always have to weigh
01:24:29.000 --> 01:24:42.000
how much of the answer is correct or could have been correct; perhaps, I will then think, you meant the correct thing but just expressed it not so well.
01:24:42.000 --> 01:24:58.000
And there's some kind of judgment from my side in there, but usually, if I think the answer may have been meant correctly, then in a written exam I will rather give you the marks for it.
01:24:58.000 --> 01:25:09.000
Of course, there is some limit to that too, so you cannot just get away with something completely vague.
01:25:09.000 --> 01:25:27.000
You have to answer the question, but in borderline cases, yes, that probably works in favor of people with less precise answers more than in an oral exam, where I sometimes experience that students
01:25:27.000 --> 01:25:34.000
give an answer, and then, when I press for more details, more precision, or something like this, there is virtually nothing behind it.
01:25:34.000 --> 01:25:46.000
Then I realize: well, the candidate unfortunately hasn't really understood what the whole thing was about. And that, of course, impacts negatively on your grade.
01:25:46.000 --> 01:26:05.000
So I think in an oral exam I get a better assessment of your degree of understanding than in a written exam, which is why people who have understood things very well typically do very well in oral exams.
01:26:05.000 --> 01:26:17.000
And students who have not mastered the material so well sometimes may not do very well, and may actually flunk the oral exam.
01:26:17.000 --> 01:26:24.000
So, yeah, it depends on your degree of understanding.
01:26:24.000 --> 01:26:34.000
That, I think, was a lengthy answer to this question, but it was a lengthy question, and I hope I have
01:26:34.000 --> 01:26:37.000
answered it to your satisfaction.
01:26:37.000 --> 01:26:50.000
Then there's another question, on my time series course: can we take that course without having passed this one, given that we will be taking the exam for this course in May? Yes, there is no...
01:26:51.000 --> 01:27:02.000
I mean, it is of course good if you have mastered the content of this course, or at least the contents of basic econometrics, when you want to do time series econometrics.
01:27:02.000 --> 01:27:20.000
Actually, not everything which I did in this course would be necessary to understand what time series is all about. Basically, everything I did after the basic econometrics material would not be relevant for the time series course anyway.
01:27:20.000 --> 01:27:42.000
But you should know the basic econometrics material; this is in some sense a precondition by content, but not a formal one. So yes, you can take the time series course even if you have not taken the exam for this course, or even if you have not passed it, provided that you feel sufficiently
01:27:43.000 --> 01:27:51.000
interested and sufficiently well equipped in terms of knowledge to go into a time series class.
01:27:55.000 --> 01:27:58.000
All right, are there any other questions?
01:27:58.000 --> 01:28:14.000
Apparently not. Then I wish you all very well for the semester break, and of course for the exam when it comes up.
01:28:14.000 --> 01:28:30.000
I suppose most of you will not need to take the oral exam, and as I said, it is designed only for those of you who have a particular reason not to take the written exam.
01:28:30.000 --> 01:28:45.000
That is actually not my decision; that was the decision of the program director. I suggested to him that I could also offer a written or an oral exam to all of you, on May 23 and the second written exam then.
01:28:45.000 --> 01:29:11.000
So March 23, of course, and then a written exam in May as the second exam date, but he was not so happy with that. So the oral exam is basically just for special cases, which typically applies only to people who repeat this course and are already further progressed in their studies.
01:29:11.000 --> 01:29:40.000
There's another question coming in: what options do I have if I am not in Hamburg for the written exam? Well, in general, I would expect that if you study in Hamburg, you are also in Hamburg, but if you have compelling reasons not to be here, like, for instance, the pandemic is still going on,
01:29:41.000 --> 01:29:49.000
there's quarantine, you don't get a visa, or something like this, then we will also find a digital solution.
01:29:49.000 --> 01:29:55.000
Okay, no more questions.
01:29:55.000 --> 01:30:13.000
Then, once again, all the best for the semester break. Good luck for the final exam, and I would be happy to see some or all of you in the time series class if you take an interest in time series. That's it for today and for the semester, and
01:30:13.000 --> 01:30:15.000
all the best.