WEBVTT - autoGenerated
00:00:00.000 --> 00:00:15.000
How would you estimate how many banknotes of a certain kind are actually circulating
00:00:15.000 --> 00:00:17.000
in the population?
00:00:17.000 --> 00:00:23.000
You print some, some disappear, some go under people's beds.
00:00:23.000 --> 00:00:27.000
This is something that banks actually do, I believe.
00:00:27.000 --> 00:00:34.000
It's certainly a technique used by biologists to estimate the size of a population.
00:00:34.000 --> 00:00:38.000
The idea is easy.
00:00:38.000 --> 00:00:45.000
At your bank branch for one day, you collect a thousand notes or you watch a thousand notes
00:00:45.000 --> 00:00:47.000
go by.
00:00:47.000 --> 00:00:54.000
As they go past, you put a label on a thousand notes.
00:00:54.000 --> 00:00:59.000
So these thousand notes, I write my name on them.
00:00:59.000 --> 00:01:07.000
Over the next few days, when notes are being changed in the bank, these notes go out to
00:01:07.000 --> 00:01:09.000
the population.
00:01:09.000 --> 00:01:15.000
So somewhere I have a big population and some fraction of the notes have got my name written
00:01:15.000 --> 00:01:16.000
on them.
00:01:16.000 --> 00:01:22.000
I let this circulate for a week and then I do a new sampling.
00:01:22.000 --> 00:01:28.000
So I watch another thousand notes go by in my branch of the bank.
00:01:28.000 --> 00:01:35.000
And from a thousand notes, I say how many have got the label on them?
00:01:35.000 --> 00:01:41.000
And maybe I find a number like: from a thousand notes, 20 are marked.
00:01:41.000 --> 00:01:44.000
So how big is the population?
00:01:44.000 --> 00:01:47.000
How big is the number of circulating notes?
00:01:47.000 --> 00:01:53.000
Well, you should be able to guess it. I have a sample
00:01:53.000 --> 00:02:01.000
of a thousand notes, of which 20 in a thousand, one in fifty, have been labeled.
00:02:01.000 --> 00:02:11.000
So the original population should be a thousand over 20 times the size of my sample.
00:02:11.000 --> 00:02:19.000
So that's 10 to the fifth divided by two, which is 50,000 notes.
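As a quick sketch of that arithmetic (the variable names here are mine, not from the slides):

```r
# Mark-recapture estimate: the fraction of marked notes in the
# second sample should match the fraction in the whole population.
n_marked <- 1000     # notes we labelled at the branch
n_resampled <- 1000  # notes we watched a week later
n_found <- 20        # labelled notes among them
n_estimate <- n_marked * n_resampled / n_found
n_estimate           # 50000
```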
00:02:19.000 --> 00:02:23.000
OK, that is sort of correct.
00:02:23.000 --> 00:02:34.000
My maximum likelihood estimate of the number of notes in circulation is 50,000.
00:02:34.000 --> 00:02:36.000
But how confident can I be of that?
00:02:36.000 --> 00:02:40.000
What do you think my error margins are?
00:02:40.000 --> 00:02:42.000
Good.
00:02:42.000 --> 00:02:45.000
50,000 is your intuitive guess.
00:02:45.000 --> 00:02:49.000
But remember, this is just done by sampling.
00:02:49.000 --> 00:02:56.000
Is it possible that actually only 45,000 are in circulation and how certain are you?
00:02:56.000 --> 00:03:00.000
This we can answer doing a Bayesian simulation.
00:03:00.000 --> 00:03:06.000
And then what we're going to do is make the simulation better and better with additional
00:03:06.000 --> 00:03:07.000
information.
00:03:07.000 --> 00:03:10.000
So we use Bayes rule.
00:03:10.000 --> 00:03:20.000
We say our hypothesis is that we have 50,000 notes and then we're going to say what's the
00:03:20.000 --> 00:03:32.000
probability of this hypothesis given our observation of 20 in a thousand notes being marked.
00:03:32.000 --> 00:03:39.000
So our prior hypothesis is naive: the number of notes is more than 1,000.
00:03:39.000 --> 00:03:43.000
That's a so-called flat prior.
00:03:43.000 --> 00:03:46.000
And it's not very informative.
00:03:46.000 --> 00:03:52.000
We're going to neglect this part down here, the denominator with the probability of the data, in this example.
00:03:52.000 --> 00:03:59.000
This is a fun thing you are allowed to do when you are looking at relative probabilities.
00:03:59.000 --> 00:04:08.000
So our hypothesis is just we have some number of notes out there, more than 1,000.
00:04:08.000 --> 00:04:17.000
And then we're going to look at the probability of our data given various hypotheses.
00:04:17.000 --> 00:04:25.000
And H, our hypothesis might be that we have 50,000 notes or 55,000 notes.
00:04:25.000 --> 00:04:27.000
But our data is something fixed.
00:04:27.000 --> 00:04:34.000
The data is that we sampled 1,000 notes and found 20 to be marked.
00:04:34.000 --> 00:04:45.000
So the mission will be to calculate the probability of our data given different hypotheses.
00:04:45.000 --> 00:04:48.000
So various values of the number of notes.
00:04:48.000 --> 00:04:55.000
So we're going to come out with a probability that you have 40,000 notes, 45,000 notes,
00:04:55.000 --> 00:04:58.000
and so on.
00:04:58.000 --> 00:05:05.000
The term that we're going to calculate is this probability of data for each hypothesis
00:05:05.000 --> 00:05:09.000
for various values of the number of notes.
00:05:09.000 --> 00:05:17.000
So in other words, if I have 40,000 notes, what's the probability or how often do I see
00:05:17.000 --> 00:05:21.000
20 from 1,000 marked notes?
00:05:21.000 --> 00:05:26.000
Or if I had 42,000, how often would I see this?
00:05:26.000 --> 00:05:35.000
The way to do it is to use a built-in R function, the rep function.
00:05:35.000 --> 00:05:41.000
And we will make a pool of notes and then do sampling from it.
00:05:41.000 --> 00:05:47.000
So first of all, to get the pool of notes, we use the rep function.
00:05:47.000 --> 00:05:52.000
And what this says is, remember, c() gives us a vector back.
00:05:52.000 --> 00:05:59.000
So this is a vector here with a 1 and a 0 in it, a two element vector.
00:05:59.000 --> 00:06:05.000
The next thing here is another vector with the counts: first the number of marked notes.
00:06:05.000 --> 00:06:13.000
So maybe that's 1,000, which is much smaller than the total. And then the number
00:06:13.000 --> 00:06:19.000
of notes minus the number of marked notes.
00:06:19.000 --> 00:06:23.000
So that might be 40,000 minus 1,000 over there.
00:06:23.000 --> 00:06:34.000
So we have, we're asking to repeatedly sample, but with these probabilities here.
00:06:34.000 --> 00:06:38.000
That's in the second argument to rep.
00:06:38.000 --> 00:06:45.000
And we will say, for the sake of programming, that a marked note is a 1, and this is in
00:06:45.000 --> 00:06:49.000
the same order here, and an unmarked note is a 0.
00:06:49.000 --> 00:06:56.000
So at the end of this one line here, we're going to have a very long array in pool of
00:06:56.000 --> 00:06:59.000
notes.
00:06:59.000 --> 00:07:07.000
Maybe this array has 1,000 entries with 1 and 40,000 entries with a 0.
00:07:07.000 --> 00:07:11.000
So this is actually simulating our pool of notes.
00:07:11.000 --> 00:07:16.000
This is the real number inside the population.
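A minimal sketch of that line, with names guessed from the narration (the slide's actual code may differ):

```r
# rep(values, counts): repeat 1 n_marked times and 0
# (n_notes - n_marked) times, giving the simulated pool of notes.
n_notes <- 40000   # one hypothesis for the population size
n_marked <- 1000   # notes we labelled
pool_of_notes <- rep(c(1, 0), c(n_marked, n_notes - n_marked))
length(pool_of_notes)  # 40000 notes in total
sum(pool_of_notes)     # 1000 of them marked
```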
00:07:16.000 --> 00:07:21.000
And what we're going to do is take samples from this pool of notes.
00:07:21.000 --> 00:07:25.000
So pool of notes is, if you like, the truth.
00:07:25.000 --> 00:07:30.000
It's really the population, the notes circulating.
00:07:30.000 --> 00:07:36.000
For a given pool of notes, how often do we see these results?
00:07:36.000 --> 00:07:41.000
A single sample, in this case, we'll write a little function for that.
00:07:41.000 --> 00:07:50.000
Single sample is a function of this array of 0s and 1s, and a number of resamplings.
00:07:50.000 --> 00:07:57.000
And we'll say we will use another built-in R function called sample.
00:07:57.000 --> 00:08:07.000
So sample will go to our array and simply give back, picking elements at
00:08:07.000 --> 00:08:11.000
random, as many elements as we ask for in the resampling.
00:08:11.000 --> 00:08:14.000
So if you like, remember we marked the notes.
00:08:14.000 --> 00:08:18.000
Then we mixed them up with the rest of the population.
00:08:18.000 --> 00:08:22.000
And then we resampled 1,000 down here.
00:08:22.000 --> 00:08:25.000
And that's what this n resampled is.
00:08:25.000 --> 00:08:32.000
So sample draws elements from an array, this one here, this many times.
00:08:32.000 --> 00:08:41.000
So sample, the built-in function, gives us a vector back that we've called from sampling.
00:08:41.000 --> 00:08:49.000
And then we return, because we had marked notes as a 1, we can call the sum function.
00:08:49.000 --> 00:08:56.000
And that tells us how many notes in our sample were marked.
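That little function might look like this; it's a sketch, since the transcript doesn't show the exact argument names:

```r
# Draw n_resample notes from the pool without replacement and
# count the marked ones; marked notes are 1s, so sum() counts them.
single_sample <- function(pool, n_resample) {
  from_sampling <- sample(pool, n_resample)
  sum(from_sampling)
}

pool <- rep(c(1, 0), c(1000, 49000))  # 50,000 notes, 1,000 marked
single_sample(pool, 1000)             # something near 20
```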
00:08:56.000 --> 00:09:08.000
So make sure that is clear before you top up your gin and tonic and go to the next slide.
00:09:08.000 --> 00:09:12.000
I would suggest you follow very closely.
00:09:12.000 --> 00:09:17.000
We've got four lines of code here that do all of the work.
00:09:17.000 --> 00:09:21.000
So really R is incredibly compact.
00:09:21.000 --> 00:09:27.000
Let me start from the third line with a reminder.
00:09:27.000 --> 00:09:34.000
Single sample is the function that we made on the previous slide.
00:09:34.000 --> 00:09:45.000
So single sample takes our big array of notes and the number of resampled notes.
00:09:45.000 --> 00:09:46.000
Okay.
00:09:46.000 --> 00:09:47.000
So what's happening here?
00:09:47.000 --> 00:09:51.000
We're building a function called sample the notes.
00:09:51.000 --> 00:09:57.000
It takes the number of notes that we're working with, how many have we marked, that was just
00:09:57.000 --> 00:10:06.000
20, a number of trials for repetitions and the number that we are resampling.
00:10:06.000 --> 00:10:10.000
The first line here was also on the previous slide.
00:10:10.000 --> 00:10:16.000
This line makes our pool of notes.
00:10:16.000 --> 00:10:20.000
And then all the work happens in this line here.
00:10:20.000 --> 00:10:31.000
We have a function that we built called single sample, which does a resampling from the population.
00:10:31.000 --> 00:10:40.000
We're going to do that sampling 100 or 1,000 times, whatever the number of trials is.
00:10:40.000 --> 00:10:47.000
And the replicate function, which is built into R, works like this: you say how
00:10:47.000 --> 00:10:53.000
many times you want to replicate, that is, how many trials you want to do,
00:10:53.000 --> 00:10:56.000
and what function you want to call.
00:10:56.000 --> 00:11:07.000
So replicate calls single sample that many times and just puts the results into a vector.
00:11:07.000 --> 00:11:19.000
So the result that we're going to get is: from the population that we have created,
00:11:19.000 --> 00:11:28.000
we sample from that population, we do it a number of times, and for each trial we see how many
00:11:28.000 --> 00:11:33.000
marked notes we actually found in the sample.
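Putting those pieces together, the few lines of code might look like this; the function and argument names are my reconstruction from the narration:

```r
# One re-sampling: count the marked notes (1s) in a random draw.
single_sample <- function(pool, n_resample) {
  sum(sample(pool, n_resample))
}

# Whole simulation for one hypothesis: build the pool, then
# repeat the re-sampling n_trials times; replicate collects
# the results into a vector.
sample_the_notes <- function(n_notes, n_marked, n_trials, n_resample) {
  pool_of_notes <- rep(c(1, 0), c(n_marked, n_notes - n_marked))
  replicate(n_trials, single_sample(pool_of_notes, n_resample))
}

a <- sample_the_notes(50000, 1000, 1000, 1000)
head(a)   # marked-note counts, one per trial
```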
00:11:33.000 --> 00:11:38.000
So to summarize it, we've generated the complete set of notes in the population and then we
00:11:38.000 --> 00:11:43.000
take a sample from this pool of notes.
00:11:43.000 --> 00:11:51.000
And we've done all of this work just for one size of the pool, for 40,000 notes or 50,000
00:11:51.000 --> 00:11:52.000
notes.
00:11:52.000 --> 00:11:54.000
And that was our hypothesis.
00:11:54.000 --> 00:12:03.000
So now we need a loop to try out these different hypotheses.
00:12:10.000 --> 00:12:19.000
So remember from the sampling on my first picture we have found 20 notes which are marked.
00:12:19.000 --> 00:12:23.000
So that's a constant, if you like.
00:12:23.000 --> 00:12:30.000
What we now need to do is set up a series of hypotheses to be tested.
00:12:30.000 --> 00:12:41.000
So one hypothesis is that we have 20,000 notes, the next is that we have 25,000 notes,
00:12:41.000 --> 00:12:43.000
and so on.
00:12:43.000 --> 00:12:45.000
So we'll put that into an array.
00:12:45.000 --> 00:12:51.000
We use the seq function here, which gives us a vector back.
00:12:51.000 --> 00:13:02.000
And my vector will be numbers starting from 20,000 going to a maximum of 120,000 and in
00:13:02.000 --> 00:13:06.000
steps of 5,000.
00:13:06.000 --> 00:13:13.000
And now a loop over the values in this vector h here.
00:13:13.000 --> 00:13:21.000
So for i in the elements of h, I do the sampling.
00:13:21.000 --> 00:13:25.000
So here's my number of marked notes, my number of trials and so on.
00:13:25.000 --> 00:13:28.000
The number I resample.
00:13:28.000 --> 00:13:33.000
Then I'm checking how often was my hypothesis true.
00:13:33.000 --> 00:13:38.000
So I've put the results of the sampling into the vector a.
00:13:38.000 --> 00:13:51.000
I use R's indexing to say: give me the elements of a in which a is 20, the number up here.
00:13:51.000 --> 00:13:58.000
So this expression here is a smaller vector and I ask how long is this vector.
00:13:58.000 --> 00:14:04.000
So that is the number of times my hypothesis was true.
00:14:04.000 --> 00:14:10.000
And then I add this into my success array.
00:14:10.000 --> 00:14:17.000
So success, this is simple but critical to the next slide.
00:14:17.000 --> 00:14:26.000
For every element in h, I have a success count.
00:14:26.000 --> 00:14:34.000
And now I can just plot out the elements in success.
00:14:34.000 --> 00:14:45.000
So this really is my success array plotted out as a histogram.
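The loop just described could be sketched like this (h, success, and the plotting details are my reconstruction, not the slide's exact code):

```r
# Helpers from the earlier slides, reconstructed.
single_sample <- function(pool, n_resample) sum(sample(pool, n_resample))
sample_the_notes <- function(n_notes, n_marked, n_trials, n_resample) {
  pool <- rep(c(1, 0), c(n_marked, n_notes - n_marked))
  replicate(n_trials, single_sample(pool, n_resample))
}

h <- seq(20000, 120000, by = 5000)  # hypotheses to test
success <- numeric(length(h))
for (j in seq_along(h)) {
  a <- sample_the_notes(h[j], n_marked = 1000,
                        n_trials = 1000, n_resample = 1000)
  success[j] <- length(a[a == 20])  # trials reproducing the data
}
barplot(success, names.arg = h / 1000,
        xlab = "hypothesis (thousands of notes)",
        ylab = "successful trials")
```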
00:14:45.000 --> 00:14:50.000
So on the x axis, I have my hypothesis.
00:14:50.000 --> 00:14:58.000
The hypothesis here is that I have 50,000 notes or here I have 55,000 notes.
00:14:58.000 --> 00:15:05.000
And our original guess here is very likely to be correct or better said, it is the most
00:15:05.000 --> 00:15:07.000
likely to be correct.
00:15:07.000 --> 00:15:09.000
Have we learned anything?
00:15:09.000 --> 00:15:15.000
Yes, we have got uncertainty in our estimates.
00:15:15.000 --> 00:15:25.000
And to use the proper words, 50,000, the number you calculated is the maximum likelihood estimate.
00:15:25.000 --> 00:15:33.000
It's not actually the median estimate because this distribution is not symmetric.
00:15:33.000 --> 00:15:37.000
The median is a little bit off to the right.
00:15:37.000 --> 00:15:44.000
But we can, because we have hard numbers here, we can simply walk along the array and get
00:15:44.000 --> 00:15:48.000
out a 50% confidence interval.
00:15:48.000 --> 00:15:56.000
So I would say with a 50% probability, the estimate of the number of notes in circulation
00:15:56.000 --> 00:15:59.000
is in this range here.
00:15:59.000 --> 00:16:10.000
It might be 50,000, but it is almost as likely to be 45,000 or 55,000 in circulation.
00:16:10.000 --> 00:16:17.000
And a number like 30,000 is very unlikely; the probability of 30,000 or less would be the
00:16:17.000 --> 00:16:21.000
sum of these blocks here.
00:16:21.000 --> 00:16:30.000
So given this numerical simulation, I can give a very good statistical description.
00:16:30.000 --> 00:16:34.000
How hard was it to do this?
00:16:34.000 --> 00:16:38.000
Look at basically the program that does this.
00:16:38.000 --> 00:16:44.000
The lines that actually do the work are marked in yellow to make them look dramatic.
00:16:44.000 --> 00:16:51.000
And for all of this work here, it's basically a little bit of sampling here, a bit of sampling
00:16:51.000 --> 00:16:57.000
here and repetition, and then a very quick plot command.
00:16:57.000 --> 00:17:00.000
And that's all you need to get this.
00:17:00.000 --> 00:17:08.000
But now we are slowly going to make this more and more sophisticated.
00:17:08.000 --> 00:17:12.000
Let's add some information into the model.
00:17:12.000 --> 00:17:18.000
We started with pretty much no background hypothesis.
00:17:18.000 --> 00:17:22.000
I just said the number of notes is more than 1,000.
00:17:22.000 --> 00:17:25.000
That is not very informative.
00:17:25.000 --> 00:17:29.000
Let's ask the central bank what they know about this.
00:17:29.000 --> 00:17:36.000
And they say they never printed more than 70,000 notes.
00:17:36.000 --> 00:17:44.000
So what we can do is change our background estimate, our prior probability as it's called,
00:17:44.000 --> 00:17:54.000
to say that the probability of a hypothesis is zero if the hypothesis is that we have
00:17:54.000 --> 00:17:59.000
more than 70,000 notes.
00:17:59.000 --> 00:18:01.000
So that's totally reasonable.
00:18:01.000 --> 00:18:02.000
It's a bit of prior knowledge.
00:18:02.000 --> 00:18:06.000
And how do we add it in?
00:18:07.000 --> 00:18:13.000
We still loop over all the different hypotheses from 20,000 to 120,000.
00:18:13.000 --> 00:18:24.000
We do all of the sampling as before, but we're going to change our estimate of success, if
00:18:24.000 --> 00:18:26.000
you like.
00:18:26.000 --> 00:18:36.000
What we'll do is say for looping over all the possible hypotheses, if i is less than
00:18:36.000 --> 00:18:44.000
70,000, we just do the same sampling and the same check for success.
00:18:44.000 --> 00:18:56.000
But if i is greater than 70,000, we say the likelihood of success is zero.
00:18:56.000 --> 00:18:59.000
So we're just truncating our distribution.
00:18:59.000 --> 00:19:03.000
We don't have to change anything else in the code.
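With the cutoff, the loop changes in only one place; again a sketch with my own variable names:

```r
# Helpers as before.
single_sample <- function(pool, n_resample) sum(sample(pool, n_resample))
sample_the_notes <- function(n_notes, n_marked, n_trials, n_resample) {
  pool <- rep(c(1, 0), c(n_marked, n_notes - n_marked))
  replicate(n_trials, single_sample(pool, n_resample))
}

h <- seq(20000, 120000, by = 5000)
success <- numeric(length(h))
for (j in seq_along(h)) {
  if (h[j] > 70000) {
    success[j] <- 0   # prior: the bank never printed this many
  } else {
    a <- sample_the_notes(h[j], 1000, 1000, 1000)
    success[j] <- length(a[a == 20])
  }
}
```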
00:19:03.000 --> 00:19:10.000
And in fact, the maximum likelihood estimate stays at 50,000.
00:19:10.000 --> 00:19:11.000
That hasn't changed.
00:19:11.000 --> 00:19:18.000
What's changed are our confidence intervals and our estimates of probabilities.
00:19:18.000 --> 00:19:25.000
The median estimate has shifted to the left because we have lost all the contributions
00:19:25.000 --> 00:19:26.000
from here.
00:19:26.000 --> 00:19:31.000
So we can adjust our estimates.
00:19:31.000 --> 00:19:34.000
We can look at the cumulative probabilities.
00:19:34.000 --> 00:19:38.000
The first version is marked in blue with these dots here.
00:19:38.000 --> 00:19:48.000
And I've marked in for each estimated or each hypothesis here, I've got the cumulative probability.
00:19:48.000 --> 00:19:59.000
So for the first version, which does not account for the number of printed notes, we
00:19:59.000 --> 00:20:04.000
don't reach a probability of nearly one until somewhere here.
00:20:04.000 --> 00:20:11.000
When we account for the number of notes that were printed, we've accounted for all the
00:20:11.000 --> 00:20:15.000
probability when we reached 70,000.
00:20:15.000 --> 00:20:23.000
So previously, I said that from our first estimate, from this blue line here, now you
00:20:23.000 --> 00:20:33.000
see why I use cumulative probability, I said that there's about a 75% chance that the number
00:20:33.000 --> 00:20:37.000
of notes is less than 60,000.
00:20:37.000 --> 00:20:42.000
That comes from looking at these blue points here.
00:20:42.000 --> 00:20:52.000
When I put the cutoff at 70,000, I can now say that the cumulative probability of 60,000
00:20:52.000 --> 00:20:59.000
notes this point here is about 85%.
00:20:59.000 --> 00:21:05.000
So to put that in words: at the start, I could say there's a 75% chance that the number of
00:21:05.000 --> 00:21:08.000
notes is less than 60,000.
00:21:08.000 --> 00:21:15.000
Now with a bit more knowledge, I say it's about 85% chance that the number of notes
00:21:15.000 --> 00:21:18.000
is less than 60,000.
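The cumulative probabilities come from normalising the success counts and summing them up; here is a sketch of the untruncated version (exact numbers wobble from run to run):

```r
# Helpers as before.
single_sample <- function(pool, n_resample) sum(sample(pool, n_resample))
sample_the_notes <- function(n_notes, n_marked, n_trials, n_resample) {
  pool <- rep(c(1, 0), c(n_marked, n_notes - n_marked))
  replicate(n_trials, single_sample(pool, n_resample))
}

h <- seq(20000, 120000, by = 5000)
success <- sapply(h, function(n)
  sum(sample_the_notes(n, 1000, 1000, 1000) == 20))

# Normalise and accumulate: cum_p[j] approximates P(N <= h[j] | data).
cum_p <- cumsum(success) / sum(success)
cum_p[h == 60000]   # roughly the 75% quoted in the lecture
```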
00:21:18.000 --> 00:21:25.000
So in the textbooks, or if you are lucky enough to have lunch with a Bayesian statistician,
00:21:25.000 --> 00:21:37.000
they would say what I have done is use a more informative prior for my background distribution.
00:21:37.000 --> 00:21:45.000
Could we make this P(H), the prior probability of the hypothesis, even better?
00:21:45.000 --> 00:21:51.000
I wouldn't ask if I didn't know the answer was yes.
00:21:51.000 --> 00:21:57.000
Let's now use a better prior distribution.
00:21:57.000 --> 00:22:04.000
We'll ask 200 bankers how many notes they think are in circulation and average their
00:22:04.000 --> 00:22:06.000
estimates.
00:22:06.000 --> 00:22:12.000
Now I know that I can get rid of guesses that are more than 70,000.
00:22:12.000 --> 00:22:21.000
From talking to bankers: on average, they say there are 60,000 notes in circulation.
00:22:21.000 --> 00:22:25.000
So this is a little bit of additional information.
00:22:25.000 --> 00:22:31.000
And you could say that you don't know how good this information is.
00:22:31.000 --> 00:22:34.000
Do you trust bankers?
00:22:34.000 --> 00:22:40.000
Probably not, given the Cum-Ex scandal or Wirecard going broke.
00:22:40.000 --> 00:22:45.000
But what we're going to say is we trust bankers a bit.
00:22:45.000 --> 00:22:50.000
And we're even going to put in a parameter for how much do you trust bankers?
00:22:50.000 --> 00:22:59.000
Do note that the bankers estimate, 60,000, is not the same as our maximum likelihood
00:22:59.000 --> 00:23:01.000
estimate of 50,000.
00:23:01.000 --> 00:23:06.000
So let's say we do not trust bankers very much.
00:23:06.000 --> 00:23:11.000
So we will include the information in a rather vague way.
00:23:11.000 --> 00:23:22.000
We will use a normal distribution with a mean at 60,000, but a standard deviation of 50,000.
00:23:22.000 --> 00:23:26.000
And that's what I have drawn here.
00:23:26.000 --> 00:23:34.000
You could say that sigma, the standard deviation or the width of this distribution, is a measure
00:23:34.000 --> 00:23:38.000
of the distrust in bankers.
00:23:38.000 --> 00:23:41.000
You might also say this is ad hoc.
00:23:41.000 --> 00:23:43.000
And in a sense, it is.
00:23:43.000 --> 00:23:51.000
This is why people sometimes criticize the tricks done by Bayesian statisticians.
00:23:51.000 --> 00:23:56.000
But now let's build this into our background hypothesis.
00:23:56.000 --> 00:24:00.000
So we're going to say we have the information from the bankers.
00:24:00.000 --> 00:24:03.000
We don't trust it very much.
00:24:03.000 --> 00:24:06.000
And we have a cutoff at 70,000.
00:24:06.000 --> 00:24:14.000
So we're going to cut off this distribution here at 70,000.
00:24:14.000 --> 00:24:19.000
So our background distribution will be, we'll call it a prior.
00:24:19.000 --> 00:24:23.000
That's the word for the information we have beforehand.
00:24:23.000 --> 00:24:30.000
Prior will be a function which takes a number, the mean of our distribution and the standard
00:24:30.000 --> 00:24:33.000
deviation.
00:24:33.000 --> 00:24:39.000
If the number we're looking at is bigger than 70,000, return zero.
00:24:39.000 --> 00:24:43.000
That's the probability here.
00:24:43.000 --> 00:24:53.000
If the number is 70,000 or less, then we return a value from a normal distribution
00:24:53.000 --> 00:25:02.000
centered at 60,000, the estimate
00:25:02.000 --> 00:25:08.000
from the bankers, with a standard deviation of 50,000.
00:25:08.000 --> 00:25:17.000
And we ask, what is the value of a Gaussian distribution at this value x here?
00:25:17.000 --> 00:25:23.000
So we're going to modify our probabilities from a normal distribution.
00:25:23.000 --> 00:25:30.000
And dnorm is the standard built-in function for this.
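The prior function as described might be sketched like this; the name prior and the argument layout are my guesses:

```r
# Prior belief about the number of notes: zero above the 70,000
# printing limit, otherwise a wide Gaussian around the bankers'
# average of 60,000; sd = 50,000 encodes our distrust of bankers.
prior <- function(x, mean = 60000, sd = 50000) {
  if (x > 70000) {
    return(0)          # more notes than were ever printed
  }
  dnorm(x, mean = mean, sd = sd)
}

prior(80000)                 # 0
prior(60000) > prior(20000)  # TRUE: closer to the bankers' guess
```

Presumably each success count is then weighted by prior(h[j]) before normalising, which is the "modify our probabilities" step.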
00:25:30.000 --> 00:25:36.000
So let's compare the results before and afterwards.
00:25:36.000 --> 00:25:43.000
Previously, with a very naive prior, we had these probabilities here.
00:25:43.000 --> 00:25:50.000
With a more informative prior that's chopped off at 70,000 and includes the banker's best
00:25:50.000 --> 00:25:59.000
estimates, you would now say the most likely value is somewhere around here.
00:25:59.000 --> 00:26:06.000
And values down here and above 70,000 are either zero probability over here or very
00:26:06.000 --> 00:26:09.000
low probabilities over here.
00:26:09.000 --> 00:26:16.000
So we've gone through the process of having a very naive prior, introducing a bit more
00:26:16.000 --> 00:26:20.000
information and then more information.
00:26:20.000 --> 00:26:27.000
And what you can at least say is that given the banker's information, our mean value has
00:26:27.000 --> 00:26:36.000
shifted to the right and the uncertainty has become smaller.
00:26:36.000 --> 00:26:42.000
So to summarize all of this, we had our first version.
00:26:42.000 --> 00:26:53.000
We have an improved background or prior information, this truncated Gaussian, and that gives us
00:26:53.000 --> 00:27:00.000
a better estimate given this more informative prior.
00:27:00.000 --> 00:27:02.000
Quite neat how you can do this.
00:27:02.000 --> 00:27:08.000
And it's also a nice example of something that started out as something you could do
00:27:08.000 --> 00:27:11.000
analytically.
00:27:11.000 --> 00:27:17.000
You actually could, but once I start to put in wacky priors like this one, then it becomes
00:27:17.000 --> 00:27:23.000
much easier just to do a brute force simulation.
00:27:23.000 --> 00:27:24.000
Was it ad hoc?
00:27:24.000 --> 00:27:27.000
Yeah, in some ways it was.
00:27:27.000 --> 00:27:36.000
Adding in this prior information, the limit of 70,000, that's not so ad hoc; but the distribution
00:27:36.000 --> 00:27:41.000
based on the banker's estimates, that was ad hoc.
00:27:41.000 --> 00:27:50.000
I cannot promise you that my estimate of sigma, the standard deviation over here, I
00:27:50.000 --> 00:27:55.000
cannot promise that this is actually in any way correct.
00:27:55.000 --> 00:27:57.000
It's just my best guess.
00:27:57.000 --> 00:28:03.000
And this is always the philosophical problem with Bayesian statistics.
00:28:03.000 --> 00:28:06.000
How do you encode your prior knowledge?
00:28:06.000 --> 00:28:12.000
Here I did it with a very fat Gaussian function.
00:28:12.000 --> 00:28:17.000
It's always something you can debate, and it's always something you should know if you
00:28:17.000 --> 00:28:23.000
are judging somebody's statistical estimates.
00:28:23.000 --> 00:28:30.000
So let's summarize all of this and what we have got using various applications of Bayes
00:28:30.000 --> 00:28:32.000
theorem.
00:28:32.000 --> 00:28:39.000
So in my cancer estimate, my hypothesis was you had cancer.
00:28:39.000 --> 00:28:47.000
I had a prior hypothesis that you have a one in a thousand chance of cancer.
00:28:47.000 --> 00:28:55.000
With my additional knowledge of a positive diagnostic, I then said you have a 2% chance
00:28:55.000 --> 00:28:57.000
of having cancer.
00:28:57.000 --> 00:29:05.000
In my example of components from a factory, my hypothesis was the component came from
00:29:05.000 --> 00:29:16.000
factory A. At the start, without any knowledge, we said empirically we know that 5% of the
00:29:16.000 --> 00:29:25.000
components come from factory A. With my extra knowledge of a diagnostic, that the part is
00:29:25.000 --> 00:29:32.000
broken, we had a 44% chance that the component came from factory A.
00:29:32.000 --> 00:29:39.000
And then we had the more interesting example where we have a distribution of hypotheses.
00:29:39.000 --> 00:29:45.000
So my example with money, we tried out many different hypotheses.
00:29:45.000 --> 00:29:53.000
At the start, we simply had a flat distribution for the prior probability, totally uninformative.
00:29:53.000 --> 00:30:05.000
But with the additional information, we had much narrower distributions for the hypothesis
00:30:05.000 --> 00:30:11.000
given the data that we found 20 marked notes in the population.
00:30:11.000 --> 00:30:23.000
So as I add in information, I'm usually changing the probability of A in Bayes formula.
00:30:23.000 --> 00:30:31.000
So what one sees is that with Bayesian statistics, the combination of Bayes' rule, prior information,
00:30:31.000 --> 00:30:37.000
and sampling lets you avoid understanding serious statistics.
00:30:37.000 --> 00:30:41.000
It lets us just build in a very ad hoc prior.
00:30:41.000 --> 00:30:48.000
And it lets us combine probabilities that would be very hard to do analytically.
00:30:48.000 --> 00:30:52.000
Don't go home and try to work with a truncated Gaussian.
00:30:52.000 --> 00:30:54.000
It's not much fun.
00:30:54.000 --> 00:31:01.000
So we can estimate errors and means and percentiles in our final sampling.
00:31:01.000 --> 00:31:08.000
And we have this formal description and rule for including these well-defined contributions
00:31:08.000 --> 00:31:16.000
like a limit of 70,000, but also adding in ad hoc ideas like expert opinions.
00:31:16.000 --> 00:31:26.000
If we can encode it as some kind of formula, we can build it in. From a programming point of view, and I guess
00:31:26.000 --> 00:31:35.000
this is at the core of it, notice that we used no loops, or very few loops.
00:31:35.000 --> 00:31:37.000
Everything else was implicit.
00:31:37.000 --> 00:31:45.000
So in this rep function that generates our pool, there is obviously a very big
00:31:45.000 --> 00:31:51.000
loop over these components here, but that's all hidden away.
00:31:51.000 --> 00:32:01.000
We just asked R to give us back an array according to probabilities or numbers taken from these
00:32:01.000 --> 00:32:03.000
two values here.
00:32:03.000 --> 00:32:11.000
So we just asked R to give us back ones and zeros according to these numbers of marked and
00:32:11.000 --> 00:32:13.000
unmarked notes.
00:32:13.000 --> 00:32:18.000
Similarly, our sampling has an implicit loop.
00:32:18.000 --> 00:32:27.000
So we had this number of resamplings here taken from this pool of notes.
00:32:27.000 --> 00:32:30.000
But we didn't write any of the loops explicitly.
00:32:30.000 --> 00:32:36.000
So R goes into hand-optimized code to do this kind of thing very, very quickly.
00:32:36.000 --> 00:32:46.000
And when I asked R to do replicate, that's now looping, for the number of trials, over this function
00:32:46.000 --> 00:32:47.000
call.
00:32:47.000 --> 00:32:53.000
So R is free to do that in as fast a manner as it can.
00:32:53.000 --> 00:33:03.000
So it runs single sample n trials times and returns all the results in this vector.
00:33:03.000 --> 00:33:07.000
But we never wrote the loop explicitly.
00:33:07.000 --> 00:33:17.000
And the example for this lecture either will be in GitLab or is in GitLab by the time you
00:33:17.000 --> 00:33:18.000
watch this video.