WEBVTT - autoGenerated
00:00:00.000 --> 00:00:07.000
Okay, sorry for the delay due to technical problems and welcome back to statistical methods
00:00:07.000 --> 00:00:09.000
of language technology.
00:00:09.000 --> 00:00:16.000
Last lecture we talked about the n-gram model, a workhorse for modeling sequences of words
00:00:16.000 --> 00:00:18.000
with probabilities.
00:00:18.000 --> 00:00:25.000
And we have looked at various ways of backing off and various ways of smoothing, because we wanted
00:00:25.000 --> 00:00:31.000
to solve the problem that we have to assign probability mass for unseen events.
00:00:31.000 --> 00:00:38.000
You remember we had an example where we had an unobserved trigram in a trigram model which
00:00:38.000 --> 00:00:42.000
would result in the probability of zero for the sequence if we do maximum likelihood estimation
00:00:42.000 --> 00:00:49.000
and from this a lot of mathematical formulations followed in order to alleviate this problem
00:00:49.000 --> 00:00:51.000
somehow.
00:00:51.000 --> 00:00:58.000
Then we have started with the neural language modeling which is actually one way of dealing
00:00:58.000 --> 00:00:59.000
with this.
00:00:59.000 --> 00:01:07.000
Neural language models in a way naturally smooth over the data and for this we have
00:01:07.000 --> 00:01:12.000
briefly reviewed or introduced artificial neurons.
00:01:12.000 --> 00:01:18.000
For those who haven't heard about these yet, I'll try to introduce them at a level so
00:01:18.000 --> 00:01:23.000
that you have a basic understanding of what they do in language applications.
00:01:23.000 --> 00:01:28.000
If you want to learn more about these, there are dedicated lectures for machine learning
00:01:28.000 --> 00:01:32.000
for neural networks where you can learn all the details about these things.
00:01:32.000 --> 00:01:36.000
What we're going to use in the following is two notions.
00:01:36.000 --> 00:01:39.000
One is the notion of the artificial neuron itself.
00:01:39.000 --> 00:01:41.000
It has some inputs.
00:01:41.000 --> 00:01:47.000
On these inputs are weights, and what is usually done is: you sum up the products of the
00:01:47.000 --> 00:01:52.000
weights and the inputs and you send it through a non-linear function.
00:01:52.000 --> 00:01:57.000
In this case it's the sigmoid function but it could be a different function to generate
00:01:57.000 --> 00:02:04.000
a numerical output and you could either use this output as input in yet another layer
00:02:04.000 --> 00:02:09.000
of neurons and you could have many many of these neurons in parallel or you could use
00:02:09.000 --> 00:02:11.000
them for an output.
00:02:11.000 --> 00:02:16.000
For example, if you want to learn a classifier that tells cat pictures from dog pictures,
00:02:16.000 --> 00:02:23.000
you would want to tune these weights so as to predict what it should predict, to learn a
00:02:23.000 --> 00:02:24.000
function.
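As a sketch of that weighted-sum-plus-nonlinearity idea (the weights, inputs, and bias below are made-up illustration values, not from any real network):

```python
import numpy as np

def sigmoid(z):
    # squash the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical neuron with 3 inputs
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # one weight per input
b = 0.05                         # bias term

# sum of input-weight products, sent through the non-linear function
output = sigmoid(np.dot(w, x) + b)
```

The sigmoid could be swapped for any other non-linearity, as the lecture notes.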
00:02:24.000 --> 00:02:28.000
What we also looked at last time already was the recurrence in neural networks.
00:02:28.000 --> 00:02:35.000
Here we have a typical setup in the sense that you have an input layer, might be more
00:02:35.000 --> 00:02:37.000
than one neuron but here's only one.
00:02:37.000 --> 00:02:41.000
We have a hidden layer and we have an output layer.
00:02:41.000 --> 00:02:46.000
Here is basically where it should produce cat or dog, zero or one in this sense.
00:02:46.000 --> 00:02:54.000
It would be the picture and the hidden layer can be connected to itself via time-delayed
00:02:54.000 --> 00:02:55.000
connections.
00:02:55.000 --> 00:03:01.000
This means the output of this neuron is fed of course to the output neuron but it also
00:03:01.000 --> 00:03:05.000
is fed to itself in the next time step.
00:03:05.000 --> 00:03:10.000
If you want, you can imagine this in an unfolded fashion: you could copy this
00:03:10.000 --> 00:03:20.000
thing a few times and unfold these recurrent links so that this is the situation at time
00:03:20.000 --> 00:03:26.000
step t1, this is the situation at time step t2 and whatever is the output at t1 goes into
00:03:26.000 --> 00:03:32.000
the copy of the neuron at t2 and of course you would have to copy this indefinitely for
00:03:32.000 --> 00:03:37.000
all the time steps.
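The unfolding just described can be sketched for a single recurrent hidden unit; the two weights below are arbitrary illustration values:

```python
import numpy as np

# one recurrent hidden unit, unfolded over time
w_in, w_rec = 0.8, 0.5   # input weight and self (time-delayed) weight

def step(x_t, h_prev):
    # the unit sees the current input AND its own output from the previous step
    return np.tanh(w_in * x_t + w_rec * h_prev)

inputs = [1.0, 0.0, -1.0]   # x at t1, t2, t3
h = 0.0                     # no history before the first time step
states = []
for x_t in inputs:
    h = step(x_t, h)        # output at t is fed into the copy of the unit at t+1
    states.append(h)
```

In principle the loop runs indefinitely, one copy of the unit per time step, just like the unfolded picture.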
00:03:37.000 --> 00:03:43.000
Then we want to use these notions to define neural language models.
00:03:43.000 --> 00:03:50.000
The idea is still that we want to assign a probability of the next word given a bunch
00:03:50.000 --> 00:03:55.000
of preceding words and we still want to use this to define the probability of an entire
00:03:55.000 --> 00:03:59.000
sequence such as a sentence.
00:03:59.000 --> 00:04:09.000
The idea of neural language models, as pioneered by Bengio and others in 2003, was to basically
00:04:09.000 --> 00:04:14.000
work on how we represent the words.
00:04:14.000 --> 00:04:20.000
In the old n-gram model each word was a symbol: different words, different symbols, different
00:04:20.000 --> 00:04:23.000
counts, different statistics, different probabilities.
00:04:23.000 --> 00:04:31.000
Here the idea is basically that we map our vocabulary in our training data to m-dimensional
00:04:31.000 --> 00:04:32.000
vectors.
00:04:32.000 --> 00:04:38.000
These vectors have in this notation m dimensions, they are real valued vectors so it could be
00:04:38.000 --> 00:04:45.000
any number from minus infinity to plus infinity, if you want.
00:04:45.000 --> 00:04:50.000
Every word is associated with one of these vectors and they are called low dimensional
00:04:50.000 --> 00:04:55.000
vectors, but they are not low dimensional in the sense that the number itself is small.
00:04:55.000 --> 00:04:59.000
We live in the three dimensional world so it is very hard to imagine something high
00:04:59.000 --> 00:05:00.000
dimensional.
00:05:00.000 --> 00:05:05.000
Here we are talking about 100 dimensions, 200 dimensions which seems a lot if you compare
00:05:05.000 --> 00:05:13.000
it to the 3D world however it is much less than if you would for example use one dimension
00:05:13.000 --> 00:05:14.000
for each word in the vocabulary.
00:05:14.000 --> 00:05:19.000
This would be millions and now we are down to a few hundred so in this case it is low
00:05:19.000 --> 00:05:22.000
dimensional.
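This back-of-the-envelope comparison can be written out; the numbers are the rough ones from the discussion, not from any particular system:

```python
# rough illustration numbers: a million-word vocabulary, 100-dim embeddings
vocab_size = 1_000_000
embedding_dim = 100

# one dimension per vocabulary word would mean a million dimensions
one_hot_dims = vocab_size

# the dense embedding table instead has vocab_size * embedding_dim entries,
# and each word is a 100-dim real-valued vector rather than a million-dim indicator
embedding_params = vocab_size * embedding_dim
```

So "low dimensional" here means low relative to the vocabulary size, exactly as stated above.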
00:05:22.000 --> 00:05:30.000
So we basically want to use far fewer dimensions than there are words in the vocabulary, and then what
00:05:30.000 --> 00:05:37.000
we have to do is we have to express the probability of what is the next word given preceding words
00:05:37.000 --> 00:05:39.000
in terms of these vectors.
00:05:39.000 --> 00:05:46.000
So what we do is we want to have a function that takes the few words before our current
00:05:46.000 --> 00:05:56.000
word at time step t, and the word itself, and this function should return a pseudo-probability,
00:05:56.000 --> 00:06:01.000
a kind of conditional probability: given that we have observed stuff in the past, what
00:06:01.000 --> 00:06:09.000
is the probability of the next word and since it should be a probability of course if we
00:06:09.000 --> 00:06:16.000
would substitute all possible words in here all possible words in the vocabulary as a
00:06:16.000 --> 00:06:22.000
next word, then these pseudo-probabilities would sum up to one; otherwise it is not a
00:06:22.000 --> 00:06:23.000
probability distribution.
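The standard way these models enforce the sum-to-one requirement is a softmax normalization over raw scores (it comes up again later in the lecture); a minimal sketch with made-up scores for a toy five-word vocabulary:

```python
import numpy as np

def softmax(scores):
    # exponentiate and normalize so the values form a distribution
    exp_s = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp_s / exp_s.sum()

# made-up raw scores, one per candidate next word in a toy 5-word vocabulary
scores = np.array([2.0, -1.0, 0.5, 0.0, 1.0])
p_next = softmax(scores)   # pseudo-probabilities that sum to one
```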
00:06:23.000 --> 00:06:25.000
Why do I say pseudo probability?
00:06:25.000 --> 00:06:30.000
Because it is not an actual probability, because neural networks are not probabilistic models;
00:06:30.000 --> 00:06:36.000
so even if these numbers are distributed like a probability function
00:06:36.000 --> 00:06:42.000
and you can interpret them as probabilities they are not strictly speaking probabilities.
00:06:42.000 --> 00:06:50.000
Anyways so what we want to now learn here is two things one is we have to learn how
00:06:50.000 --> 00:06:56.000
these vectors per word look like, and these vectors are usually called dense vector embeddings
00:06:56.000 --> 00:07:03.000
or embeddings for short and we want to learn how to realize this kind of function with
00:07:03.000 --> 00:07:05.000
a neural network.
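Putting the two parts together, a forward pass of such a model might look roughly like this; all sizes and weights below are tiny made-up toy values, not a real trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sizes; real systems use vocabularies of many thousands of words
vocab_size, m, n_context, hidden = 10, 4, 2, 8

C = rng.normal(size=(vocab_size, m))          # embedding matrix, one row per word
H = rng.normal(size=(hidden, n_context * m))  # hidden-layer weights
U = rng.normal(size=(vocab_size, hidden))     # hidden-to-output weights

def next_word_distribution(context_ids):
    # 1) look up the embedding of each context word and concatenate them
    x = C[context_ids].reshape(-1)
    # 2) non-linear hidden layer
    h = np.tanh(H @ x)
    # 3) one raw score per vocabulary word, softmax-normalized
    scores = U @ h
    exp_s = np.exp(scores - scores.max())
    return exp_s / exp_s.sum()

p = next_word_distribution([3, 7])   # pseudo-probabilities for all 10 words
```

In a trained model both the embedding matrix C and the network weights would be learned, which is exactly the two things named above.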
00:07:05.000 --> 00:07:13.000
To give you an intuition why this would be useful so the idea is basically a good embedding
00:07:13.000 --> 00:07:21.000
would reflect word similarity; this is a notion we didn't have in n-gram language models, but
00:07:21.000 --> 00:07:26.000
there are similar words there are words that behave similarly with respect to our objective
00:07:26.000 --> 00:07:30.000
and our objective is prediction of the next word language modeling.
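Similarity between embedding vectors is commonly measured by cosine similarity; here is a toy sketch with hand-picked (not learned) 3-dimensional vectors, chosen so that "cat" and "dog" point in similar directions:

```python
import numpy as np

# hypothetical embeddings for illustration only; real ones would be learned
emb = {
    "cat": np.array([0.9, 0.1, 0.3]),
    "dog": np.array([0.8, 0.2, 0.35]),
    "car": np.array([-0.5, 0.9, -0.2]),
}

def cosine(a, b):
    # similarity of direction, ignoring vector length
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_cat_dog = cosine(emb["cat"], emb["dog"])   # high: similar words
sim_cat_car = cosine(emb["cat"], emb["car"])   # low: dissimilar words
```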
00:07:30.000 --> 00:07:37.000
So maybe words like dog and cat are similar, maybe words like 'a' and 'the' are similar: it is the
00:07:37.000 --> 00:07:42.000
same grammatical function so they can be replaced in many places and maybe the same goes for
00:07:42.000 --> 00:07:47.000
room and bedroom and is and was and running and walking so what is the intuition here
00:07:47.000 --> 00:07:52.000
the intuition here is if you have a training example such as the cat is walking in the
00:07:52.000 --> 00:08:00.000
bedroom, it could transfer a little bit of its probability mass to similar unobserved
00:08:00.000 --> 00:08:04.000
but still possible sentences such as the dog was running
00:08:04.000 --> 00:08:12.000
in a room and so on right so this is the idea you would transfer probability mass via similarity
00:08:12.000 --> 00:08:21.000
on those vector embeddings okay so to go a little bit into details here we decompose
00:08:21.000 --> 00:08:27.000
this function that should return our pseudo probability in two parts one is basically a
00:08:27.000 --> 00:08:32.000
mapping, and here it's called C, which is a little bit unfortunate because C was previously used as
00:08:32.000 --> 00:08:39.000
count and now it's the embedding matrix sorry about that but this is the original notation
00:08:39.000 --> 00:08:44.000
of the paper I decided to leave it as is because if you want to read the original paper then you're
00:08:44.000 --> 00:08:51.000
not getting confused okay so we have a mapping from any element in our vocabulary to these
00:08:51.000 --> 00:08:59.000
vectors right and basically what that gives us is a matrix and the matrix has m dimensions and
00:08:59.000 --> 00:09:06.000
this is the length of the embedding on the one hand and the number of vocabulary items which is
00:09:06.000 --> 00:09:13.000
very large on the other hand, and these numbers are free parameters, so these are parameters that are
00:09:13.000 --> 00:09:20.000
learned during the process. So it is a lot of parameters, right? Like, you have a vocabulary of maybe
00:09:20.000 --> 00:09:28.000
a million words and then a hundred dimensions so you have a hundred million free parameters
00:09:28.000 --> 00:09:36.000
sounds like a lot but it's actually usually less than what you have to keep track of if you do
00:09:36.000 --> 00:09:43.000
say five-gram counts, four-gram counts, trigram counts, and all the counting required for a large
00:09:43.000 --> 00:09:52.000
scale n-gram model. Okay, so this is the part where basically words are getting kind of
00:09:52.000 --> 00:09:59.000
translated into vector representations, and then we need the function that actually does something
00:09:59.000 --> 00:10:06.000
with these vector representations in order to realize our function here so basically we model
00:10:06.000 --> 00:10:13.000
the function on the basis of a function that takes the preceding vectors of these words and
00:10:13.000 --> 00:10:24.000
then you can yeah put any of the next words in here and this would return you a probability
00:10:24.000 --> 00:10:30.000
distribution over all vocabulary words that can be predicted here which should be the entire
00:10:30.000 --> 00:10:35.000
vocabulary of course because you can always you know say anything after anything it's just
00:10:35.000 --> 00:10:48.000
the question whether it's likely or not and technically it looks like this so this actually
00:10:48.000 --> 00:10:58.000
symbolizes vectors and vectors with dimensions so these red dots are values and these values
00:10:58.000 --> 00:11:05.000
could also be outputs of neurons right so if you have a layer of neurons few neurons next to each
00:11:05.000 --> 00:11:13.000
other and they have output, what comes out is a vector. And now let's take the step from
00:11:13.000 --> 00:11:21.000
traditional n-gram modeling to neural n-gram modeling. So first we have an n-gram model, so
00:11:21.000 --> 00:11:28.000
we look at the few words in the past and these are here so this is the word before our current word
00:11:28.000 --> 00:11:35.000
this is the word two words before, this is the word three words before, up to n words before, depending on what your n is,
00:11:35.000 --> 00:11:44.000
and here are the words and you look them up and basically produce this dense vector embedding per
00:11:44.000 --> 00:11:52.000
word and what you then do is basically you concatenate them you put them all together so
00:11:52.000 --> 00:12:00.000
this is like a single vector and then we put some neural network here so some amount of hidden
00:12:00.000 --> 00:12:08.000
layers that actually learn the function and the function would basically transfer whatever we have
00:12:08.000 --> 00:12:18.000
here in this representation to predictive values for the entire vocabulary. So what we see up here
00:12:18.000 --> 00:12:25.000
is a very long vector so this vector has as many entries as there are words in the vocabulary
00:12:25.000 --> 00:12:36.000
and then we might you know pick the largest one if you want to generate one or have some distribution
00:12:36.000 --> 00:12:45.000
on what is the likely word what is a less likely word given the current past the current context
00:12:45.000 --> 00:12:54.000
yeah let's go a little bit in more details here so what comes out so let's just imagine neural
00:12:54.000 --> 00:12:59.000
network so you know it's basically inputs and it has weights and it has some function and then it
00:12:59.000 --> 00:13:08.000
passes on stuff so basically here a bunch of numbers come in then you copy them up then there's
00:13:08.000 --> 00:13:14.000
a bunch of layers so these numbers kind of get transformed and then in the end there is a direct
00:13:14.000 --> 00:13:22.000
link between the last hidden layer and this output layer and it's fully connected right so
00:13:22.000 --> 00:13:27.000
everything here would actually have a lot of connections with a lot of weights to the previous
00:13:27.000 --> 00:13:33.000
one and if you just sum up the product of the weights and the outputs you get some number here
00:13:35.000 --> 00:13:42.000
the number you get here is some number but nothing actually guarantees that what you have
00:13:42.000 --> 00:13:48.000
up here is a probability distribution for a probability distribution you would need them
00:13:49.000 --> 00:13:57.000
all sum up to one, but we don't have that, right? It's never guaranteed, and since
00:13:57.000 --> 00:14:02.000
we want to have something that resembles probabilities these sort of probabilities
00:14:02.000 --> 00:14:08.000
what we apply here is the so-called softmax it's a softmax layer who has heard about the softmax
00:14:08.000 --> 00:14:14.000
layer before? Okay, most people. It's always applied if you want to basically have a distribution over
00:14:14.000 --> 00:14:22.000
the output layer, and what you do is basically you take them all to the exponent and you
00:14:22.000 --> 00:14:31.000
divide by the sum of all the exponents and since you normalize by the entire sum this actually
00:14:31.000 --> 00:14:37.000
gives you values between zero and one, so this is actually how you enforce that you have something
00:14:38.000 --> 00:14:44.000
that looks like a probability up here and this vector which i call y here is actually an
00:14:44.000 --> 00:14:51.000
unnormalized log probabilities, which we then take to the exponent to make them probabilities
00:14:51.000 --> 00:14:59.000
and normalize them and this is composed of a bias and a weight matrix that actually directly takes
00:15:00.000 --> 00:15:10.000
some connections between these guys and up here plus some neural network which basically does
00:15:10.000 --> 00:15:15.000
something with their entirety and transforms them and this picture is under specified in the sense
00:15:15.000 --> 00:15:22.000
that we have most computation here which is basically uh some arbitrary neural network so
00:15:22.000 --> 00:15:27.000
usually they use one hidden layer of like 50 neurons or something like that, but it could be
00:15:27.000 --> 00:15:36.000
arbitrarily complex in here. Okay, so this is a setup for how to basically do a neural version
00:15:36.000 --> 00:15:44.000
of the n-gram model. If there are any questions before I lose you, please ask them rather earlier
00:15:44.000 --> 00:15:53.000
than later. Looks good. Okay, let's talk a little bit about parameterization, about differences,
00:15:53.000 --> 00:16:00.000
and how to learn these things so what we have is a lot of parameters in this so we talked already
00:16:00.000 --> 00:16:07.000
about the parameters that are created just by having this vocabulary-by-embedding matrix; it's
00:16:07.000 --> 00:16:13.000
full of free parameters, but we of course also have parameters for the network in between, and
00:16:15.000 --> 00:16:22.000
these, let's call them omega, and these C parameters for the matrix and the omega parameters form the
00:16:22.000 --> 00:16:30.000
overall parameter set, let's call it theta. And now we want to train this, and our training
00:16:30.000 --> 00:16:34.000
objective is language modeling so basically what we want to do is we want to maximize
00:16:35.000 --> 00:16:40.000
the training corpus likelihood, while in fact we want to maximize the held-out test set's likelihood,
00:16:40.000 --> 00:16:46.000
but we can't optimize on this directly because it's a held-out test set, so we do this on the
00:16:46.000 --> 00:16:52.000
training corpus so the likelihood is actually we basically take whatever comes out of this
00:16:52.000 --> 00:17:02.000
function parameterized by theta, and we sum this in log space. So you remember, summing in the
00:17:02.000 --> 00:17:08.000
log space is like a product in the normal space, but summing over log probs is
00:17:08.000 --> 00:17:13.000
numerically much more stable this is why we do it and then divided by the corpus length and this is
00:17:13.000 --> 00:17:20.000
basically kind of the average prediction probability for each word in our training corpus
00:17:20.000 --> 00:17:26.000
plus some regularization term this is something I can't go into basically regularization is a
00:17:26.000 --> 00:17:33.000
method to prevent overfitting to prevent total memorization of the training set which usually
00:17:33.000 --> 00:17:38.000
results in very bad performance on held out test sets and in this paper they use something which
00:17:38.000 --> 00:17:43.000
is called weight decay penalty so they want to make sure that there are no super high weights
00:17:43.000 --> 00:17:48.000
where actually everything goes through, because this is how a network would memorize particular
00:17:48.000 --> 00:17:56.000
examples, but they want to have the weights kind of nicely distributed, and this is like a penalty
00:17:56.000 --> 00:18:01.000
term that regulates that and again I have to point to other lectures where this is discussed
00:18:01.000 --> 00:18:10.000
in more detail. And how do we actually adapt the parameters? What we do is we set the new parameters
00:18:10.000 --> 00:18:17.000
by taking the old parameters and changing them a little bit and we change them into the right
00:18:17.000 --> 00:18:24.000
direction and to give you an intuition so basically the parameter space
00:18:25.000 --> 00:18:33.000
would induce something like you know a hilly landscape so where actually uphill is
00:18:34.000 --> 00:18:41.000
better likelihood let's say and we can evaluate our parameter space maybe we are here
00:18:42.000 --> 00:18:50.000
and we actually know what we expected right because we know if we have training data
00:18:53.000 --> 00:18:58.000
and we have you know run this on the past and have looked at what comes out of the softmax
00:18:58.000 --> 00:19:04.000
maybe the actual word we observed is this one and ideally this would be one and the other ones
00:19:04.000 --> 00:19:14.000
would be zero and what we do is basically we take the derivative of the probability
00:19:16.000 --> 00:19:22.000
by the parameter space which gives us a direction of how would we actually walk where would we go
00:19:22.000 --> 00:19:27.000
in order to tune this more towards what we actually wanted to see in the output, which is
00:19:27.000 --> 00:19:34.000
a one at the real observed word and zeros at the others, but of course we
00:19:34.000 --> 00:19:42.000
don't want to just set it directly to this one because this would yeah be very susceptible to
00:19:42.000 --> 00:19:48.000
the last example we learned, and this is why we put a little learning rate epsilon here, which is a
00:19:48.000 --> 00:19:56.000
small number like 0.00 something and this actually tunes the parameter set a little bit into the
00:19:56.000 --> 00:20:03.000
right direction and we do this several times uh for many many iterations and hope that we actually
00:20:03.000 --> 00:20:11.000
yeah gradually move our parameter configuration to a place where it actually optimizes the training
00:20:11.000 --> 00:20:21.000
objective which is likelihood of the corpus okay um so this is a table in the original paper
00:20:21.000 --> 00:20:28.000
and i decided to actually just throw it at you because it's very instructive in the sense that
00:20:28.000 --> 00:20:33.000
you might think this is a big table, but no: if people do these kinds of experiments
00:20:34.000 --> 00:20:40.000
and care to print a table like this in a paper, chances are that they have, somewhere in the back,
00:20:40.000 --> 00:20:45.000
a table of like a thousand rows or maybe five thousand rows, a lot of experiments,
00:20:45.000 --> 00:20:50.000
right so this is just what they report so if you read this kind of modern paper especially in the
00:20:51.000 --> 00:20:56.000
neural learning papers what you see is the tip of the iceberg and it all sounds so easy but you
00:20:56.000 --> 00:21:02.000
don't see what actually has happened in the background here uh okay let's look at this
00:21:02.000 --> 00:21:09.000
table a little bit so what we see here is also a typical um thing in research you basically want
00:21:09.000 --> 00:21:14.000
to show that it's better so you need an evaluation measure the evaluation measure here is perplexity
00:21:15.000 --> 00:21:20.000
and they report it on the training validation and test set and you also want to show it's
00:21:20.000 --> 00:21:27.000
better so you compare this to baselines and baselines are usually the best previously known
00:21:27.000 --> 00:21:32.000
kind of models; here these are class-based backoff models. I haven't talked about these, but it's
00:21:32.000 --> 00:21:38.000
basically you cluster words together in classes so that you get smaller vocabulary and therefore
00:21:38.000 --> 00:21:44.000
more training examples per class. This is Kneser-Ney backoff, which is something I decided to take
00:21:44.000 --> 00:21:49.000
out of the lecture because it's uh mathematically even more complex than what we saw before
00:21:50.000 --> 00:21:55.000
but basically at the time it was the best sparse n-gram-based traditional model.
00:21:56.000 --> 00:22:04.000
Here is the deleted interpolation, and here we have different kinds of n, right, as in an n-gram,
00:22:05.000 --> 00:22:13.000
and then what they do up there is basically different variants of the neural model. So they
00:22:13.000 --> 00:22:19.000
vary n; they vary h, which is the number of hidden units, so this is where most of the computation
00:22:19.000 --> 00:22:27.000
is happening; they vary m, which is the dimensionality of the vectors, they just do 30
00:22:27.000 --> 00:22:32.000
and 60 here um they vary whether there are direct connections or not so this would be
00:22:33.000 --> 00:22:38.000
in this picture whether there is a direct weight matrix from these inputs to the
00:22:39.000 --> 00:22:42.000
uh softmax layer or whether you don't have those
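Those direct connections amount to an optional extra weight matrix straight from the concatenated input embeddings to the output scores, alongside the non-linear hidden path; a sketch with made-up toy sizes and random illustration weights:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, ctx_dim, hidden = 6, 4, 5

# weights for the hidden path and the optional direct path (illustration values)
H = rng.normal(size=(hidden, ctx_dim))
d = rng.normal(size=hidden)
U = rng.normal(size=(vocab_size, hidden))
W = rng.normal(size=(vocab_size, ctx_dim))  # direct input-to-output connections
b = rng.normal(size=vocab_size)

def raw_scores(x, direct=True):
    # the non-linear hidden path is always there
    y = b + U @ np.tanh(d + H @ x)
    if direct:
        # extra weight matrix straight from the inputs to the output layer
        y = y + W @ x
    return y

x = rng.normal(size=ctx_dim)                # stands in for concatenated embeddings
with_direct = raw_scores(x, direct=True)
without_direct = raw_scores(x, direct=False)
```

Whether the `W @ x` term is present or absent is exactly the on/off switch varied in the table.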
00:22:44.000 --> 00:22:55.000
and then they vary the mix, and the mix is basically, in this case: they can do a normal trigram
00:22:55.000 --> 00:23:03.000
maximum likelihood estimation model, which suffers from these unseen-event zeros, and they say okay, half
00:23:03.000 --> 00:23:08.000
of the prediction comes from this one and i just take the other half from my neural
00:23:09.000 --> 00:23:17.000
language model. Why would you want to do that? Well, the nice thing about this n-gram model is that it actually
00:23:17.000 --> 00:23:24.000
memorizes symbolically so it actually does distinguish different words whereas our neural
00:23:24.000 --> 00:23:29.000
model maybe does not distinguish some words because maybe some of the words are mapped on
00:23:29.000 --> 00:23:35.000
the same location, or almost the same location, in the vector space, and therefore some
00:23:35.000 --> 00:23:41.000
subtle differences in word usage get lost. So maybe we, you know, just take a kind of best-of-both-worlds
00:23:41.000 --> 00:23:49.000
approach here and then they report training perplexity which is not interesting validation
00:23:49.000 --> 00:23:55.000
perplexity so this is actually how you optimize your parameter space and then in the end you report
00:23:55.000 --> 00:24:03.000
on test. And what comes back here is basically that what performs best is if you in fact do this
00:24:03.000 --> 00:24:09.000
mixing so this is actually always a little better than the non-mix version and in this case it's a
00:24:09.000 --> 00:24:17.000
very low number of dimensions, only 30, but they need a hundred hidden units. And this
00:24:17.000 --> 00:24:26.000
is basically the first neural language model there was; this is how it was invented. Yeah, so
00:24:26.000 --> 00:24:32.000
it's not only neural it has this mixing kind of thing and this is actually not uncommon for
00:24:32.000 --> 00:24:42.000
language modeling, as we shall see a few slides down. Maybe some general notes on neural
00:24:42.000 --> 00:24:51.000
approaches: they have advantages and they have drawbacks. An advantage is actually that
00:24:52.000 --> 00:24:59.000
these numbers are better right so perplexity went down from like the 300s so here was the
00:24:59.000 --> 00:25:06.000
best classical one, 312, to the 250s; the best new one was around 250, so this is like a major drop
00:25:06.000 --> 00:25:13.000
in perplexity. This is actually very good, right? It reduces perplexity by like 20 percent;
00:25:13.000 --> 00:25:21.000
this is massive. Okay, but it comes at a cost, because with the traditional n-gram models you
00:25:21.000 --> 00:25:27.000
can just count and then put the counts into the smoothing, backoff, whatever formulas we had; here we have a
00:25:27.000 --> 00:25:34.000
lot of computation, and the computation already starts when evaluating, because whenever we
00:25:35.000 --> 00:25:42.000
basically match the output of our model against the actual observation in order to
00:25:43.000 --> 00:25:47.000
compute the derivative to percolate back how we should change the weights
00:25:48.000 --> 00:25:55.000
we have to operate on probabilities so we have to basically carry out the entire softmax layer
00:25:55.000 --> 00:26:01.000
which means that we have to evaluate it for every word in order to be able to properly normalize the
00:26:01.000 --> 00:26:09.000
probability uh for the word we're interested in a lot of computation um and this was not
00:26:09.000 --> 00:26:15.000
possible just 20 years ago, because it requires parallel processing to scale this to corpus
00:26:15.000 --> 00:26:22.000
sizes where it actually becomes interesting to model anything useful um yeah and then of course
00:26:22.000 --> 00:26:28.000
there's hyperparameters. So there are parameters, which, as mentioned, are the
00:26:29.000 --> 00:26:34.000
weights in the matrices in the networks, and hyperparameters are the parameters that control
00:26:34.000 --> 00:26:39.000
the overall uh process and the overall architecture so such as number of hidden
00:26:39.000 --> 00:26:44.000
units would be a hyper parameter the learning rate you can vary the number of epochs which is how
00:26:44.000 --> 00:26:48.000
often you go through the data, which of course, the smaller your learning rate, the more often you
00:26:48.000 --> 00:26:54.000
should go through, so they kind of influence each other; what kind of regularization do you want to use,
00:26:55.000 --> 00:27:02.000
what is the dimension of these embeddings and so on and uh now you could just try a lot of these
00:27:02.000 --> 00:27:08.000
but again computationally not feasible you cannot do a grid search in all the possible combinations
00:27:08.000 --> 00:27:15.000
of hyperparameters, so it kind of requires experience; it's kind of a black art, this
00:27:15.000 --> 00:27:20.000
hyperparameter optimization. Of course you start with some brute force, but then you have to
00:27:21.000 --> 00:27:25.000
apply some heuristic of your own in order to find something reasonable. So again, if you read these
00:27:26.000 --> 00:27:31.000
papers on neural learning you just get the tip of the iceberg they basically kind of tell you okay
00:27:31.000 --> 00:27:37.000
so in these ranges we found good stuff but how to arrive at these ranges very hard so if you try to
00:27:38.000 --> 00:27:42.000
do your own hands-on work on a new problem, this is what you have to face.
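To see why brute-force grid search over hyperparameters is computationally infeasible, a rough count helps; the candidate lists below are invented for illustration, not taken from the paper:

```python
# each hyperparameter gets a small list of candidate values (made-up examples)
candidates = {
    "hidden_units": [50, 100, 200],
    "embedding_dim": [30, 60, 100],
    "learning_rate": [0.1, 0.01, 0.001],
    "epochs": [5, 10, 20, 40],
    "weight_decay": [0.0, 1e-4, 1e-5],
}

# a full grid search needs one complete training run per combination
grid_size = 1
for values in candidates.values():
    grid_size *= len(values)
```

Even this tiny grid needs hundreds of full training runs, and each run is itself expensive, which is why experience and heuristics take over.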
00:27:42.000 --> 00:27:54.000
Okay. And maybe one more thing on neural language modeling: what this model has
00:27:54.000 --> 00:28:01.000
solved is our smoothing problem and it's actually much nicer uh in the sense that we don't have to
00:28:02.000 --> 00:28:08.000
fiddle around with this backoff or some mixing parameters or some thresholds or some
00:28:09.000 --> 00:28:15.000
reshuffling of probabilities and probability mass it basically is all kind of in these embeddings
00:28:15.000 --> 00:28:23.000
somehow um this is very nice but what it doesn't solve yet is the thing about the finiteness of
00:28:23.000 --> 00:28:29.000
the history, right? So this model is just like the n-gram models we looked at; of course you could
00:28:29.000 --> 00:28:35.000
put an arbitrary number here but it's a fixed arbitrary number and of course the more you put
00:28:35.000 --> 00:28:40.000
there the more parameters you have to learn again because this creates a lot of weights here
00:28:40.000 --> 00:28:47.000
and creates a bigger network here, so we have the same issue here as with n-gram models; so
00:28:47.000 --> 00:28:52.000
if you make this too long, then the training data is too sparse, and this is why it usually stops at
00:28:52.000 --> 00:29:00.000
three, four, or five. But we also had this concept of the recurrent neural network, and we can use the
00:29:00.000 --> 00:29:06.000
recurrent neural network to actually do away with this limitation and to in principle look
00:29:06.000 --> 00:29:17.000
at an infinite history. And to give you an intuition, this is a model... so, who has heard
00:29:17.000 --> 00:29:23.000
of word2vec? This is kind of a predecessor of word2vec. So word2vec was invented in 2013 by
00:29:23.000 --> 00:29:30.000
Mikolov and others, and this was invented a few years before that by Mikolov and others. So it's not
00:29:30.000 --> 00:29:35.000
word2vec, but very similar, and conceptually this is on the path to word2vec, and we're going
00:29:35.000 --> 00:29:40.000
to talk about word2vec later in the lecture, but not today. So this is the recurrent version
00:29:41.000 --> 00:29:47.000
of a language model with infinite history how does it work sorry you have an input layer
00:29:48.000 --> 00:29:54.000
and this is just like in the other ones the embedding matrix okay so a word gets
00:29:55.000 --> 00:30:03.000
looked up in the embedding matrix numbers are placed here then this is uh by some neural network
00:30:03.000 --> 00:30:11.000
uh transformed transformed into the current context and this layer here would actually represent
00:30:12.000 --> 00:30:17.000
whatever i know about the past what do i know about the past i know the last word
00:30:18.000 --> 00:30:27.000
and i know the past of the last time step so this is another way of uh painting this dashed arrow
00:30:27.000 --> 00:30:33.000
which is time-delayed: you feed the context at time step t minus one into the current
00:30:33.000 --> 00:30:42.000
context, and you add the input here. Now, if you have some context at time step one, some of it will
00:30:42.000 --> 00:30:49.000
be available at time step two, and some of it will still be available at time step three,
00:30:49.000 --> 00:30:54.000
because it was in time step two and is fed forward from time two, and so on. So in principle this
00:30:54.000 --> 00:31:01.000
has an infinite history and can memorize the entire context of whatever you have seen. Of course it
00:31:01.000 --> 00:31:07.000
doesn't memorize it verbatim, but it memorizes it, somewhat convolved, in this representation, and
00:31:07.000 --> 00:31:14.000
the information can still be there. And from this context, which basically encodes what you've
00:31:14.000 --> 00:31:21.000
seen before, you again have some neural network, with again an output softmax layer over the next
00:31:21.000 --> 00:31:28.000
word, and again you train this just like we trained the other one, for example by stochastic
00:31:28.000 --> 00:31:36.000
gradient ascent. And again, this was evaluated in a mixture situation: this
00:31:36.000 --> 00:31:44.000
model alone actually did not improve perplexity over the baseline, but in combination with a
00:31:44.000 --> 00:31:52.000
Kneser-Ney 5-gram language model it actually did. And they tried different things
00:31:52.000 --> 00:31:57.000
in this paper: they tried different vocabulary sizes, from 200,000 words up to 6.4
00:31:57.000 --> 00:32:03.000
million words, so the embedding matrix is actually 6.4 million words long,
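The recurrent step just described (embedding lookup, combining the current word with the carried-over context, softmax over the next word) can be sketched in a few lines. This is a minimal sketch, not the model from the paper; all sizes, weight names and the random initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10, 8, 16   # toy vocabulary, embedding and hidden sizes (illustrative)

E = rng.normal(0, 0.1, (V, d))      # embedding matrix: one row per word
W_in = rng.normal(0, 0.1, (h, d))   # embedded word -> hidden weights
W_rec = rng.normal(0, 0.1, (h, h))  # context(t-1) -> context(t): the "dashed arrow"
W_out = rng.normal(0, 0.1, (V, h))  # context -> output scores

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(word_id, h_prev):
    """One time step: combine the current word with the previous context."""
    x = E[word_id]                            # look up the embedding
    h_t = np.tanh(W_in @ x + W_rec @ h_prev)  # new context state
    p_next = softmax(W_out @ h_t)             # distribution over the next word
    return h_t, p_next

h_t = np.zeros(h)
for w in [3, 1, 4]:                           # feed a toy word sequence
    h_t, p_next = step(w, h_t)

assert np.isclose(p_next.sum(), 1.0)          # a proper distribution
```

Training would adjust E and the W matrices, for example by backpropagation through time; that part is omitted here.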
00:32:03.000 --> 00:32:10.000
and they measured perplexity, and perplexity goes down if you add the recurrent neural network,
00:32:11.000 --> 00:32:17.000
so these are consistent, nice improvements. What they also measured here is word error rate,
00:32:18.000 --> 00:32:23.000
because they put this new language model into a speech recognition system. Speech recognition
00:32:23.000 --> 00:32:28.000
systems take the sound and transform it somehow into a phonetic representation,
00:32:29.000 --> 00:32:35.000
and there are words that sound similar, like honey, money, sunny, or whatever,
00:32:36.000 --> 00:32:42.000
and a language model is used to rescore them based on the context. So you could use a language model
00:32:42.000 --> 00:32:47.000
to basically say: okay, I have this bunch of words for which I have some likelihoods from the
00:32:47.000 --> 00:32:52.000
sound representation, but I also have likelihoods for how well they actually fit into the context,
00:32:52.000 --> 00:32:56.000
and this is where the language model is used. There you can measure word error rates on
00:32:57.000 --> 00:33:05.000
transcribed test sets, and also here the word error rates go down. Maybe one comment on this:
00:33:05.000 --> 00:33:12.000
we have seen here perplexity values in the range of 150 to 300, and before we have
00:33:12.000 --> 00:33:21.000
seen perplexity values around 200 to 300. Are these comparable? No, they're not comparable, because
00:33:22.000 --> 00:33:30.000
in this case they evaluate on different test corpora: the ones before were
00:33:30.000 --> 00:33:36.000
evaluated on the Brown corpus, and this is the Wall Street Journal corpus, which is different. It basically
00:33:36.000 --> 00:33:40.000
doesn't make sense to look at these absolute numbers. Sometimes you'll see badly
00:33:40.000 --> 00:33:46.000
written papers or badly written master's theses where, in the related work section, people just report:
00:33:46.000 --> 00:33:52.000
okay, these get a 95 percent score and those get a 98 percent score, so the 98 ones
00:33:52.000 --> 00:34:00.000
are better. Not necessarily: if it's not the same corpus, it's not comparable. You sometimes
00:34:00.000 --> 00:34:07.000
read things like "the perplexity of English is 80". What is "English" supposed to be here? On this
00:34:07.000 --> 00:34:12.000
particular corpus it might be 80, but this doesn't tell us anything about the language in itself.
00:34:12.000 --> 00:34:19.000
So, a caveat: never compare numbers that are not comparable, and they're only comparable if
00:34:19.000 --> 00:34:29.000
you evaluate under the same conditions, in this case on the same corpus. Okay, a few last
00:34:29.000 --> 00:34:36.000
words on neural language models. Basically, what we did here is transform what we could
00:34:36.000 --> 00:34:43.000
call symbolic units, namely words, into continuous representations, into these embeddings,
00:34:44.000 --> 00:34:50.000
and you could do this in a good way or in a bad way, but if you do it in a good way, then
00:34:50.000 --> 00:34:56.000
similar words have similar representations, because then they can transfer probability
00:34:56.000 --> 00:35:02.000
mass to each other, and you get generalization, smoothing, and all these good
00:35:02.000 --> 00:35:08.000
properties. And what does similar mean? Similar means similar with respect to the training
00:35:08.000 --> 00:35:13.000
objective. So if you train neural dense vector embeddings on language modeling, they're good for
00:35:13.000 --> 00:35:19.000
language modeling; it doesn't follow that they are good for information retrieval. They might be, but
00:35:19.000 --> 00:35:24.000
it's not guaranteed. So you always want to train them on the objective that you're
00:35:24.000 --> 00:35:29.000
actually interested in. What we see is that they generally perform better than sparse
00:35:29.000 --> 00:35:36.000
n-gram models, and they're also more compact at application time, so it's easier to fit those
00:35:36.000 --> 00:35:46.000
models onto, say, embedded devices or smartphones. But it's harder to train them: the
00:35:46.000 --> 00:35:54.000
training actually takes place on large parallelized GPU or TPU server architectures, very expensive
00:35:54.000 --> 00:36:00.000
stuff, because the training itself is more expensive, but also because the hyperparameter search
00:36:01.000 --> 00:36:05.000
makes it extra expensive, since for every hyperparameter configuration you have to
00:36:05.000 --> 00:36:13.000
retrain, maybe for many epochs. And what is the state of affairs? These are now the standard in
00:36:13.000 --> 00:36:21.000
NLP, because natural language processing has largely moved to neural methods, and for
00:36:21.000 --> 00:36:28.000
neural methods you need vectors, or numbers; you can't really deal with symbols. So
00:36:28.000 --> 00:36:33.000
this is why everyone is using these, and current research directions go towards
00:36:33.000 --> 00:36:40.000
contextualized word embeddings. What we have talked about here is: we have a word and we
00:36:40.000 --> 00:36:45.000
look up its embedding. But some words have different meanings in different contexts; we had
00:36:45.000 --> 00:36:51.000
this river bank versus money bank example in the first lecture, and you would actually want
00:36:51.000 --> 00:36:58.000
to represent these two banks differently, according to the current context you observe them in, and
00:36:58.000 --> 00:37:05.000
this is what modern approaches like ELMo, BERT and GPT are doing, and I'm sure next year I can
00:37:06.000 --> 00:37:10.000
make this list even longer, because it's a very active field of research right now.
00:37:11.000 --> 00:37:15.000
An active field of research, but dominated by the big players, because you need so much hardware to
00:37:15.000 --> 00:37:22.000
actually train these large-scale models that only Google, Facebook, Apple and Amazon can actually do
00:37:22.000 --> 00:37:30.000
it. Okay, this is all I have to say about n-gram language models and neural models. Are
00:37:30.000 --> 00:37:40.000
there any burning questions? That does not seem to be the case, so we continue, after this little
00:37:40.000 --> 00:37:53.000
excursion into neural methods, back to more symbolic modeling: hidden Markov models.
00:37:55.000 --> 00:38:01.000
And actually this lecture is not about just telling you the latest stuff there is; the
00:38:01.000 --> 00:38:05.000
lecture is about making you understand how to arrive at this latest stuff. So
00:38:06.000 --> 00:38:10.000
I had to walk you through all these n-gram models in order to make you understand
00:38:10.000 --> 00:38:14.000
what's so good about neural language models, and I have to walk you through hidden Markov
00:38:14.000 --> 00:38:20.000
models in order to make you understand what's so nice about conditional random fields. I
00:38:20.000 --> 00:38:25.000
had comments in the evaluations saying you do too many old methods, but I don't think so; I think you
00:38:25.000 --> 00:38:29.000
have to understand these old methods, otherwise you cannot appreciate the new ones, and then you
00:38:29.000 --> 00:38:35.000
underestimate their complexity. Okay, who has heard about hidden Markov models before?
00:38:35.000 --> 00:38:44.000
Okay, very good. Who has not heard about them? Okay, one, two. Of course we will
00:38:44.000 --> 00:38:51.000
do this, but maybe a little quickly. So, you remember Markov chains,
00:38:51.000 --> 00:38:58.000
because I talked about Markov chains in the last lecture. What were Markov chains? Basically, we
00:38:58.000 --> 00:39:03.000
count n-grams and normalize the counts into probabilities, and we have a weighted finite state automaton.
00:39:03.000 --> 00:39:10.000
And we had the problem of sparse data, many n-grams not seen in training, and then we talked about
00:39:10.000 --> 00:39:15.000
back-off smoothing, and back-off smoothing can be realized by a mixture model, for example:
00:39:15.000 --> 00:39:21.000
in a trigram linear interpolation model, if you want to predict the next word
00:39:21.000 --> 00:39:26.000
on the basis of the two preceding ones, you can do this by predicting it on the basis of the
00:39:26.000 --> 00:39:33.000
unigram probability, the bigram probability and the trigram probability, where you have
00:39:33.000 --> 00:39:37.000
these little lambdas that appropriately mix them together, such that
00:39:37.000 --> 00:39:43.000
it still stays a probability; this is the condition back here. And we can now use hidden
00:39:43.000 --> 00:39:50.000
Markov models to model this mixture directly, and also to train these lambda weights
00:39:50.000 --> 00:39:56.000
in this case. And this is just one example of hidden Markov models. They are called
00:39:56.000 --> 00:40:03.000
hidden Markov models because there are hidden states, and for our formula here it could
00:40:03.000 --> 00:40:10.000
look like this. So we could say: okay, in our example language of a's and b's, we have
00:40:11.000 --> 00:40:19.000
a state where the history is ab, and next we can add an a or a b, so the next
00:40:20.000 --> 00:40:27.000
state can be either ba, because this is the b here, and then we add an a,
00:40:28.000 --> 00:40:36.000
or the next state can be bb, because we add a b. In a traditional Markov chain you would
00:40:36.000 --> 00:40:45.000
just directly connect this state with this one, accepting or generating the letter
00:40:45.000 --> 00:40:52.000
a with some probability, and you would connect these directly with the b. What we do here is
00:40:52.000 --> 00:40:58.000
we say: okay, now we want to model this mixture, so we introduce a set of hidden states, let's call
00:40:58.000 --> 00:41:05.000
them lambda one, two, three, and we have epsilon transitions whose probabilities are set according
00:41:05.000 --> 00:41:11.000
to our lambdas, and then we can be either here, or here, or here, and depending on where we are
00:41:11.000 --> 00:41:17.000
we take the unigram probability, the bigram probability, or the trigram probability.
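As a sketch of the mixture itself: below is trigram linear interpolation with maximum likelihood estimates. The corpus and the lambda values are made up for illustration.

```python
# Linear interpolation of unigram, bigram and trigram estimates:
# P(w | u, v) = l1*P(w) + l2*P(w|v) + l3*P(w|u,v), with l1 + l2 + l3 = 1.
from collections import Counter

corpus = "a b a a b b a b a a".split()   # toy corpus, made up
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def p_uni(w):       return uni[w] / N
def p_bi(w, v):     return bi[(v, w)] / uni[v] if uni[v] else 0.0
def p_tri(w, u, v): return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

def p_interp(w, u, v, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas   # must sum to 1 so the mixture stays a probability
    return l1 * p_uni(w) + l2 * p_bi(w, v) + l3 * p_tri(w, u, v)

# For a fixed history the mixture still sums to one over the vocabulary:
total = sum(p_interp(w, "a", "b") for w in uni)
assert abs(total - 1.0) < 1e-9
```

The trigram term gets a probability even when the trigram itself was never seen, because the unigram and bigram terms fill in; that is exactly the back-off effect described above.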
00:41:19.000 --> 00:41:26.000
And this is now suddenly a non-deterministic weighted finite state automaton,
00:41:26.000 --> 00:41:33.000
because we have these epsilon transitions. So while we know that, if we are in this state
00:41:34.000 --> 00:41:39.000
and we see a b, we're going to be in this state afterwards, we don't know how we got there,
00:41:41.000 --> 00:41:46.000
and depending on how you want to model it, you either take one of these, or you
00:41:46.000 --> 00:41:54.000
take them all at the same time, but we don't know the path. And this is why it's
00:41:54.000 --> 00:42:01.000
called hidden: the sequence of states in this model is no longer determined by the
00:42:01.000 --> 00:42:09.000
input symbols, or by the emitted output symbols if you use it to generate. And
00:42:09.000 --> 00:42:17.000
this is the formal definition of a hidden Markov model. It's kind of like a finite state machine:
00:42:17.000 --> 00:42:22.000
you have a set of states, you have an initial state probability distribution that tells us where we're
00:42:22.000 --> 00:42:27.000
going to start, we have a finite alphabet of input symbols, and we have a transition function: we
00:42:27.000 --> 00:42:37.000
are in a state and either we see an input symbol or we see nothing; then we have a weight function
00:42:37.000 --> 00:42:44.000
with values between zero and one, normalized per state so that it is a probability, and we go to
00:42:44.000 --> 00:42:57.000
the next state. And yes, it's non-deterministic. And it is a probabilistic model,
00:42:57.000 --> 00:43:04.000
so as opposed to these neural models, these are actually real probabilities. And if you talk about
00:43:04.000 --> 00:43:10.000
probabilities, the question is always: how is it normalized? It's actually normalized over
00:43:10.000 --> 00:43:17.000
all possible combinations of state sequences and symbols. So if you want things to sum up to one,
00:43:18.000 --> 00:43:25.000
you would actually have to generate all the possible sequences and, for each sequence,
00:43:26.000 --> 00:43:30.000
all the possible paths that you can take through this network, through this
00:43:31.000 --> 00:43:36.000
weighted finite state automaton, and this in its entirety sums up to one. So it's normalized
00:43:37.000 --> 00:43:44.000
over all possible sequences and all possible paths that can be taken for these sequences,
00:43:45.000 --> 00:43:50.000
which of course means that the probability of a single sequence is usually very low,
00:43:50.000 --> 00:43:57.000
but we don't care so much about that; just as in n-gram language modeling, we use this for
00:43:57.000 --> 00:44:01.000
comparing things, asking which is more probable than the other; the absolute number doesn't really
00:44:01.000 --> 00:44:07.000
matter so much. Okay, so how do we actually compute the probability of a sequence?
00:44:08.000 --> 00:44:18.000
Given the sequence of input symbols s1 to sn, we are actually summing over all possible
00:44:19.000 --> 00:44:31.000
state sequences under which we can accept this sequence. What does that mean?
00:44:31.000 --> 00:44:42.000
Basically, we have to see: okay, we are in a state at time step one, and if we are in a state at
00:44:42.000 --> 00:44:49.000
time step one, then there's a conditional probability of accepting the first input
00:44:49.000 --> 00:44:59.000
symbol and going to the state at time step two, and once we have that, there's a conditional probability
00:44:59.000 --> 00:45:08.000
of going to the state at time step three by accepting the second symbol of the input,
00:45:08.000 --> 00:45:12.000
and once we have that, and so on, and so on; this becomes lengthy.
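Written out, the sum just described might look like this; take it as a plausible rendering, with $z_t$ for the state at time step $t$:

```latex
P(s_1 \dots s_n) \;=\; \sum_{z_1 \dots z_{n+1}} P(z_1)\,
  \prod_{t=1}^{n} P\bigl(z_{t+1}, s_t \mid z_1 \dots z_t,\; s_1 \dots s_{t-1}\bigr)
```

Under the Markov assumption that follows, each factor collapses to $P(z_{t+1}, s_t \mid z_t)$.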
00:45:13.000 --> 00:45:22.000
And now we use the Markov assumption. The Markov assumption was, just as in the Markov chain, the
00:45:22.000 --> 00:45:32.000
limited horizon: the next state depends only on the last state and the input symbol, and not
00:45:32.000 --> 00:45:36.000
on a list of previous states, not on a longer history; basically, the history is the state,
00:45:36.000 --> 00:45:44.000
nothing else, one single state. This is why we can actually shorten these conditional
00:45:45.000 --> 00:45:52.000
probabilities down to conditioning on just the last state. So, for knowing the probability of going to the
00:45:53.000 --> 00:46:00.000
second state in time by accepting the first symbol of the sequence, we have to know
00:46:00.000 --> 00:46:07.000
only the first state, and for computing the probability of going to the last
00:46:08.000 --> 00:46:14.000
state in the state sequence, which is one more state than the number of symbols, because you
00:46:14.000 --> 00:46:19.000
start with a state and you end with a state and in between you accept symbols, you only look at the
00:46:19.000 --> 00:46:26.000
state before it. So this is the Markov assumption, and then we can write this as a product, which is
00:46:26.000 --> 00:46:34.000
basically the product over all of these different steps in time, and then we sum
00:46:34.000 --> 00:46:40.000
over all possible state sequences, because, mind you, it's not always the same state sequence;
00:46:40.000 --> 00:46:48.000
there can be different state sequences even for the same input. And we can also use a different
00:46:48.000 --> 00:46:55.000
notation here: basically, we go from this state, with this symbol, to this state,
00:46:55.000 --> 00:47:04.000
and this is basically the same, just a different notation. And this is basically
00:47:05.000 --> 00:47:12.000
the probability of a sequence. Of course we want to have a good model, one that models
00:47:13.000 --> 00:47:19.000
the known sequences in our training data well. Specifically, there are three tasks we want
00:47:19.000 --> 00:47:27.000
to do with hidden Markov models. First, given a model defined by an HMM, how do we efficiently
00:47:27.000 --> 00:47:34.000
compute the probability of an observation sequence? The focus here is on efficiently, because
00:47:35.000 --> 00:47:41.000
enumerating all possibilities is intractable. Second, given this observation
00:47:41.000 --> 00:47:49.000
sequence and the model, how do we choose the state sequence that best explains the
00:47:49.000 --> 00:47:57.000
observations? Because maybe we want to use the sequence of states as labels in some subsequent
00:47:57.000 --> 00:48:03.000
process, so now we're interested in the best path somehow. And the third question is: how
00:48:03.000 --> 00:48:08.000
do we actually get the model? We have observations and we have a space of possible
00:48:08.000 --> 00:48:14.000
models; how can we train the model, how can we actually set the weights on these
00:48:16.000 --> 00:48:22.000
transitions between states such that we again maximize the likelihood of the training data?
00:48:23.000 --> 00:48:30.000
Okay, let's look at how we compute the probability of some observation. Basically,
00:48:30.000 --> 00:48:37.000
we have the sequence of symbols, and of course we can generate all possible state sequences,
00:48:38.000 --> 00:48:46.000
so we have all possible paths, and then we look at these paths, multiply all the
00:48:47.000 --> 00:48:53.000
transition probabilities along each path, and then sum over all the paths, and then we have the overall
00:48:53.000 --> 00:48:59.000
probability of the sequence. This works mathematically, but it's not efficient, because
00:48:59.000 --> 00:49:10.000
if you have n states and t is the length of the sequence, then the complexity is t times n to
00:49:10.000 --> 00:49:20.000
the power of t, just because you have so many paths:
00:49:20.000 --> 00:49:28.000
there are n to the power of t paths, and each costs on the order of t, so it's far too expensive. Okay,
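To make the blow-up tangible, here is a brute-force sketch on a toy HMM. It makes assumptions: a two-state transition-reading formulation (reading a symbol and moving to the next state happen in one transition, no epsilon transitions), and made-up probabilities.

```python
# Brute-force HMM sequence probability: enumerate every state path.
# pi is the initial distribution; trans[i][s][j] is the probability of
# reading symbol s in state i and moving to state j (numbers illustrative).
from itertools import product

states = [0, 1]
pi = [0.6, 0.4]
trans = {
    0: {"a": [0.3, 0.4], "b": [0.2, 0.1]},
    1: {"a": [0.1, 0.2], "b": [0.5, 0.2]},
}

def naive_prob(seq):
    total = 0.0
    # N**(T+1) paths: one state before, between and after every symbol,
    # each path costing O(T) multiplications -- the exponential blow-up.
    for path in product(states, repeat=len(seq) + 1):
        p = pi[path[0]]
        for t, s in enumerate(seq):
            p *= trans[path[t]][s][path[t + 1]]
        total += p
    return total

# Normalization over ALL sequences of a given length and ALL paths is one,
# as discussed above for the probabilistic model.
total = sum(naive_prob(x + y) for x in "ab" for y in "ab")
assert abs(total - 1.0) < 1e-9
```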
00:49:28.000 --> 00:49:35.000
that's why we need something more efficient. You have heard about HMMs, but who
00:49:35.000 --> 00:49:41.000
has heard about forward-backward and Viterbi already? Okay, much fewer people; so let's do
00:49:41.000 --> 00:49:47.000
this in a little more detail. Now the quest is: how can we actually do this efficiently?
00:49:48.000 --> 00:49:52.000
This is a dynamic programming concept. You remember dynamic programming from
00:49:53.000 --> 00:49:59.000
the algorithms and data structures lecture or similar lectures, where you intelligently
00:49:59.000 --> 00:50:07.000
reuse partial results in order to compute the entire solution, and you reuse them, you don't
00:50:07.000 --> 00:50:14.000
recompute them all over again. Here, what we're using is common sub-paths, which we
00:50:14.000 --> 00:50:21.000
combine into longer sub-paths, and for this we have to make a few definitions, so just bear with
00:50:21.000 --> 00:50:34.000
me. There's a concept called the forward probability: we write alpha,
00:50:34.000 --> 00:50:43.000
with a state index i; this is not a time index but rather, if you enumerate these states,
00:50:43.000 --> 00:50:50.000
the state index. Alpha i at time step t is the probability,
00:50:51.000 --> 00:50:55.000
if you have seen the sequence up to t minus one,
00:50:57.000 --> 00:51:04.000
that your state at time t is the state with index i;
00:51:04.000 --> 00:51:16.000
so, the probability of finding yourself in a particular state z i at time step t for a given
00:51:16.000 --> 00:51:26.000
sequence. And if you then use this concept and apply it over the entire sequence,
00:51:27.000 --> 00:51:34.000
then at time step capital T plus one, which is basically
00:51:35.000 --> 00:51:40.000
once you're done with the consumption of the entire sequence, you will get the probability of
00:51:40.000 --> 00:51:49.000
being in each state after you have consumed the entire input. Likewise, you can do this
00:51:49.000 --> 00:52:01.000
in a backward fashion: we call beta i t the probability
00:52:02.000 --> 00:52:14.000
of finding yourself at time step t in state i if you go backwards, if you walk
00:52:14.000 --> 00:52:19.000
all the paths backwards, and therefore you're looking at the sequence from the last
00:52:20.000 --> 00:52:29.000
symbol of the observation sequence back to the symbol that is currently at your time
00:52:29.000 --> 00:52:35.000
step of interest. Okay, these are just definitions; now let's do the forward procedure. The
00:52:35.000 --> 00:52:43.000
procedure goes as follows. What is the probability of being in a state before we actually consume
00:52:43.000 --> 00:52:52.000
a symbol? Well, this is just the initial state distribution. Remember we had this
00:52:54.000 --> 00:53:00.000
pi, which basically gives us a distribution over which state to use as the start state.
00:53:01.000 --> 00:53:08.000
So at time step one, when we haven't seen any symbols yet, this is the forward
00:53:09.000 --> 00:53:18.000
probability for all states. Then we have an induction step: if we want to learn the probability alpha,
00:53:18.000 --> 00:53:24.000
in the forward procedure, of being in the state with index j at time step t plus one,
00:53:26.000 --> 00:53:33.000
we have to ask ourselves: how can we get there? Well, we have to use a transition into the state with
00:53:33.000 --> 00:53:41.000
index j, and of course we have to consume the symbol that is currently observed, which is the symbol
00:53:41.000 --> 00:53:51.000
at time step t, and we could come from all states (you can come from the
00:53:51.000 --> 00:53:58.000
same state as well, so we can come from all states). And what is the probability of being there? Well,
00:53:58.000 --> 00:54:05.000
it is just the probability of being in a given state at the previous time step.
00:54:07.000 --> 00:54:14.000
Now, if we sum over all possible states from which we can get to this new state at the new time step,
00:54:15.000 --> 00:54:23.000
this is how we fill our table of forward probabilities. So here is
00:54:24.000 --> 00:54:32.000
a typical dynamic programming pattern, where you use previous results for the next time
00:54:32.000 --> 00:54:39.000
step. And then, what is the actual result? Mind you, we wanted the probability of the
00:54:39.000 --> 00:54:45.000
entire observation sequence given the model. The probability of the sequence is then obtained by
00:54:45.000 --> 00:54:54.000
looking at the probability of being in each of these states at the last time step,
00:54:56.000 --> 00:55:00.000
and then we sum over all these states, and then we have basically covered all the paths.
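The induction just described can be sketched on the same style of toy HMM (a transition-reading formulation with made-up numbers; both are assumptions for illustration):

```python
# Forward procedure: alpha[j] holds the probability of having read the input
# so far and being in state j. pi is the initial distribution; trans[i][s][j]
# is the probability of reading symbol s in state i and moving to state j.
states = [0, 1]
pi = [0.6, 0.4]
trans = {
    0: {"a": [0.3, 0.4], "b": [0.2, 0.1]},
    1: {"a": [0.1, 0.2], "b": [0.5, 0.2]},
}

def forward_prob(seq):
    alpha = list(pi)                # base case: nothing consumed yet
    for s in seq:                   # induction: one symbol at a time
        alpha = [
            sum(alpha[i] * trans[i][s][j] for i in states)  # sum over sources
            for j in states
        ]
    return sum(alpha)               # finally, sum over all last states

p = forward_prob("ab")              # probability of "ab", summed over all paths
assert 0.0 <= p <= 1.0
```

Per symbol this is a double loop over the states, which is exactly the t times n squared cost, instead of enumerating n to the power of t paths.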
00:55:02.000 --> 00:55:10.000
And if we look at how many computations we need: this is the
00:55:11.000 --> 00:55:16.000
expensive part here, because what we do is loop over all states
00:55:18.000 --> 00:55:23.000
for a single index, and of course we have to do this for all indices, so this is
00:55:23.000 --> 00:55:30.000
quadratic in the number of states. And how often do we have to do it? As many times as the
00:55:30.000 --> 00:55:38.000
sequence is long, which is why the complexity is t, the length of the sequence, times n squared, the number
00:55:38.000 --> 00:55:44.000
of states squared, and this is of course much better than the exponential explosion that we
00:55:44.000 --> 00:55:50.000
saw in the naive case. And if you want to do it backwards, we can do it backwards:
00:55:51.000 --> 00:56:00.000
same thing, just be careful, if you look at this at home, that the initial state
00:56:00.000 --> 00:56:07.000
distribution is now here and not up there; but that's a small detail. Okay, and we can actually combine these:
00:56:07.000 --> 00:56:17.000
we could also write the probability of a sequence by summing over all possible states
00:56:18.000 --> 00:56:30.000
at any given time step, combining the forward and the backward probabilities. Okay, so this is basically
00:56:30.000 --> 00:56:38.000
an efficient way of computing the probability of a sequence. What we also want to use this for
00:56:38.000 --> 00:56:46.000
is what is called decoding, and decoding is whenever you have an input, which is a code,
00:56:47.000 --> 00:56:55.000
and you want to find some structural information in it, which is then decoded. In translation
00:56:55.000 --> 00:57:01.000
you have decoders, and you're going to see an application of HMMs in the next chapter,
00:57:01.000 --> 00:57:07.000
which is part-of-speech tagging, which basically assigns word classes to the words in a sequence,
00:57:07.000 --> 00:57:13.000
word classes such as noun, verb, adjective, these kinds of things, and we're going to associate this
00:57:13.000 --> 00:57:22.000
information with the actual states in the HMM. So decoding is finding the best state sequence:
00:57:22.000 --> 00:57:29.000
which sequence through our HMM is actually the most likely one, given the input?
00:57:31.000 --> 00:57:38.000
So, formally, what we're going to do is the following: we want to find the maximum path,
00:57:39.000 --> 00:57:47.000
which is the argmax over paths, so over state sequences
00:57:47.000 --> 00:57:54.000
in time, maximizing the probability, given that we have this input sequence
00:57:54.000 --> 00:58:04.000
s1 to st, that this is actually the maximum path, i.e. this state sequence. And this is,
00:58:06.000 --> 00:58:11.000
if you're taking the argmax, equivalent to the joint probability: this is a conditional
00:58:12.000 --> 00:58:17.000
probability, and you can rewrite it with Bayes' rule into a joint probability
00:58:19.000 --> 00:58:24.000
divided by a factor that is the same for every path, namely the probability of the sequence itself,
00:58:24.000 --> 00:58:31.000
which we're not interested in. So if you're doing argmax, it will still maximize the same
00:58:31.000 --> 00:58:40.000
thing, so we can look at the joint probability here. And now we use a dynamic programming solution,
00:58:41.000 --> 00:58:45.000
which is called the Viterbi algorithm. Who has heard about the Viterbi algorithm before?
00:58:45.000 --> 00:58:50.000
This is something you usually learn together with HMMs, and it
00:58:52.000 --> 00:58:59.000
gets you effectively and computationally efficiently to the maximum sequence. For
00:58:59.000 --> 00:59:10.000
this we define our little delta: delta, for every
00:59:10.000 --> 00:59:19.000
state index i at time step t, which does not record the probability of being in this state
00:59:20.000 --> 00:59:25.000
(that was alpha, the probability of being in the state at the time step),
00:59:25.000 --> 00:59:33.000
but the probability of the maximum path to this state. So in the alpha case, in the forward
00:59:33.000 --> 00:59:39.000
procedure, we basically looked at all the possible paths to get there, and for this delta we only look
00:59:39.000 --> 00:59:49.000
at the maximum path, and therefore we're not summing over all possible state sequences but
00:59:49.000 --> 00:59:58.000
maximizing, such that at time step t we're looking at this sequence of states
00:59:59.000 --> 01:00:07.000
and this input sequence, and we're interested in being in the state with index i at time
01:00:07.000 --> 01:00:14.000
step t, and we want the maximum one. And this is the Viterbi algorithm, very similar to
01:00:14.000 --> 01:00:24.000
the forward procedure, except that you don't use a sum but a max. So let's walk through it again:
01:00:25.000 --> 01:00:33.000
delta at time step one, for state index i: we haven't seen any of the
01:00:34.000 --> 01:00:41.000
sequence yet, so this is basically the initial state distribution; it basically
01:00:41.000 --> 01:00:50.000
follows from our pi i for each state. And now, if you're interested in the probability of
01:00:50.000 --> 01:01:03.000
the maximum path leading to state j at time step t plus one, we look at the maximum paths arriving
01:01:03.000 --> 01:01:14.000
at every state i, we look at the probability of going from state i to the state j we're interested
01:01:14.000 --> 01:01:27.000
in under consumption of the symbol at time step t, and we decide on taking the maximum one,
01:01:27.000 --> 01:01:35.000
basically just memorize the maximum path and so that we don't forget what the maximum path was
01:01:37.000 --> 01:01:46.000
we um we store a so-called backtrace matrix so in this uh uh phi j here we basically
01:01:47.000 --> 01:01:57.000
memorize where we came from so we memorize uh what was actually uh the state
01:01:59.000 --> 01:02:06.000
z i we came from in order to create the new delta j
01:02:08.000 --> 01:02:12.000
we're going to have an example in a minute where we actually see how this is executed and why it's
01:02:12.000 --> 01:02:22.000
important and then let's look at the end of it so basically we uh want to know okay what is
01:02:22.000 --> 01:02:31.000
the maximum path going over a sequence of states so what this computes is basically for every
01:02:31.000 --> 01:02:40.000
possible state in our list of states what is the maximum path per state and uh at the last time
01:02:40.000 --> 01:02:48.000
step we basically take the largest value we find so the largest delta i for all i's and this is
01:02:48.000 --> 01:02:56.000
basically our last state and then we can read the sequence backwards because we memorized it
01:02:57.000 --> 01:03:03.000
and actually multiply out the probabilities and emit the probability of the state okay
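the procedure just described can be sketched in a few lines of python; this is a minimal sketch assuming an arc-emission hmm as used in the lecture, where symbols are consumed on transitions, and the dictionary encoding `trans[(i, sym, j)]` and the function name are my own:

```python
def viterbi(obs, states, pi, trans):
    """Most probable state path for an observation sequence.

    Arc-emission HMM as in the lecture: trans[(i, sym, j)] is the
    probability of going from state i to state j while reading sym,
    and pi[i] is the initial state distribution.
    """
    # delta[i]: probability of the best path ending in state i
    delta = dict(pi)          # time step 1: nothing read yet
    back = []                 # backtrace: one phi per consumed symbol
    for sym in obs:
        new_delta, phi = {}, {}
        for j in states:
            # like the forward procedure, but with max instead of sum
            best_i, best_p = None, 0.0
            for i in states:
                p = delta[i] * trans.get((i, sym, j), 0.0)
                if p > best_p:
                    best_i, best_p = i, p
            new_delta[j], phi[j] = best_p, best_i
        delta = new_delta
        back.append(phi)
    # termination: take the largest delta, then read the path backwards
    last = max(states, key=lambda i: delta[i])
    path = [last]
    for phi in reversed(back):
        path.append(phi[path[-1]])
    return delta[last], path[::-1]
```

with the b-arcs from the upcoming example (n to n 0.2, n to v 0.1, v to n 0.1, v to v 0.5, start in n), reading a single b yields best-path probability 0.2 ending in n, matching the value computed on the slide.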
01:03:04.000 --> 01:03:10.000
in case you wonder about ties where if you take the maximum
01:03:10.000 --> 01:03:16.000
maybe there are several ones that are equally good there's different ways
01:03:16.000 --> 01:03:22.000
of approaching this so sometimes people store n-best lists some people just resolve this
01:03:22.000 --> 01:03:29.000
randomly for most nlp applications it actually doesn't really matter so much at the end of
01:03:29.000 --> 01:03:42.000
the day okay let's do an example so this is a hidden markov model
01:03:43.000 --> 01:03:47.000
what is hidden about it well we have two states one is called n the other one's called v
01:03:48.000 --> 01:03:56.000
and it's definitely the case that uh if i tell you a sequence of a's and b's this is our language
01:03:57.000 --> 01:04:05.000
uh you cannot tell me what was the path i used to generate it um and we have the initial state
01:04:05.000 --> 01:04:10.000
distribution which basically says this is the start state which makes it simpler okay
01:04:11.000 --> 01:04:20.000
so how do we determine now what is the likeliest
01:04:21.000 --> 01:04:29.000
sequence of states for a given input okay so the observation sequence is b b b a
01:04:30.000 --> 01:04:36.000
and uh for this we basically have to keep track for each state in every time step
01:04:37.000 --> 01:04:47.000
what is the maximum path to get to this state so for the first time step we haven't read any input
01:04:48.000 --> 01:04:54.000
the probability of being in n is one the probability of being in v is zero
01:04:55.000 --> 01:05:05.000
now we have read b we can go to n or we can go to v
01:05:08.000 --> 01:05:09.000
and we can come from n
01:05:10.000 --> 01:05:19.000
or we can come from v and what we record here is always the maximum path that ends in n
01:05:19.000 --> 01:05:26.000
and here's the maximum path that ends in v so how could we go with a b to n
01:05:29.000 --> 01:05:39.000
so we can go with a b either like this to n or we can go with a b to n like this
01:05:39.000 --> 01:05:50.000
this has probability 0.1 the probability of being in v before is zero so this is zero okay
01:05:52.000 --> 01:05:58.000
the probability of being in n before was one the probability of then using this one is 0.2 so this
01:05:58.000 --> 01:06:07.000
is why the max path after this one to an n is 0.2 likewise let's look at how can we arrive
01:06:08.000 --> 01:06:15.000
in a max path to v well we could either have walked from v to v so this would be zero times
01:06:15.000 --> 01:06:29.000
0.5 or we could have walked from n via a b to v which is one times 0.1 which is larger than zero
01:06:29.000 --> 01:06:38.000
this is why we put the 0.1 here and this is the maximum sequence okay then we go on so if you see
01:06:38.000 --> 01:06:48.000
the second b we could either come from here or from here and what we basically do is okay
01:06:49.000 --> 01:06:58.000
we do this times the transition probability of n to n we do this times the transition probability
01:06:58.000 --> 01:07:05.000
of v to n and we take the larger one and in this case the larger one is 0.2 times 0.2
01:07:05.000 --> 01:07:14.000
which is 0.04 which is why we decide on copying the n here and the last n is always given because
01:07:14.000 --> 01:07:22.000
this is the n sequence it ends in n and likewise we do the same thing down here and we can do this
01:07:22.000 --> 01:07:28.000
for all the steps and yeah maybe what is interesting like up to here it looks like this
01:07:28.000 --> 01:07:34.000
n sequence is always like more probable and then it kind of turns around so in the end these two
01:07:34.000 --> 01:07:40.000
sequences are actually equally likely and i'm not going to walk you through all the steps now you
01:07:40.000 --> 01:07:47.000
could do this at home but the important idea is basically to realize
01:07:47.000 --> 01:07:53.000
that we always look at where we end up what state we are arriving in and what the maximum
01:07:53.000 --> 01:08:02.000
path to this state is during the process and why is this more efficient because we have this
01:08:04.000 --> 01:08:10.000
search space if you would enumerate all the paths then we would actually have this giant
01:08:11.000 --> 01:08:17.000
tree of possibilities and what viterbi actually does is to cut hopeless paths
01:08:18.000 --> 01:08:25.000
from further exploration so whereas in this inefficient case of computing
01:08:27.000 --> 01:08:33.000
the maximum path by enumerating everything what we would have is in this case a full
01:08:33.000 --> 01:08:37.000
binary tree it's just binary because we have two states if we had more states it would
01:08:37.000 --> 01:08:43.000
fork even more and it would be the full tree and this is why it's exponential
01:08:43.000 --> 01:08:50.000
and what viterbi does is basically keep this always down to two previous
01:08:51.000 --> 01:08:59.000
or rather number-of-states previous possibilities explore state by state and then prune again
01:08:59.000 --> 01:09:05.000
and pick the maximum one so this is why it's computationally more efficient just to give
01:09:05.000 --> 01:09:12.000
you a different perspective on that okay are there more questions for forward backward and
01:09:12.000 --> 01:09:21.000
viterbi no okay yes the question is whether the result of the viterbi algorithm is always
01:09:21.000 --> 01:09:27.000
smaller than the result of the backward or forward algorithm right yes smaller or equal but usually
01:09:27.000 --> 01:09:40.000
smaller yes okay let's do hmm training so what is training here we have a
01:09:40.000 --> 01:09:47.000
fixed structure so the number of states is fixed right this is a hyperparameter and now we want
01:09:47.000 --> 01:09:51.000
to optimize the parameters so the parameters are the transition probabilities on these arcs
01:09:52.000 --> 01:09:56.000
according to the observations and what we're going to look at now is baum-welch the forward
01:09:56.000 --> 01:10:03.000
backward algorithm training and uh the idea is basically an idea that we also see with other
01:10:04.000 --> 01:10:08.000
training algorithms this is why i do this a little bit in more detail because it's still
01:10:08.000 --> 01:10:16.000
understandable here the idea is that you initialize them somehow randomly and try to improve them
01:10:17.000 --> 01:10:22.000
and uh i didn't say this explicitly but in the neural language modeling
01:10:22.000 --> 01:10:27.000
yeah you also have to start with some parameters so you initialize them randomly right um
01:10:29.000 --> 01:10:36.000
and what does it mean to improve them you basically have to have a procedure that actually
01:10:36.000 --> 01:10:42.000
is guaranteed to uh converge to something better and this could be either because it has some
01:10:42.000 --> 01:10:48.000
mathematical properties to converge to something better or because you do a trial error approach
01:10:48.000 --> 01:10:53.000
you could also say okay let's randomly wiggle a little bit let's you know change them a little bit
01:10:53.000 --> 01:10:57.000
see if it's better if it's not we go back right this would also converge to something better in
01:10:57.000 --> 01:11:03.000
the end although it's not very directed um and of course the question is why is this necessary
01:11:04.000 --> 01:11:11.000
because in the markov chain it wasn't necessary right we could just
01:11:12.000 --> 01:11:17.000
count the number of transitions number of n grams number of conditional probabilities
01:11:19.000 --> 01:11:26.000
why do we have to do this random stuff here and for this we have to take a look at these
01:11:26.000 --> 01:11:32.000
formulas again so let's recap this is a markov chain why is it a markov chain because it's
01:11:32.000 --> 01:11:39.000
deterministic so you never have two possibilities for the same input symbol
01:11:41.000 --> 01:11:48.000
um and if you now fix this architecture let's fix it like this and we have a training sequence
01:11:49.000 --> 01:11:55.000
what we basically have to do is we just have to count how often do we walk which path
01:11:57.000 --> 01:12:03.000
and yeah it's deterministic so we can see what the path was and then we can normalize appropriately
01:12:03.000 --> 01:12:11.000
right so how do we train this basically we run the sequence in this case a b b a a b a
01:12:12.000 --> 01:12:23.000
through this thing and then we observe okay we went from n to v using a five times we went from n
01:12:23.000 --> 01:12:34.000
to n using b three times these are all the outgoing arcs of n so that's eight right so
01:12:34.000 --> 01:12:40.000
let's then call this five over eight and this three over eight and likewise for the outgoing
01:12:40.000 --> 01:12:47.000
arcs of v we have a total of four and two times we used this one and two times we used
01:12:47.000 --> 01:12:56.000
this one this is why they both get the same probability of two over four or 0.5
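for a visible deterministic chain this counting scheme is easy to code; a minimal sketch, where the deterministic transition function and any concrete numbers used with it are hypothetical, not the exact automaton on the slide:

```python
from collections import Counter

def train_markov_chain(start, step, seq):
    """Maximum likelihood training of a visible, deterministic chain:
    run the sequence through, count every transition that was walked,
    and normalize per source state. step(state, sym) returns the
    unique successor state, which is what makes counting possible."""
    counts = Counter()
    state = start
    for sym in seq:
        nxt = step(state, sym)
        counts[(state, sym, nxt)] += 1
        state = nxt
    # total outgoing count per source state, for normalization
    totals = Counter()
    for (i, sym, j), c in counts.items():
        totals[i] += c
    return {arc: c / totals[arc[0]] for arc, c in counts.items()}
```

each returned value is the count of an arc divided by the total count of arcs leaving its source state, exactly the five-over-eight style normalization described above.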
01:12:56.000 --> 01:13:03.000
but with an hmm we can't do this because it's not deterministic so how do we do this if we have
01:13:03.000 --> 01:13:08.000
you know multiple possible paths are we using like do we assume that all possible transitions
01:13:08.000 --> 01:13:14.000
are used at the same time and how do we actually start the process let's look at
01:13:14.000 --> 01:13:22.000
this a little bit closer let's simplify the problem let's not make this arbitrary with all
01:13:22.000 --> 01:13:28.000
possible paths but let's focus on a particular transition in a particular path
01:13:29.000 --> 01:13:36.000
so maybe we have only two possible paths through an hmm and let's say the first path has a
01:13:36.000 --> 01:13:43.000
probability of one third using a particular transition t one and the second path has the
01:13:43.000 --> 01:13:49.000
probability of two thirds using a particular transition t two and now if we observe
01:13:51.000 --> 01:13:57.000
that we have these inputs and we have these two possible paths we could actually increase
01:13:58.000 --> 01:14:05.000
the count of this t one by one third because the probability that we took it was one third
01:14:05.000 --> 01:14:10.000
and we can increase the fractional count uh of the t two by two thirds
01:14:10.000 --> 01:14:16.000
makes sense yeah why not right we could do fractional counts who cares doesn't have to be
01:14:16.000 --> 01:14:22.000
integers and the likelier it is that we take it the more it gets increased if we actually observe
01:14:22.000 --> 01:14:27.000
something that would actually use these paths okay uh in the general case it looks like this so we
01:14:27.000 --> 01:14:35.000
want to count the transition from state i to some state j under consumption of a particular input
01:14:35.000 --> 01:14:48.000
symbol and there you would actually have to sum over all the state sequences that you can
01:14:49.000 --> 01:15:00.000
take in order to consume a given input sequence and then you look at how often you actually take
01:15:00.000 --> 01:15:14.000
this particular transition so this actually gives you the probability of uh the particular paths
01:15:16.000 --> 01:15:23.000
and this one gives you how often this transition is occurring in this path and uh
01:15:23.000 --> 01:15:35.000
now if you want to count this you have to normalize by the
01:15:37.000 --> 01:15:44.000
probability of the observation itself in order to get the conditional probability
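written out, the expected count described here can be reconstructed as follows; the notation is my own and may differ from the slides:

```latex
% expected (fractional) count of taking the arc i -> j while
% consuming symbol a, summing over all state sequences Z that
% can consume the observation sequence O under model \lambda:
\hat{c}(i \xrightarrow{a} j)
  = \frac{1}{P(O \mid \lambda)}
    \sum_{Z} P(O, Z \mid \lambda)\, n_{i \xrightarrow{a} j}(Z)
% n(Z) counts how often the arc occurs in path Z, and dividing
% by P(O | \lambda) turns the joint path probabilities into
% conditional ones. the re-estimated transition probability then
% renormalizes these counts over all outgoing arcs of state i:
\hat{P}(i \xrightarrow{a} j)
  = \frac{\hat{c}(i \xrightarrow{a} j)}
         {\sum_{a',\, j'} \hat{c}(i \xrightarrow{a'} j')}
```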
01:15:46.000 --> 01:15:50.000
so if you want to turn the joint probability into the conditional one you have to
01:15:50.000 --> 01:15:59.000
normalize it out that just follows from the probability computation rules so let's look at this again the count
01:16:02.000 --> 01:16:11.000
is actually determined by how probable the path is and how often the transition is in the
01:16:11.000 --> 01:16:21.000
path and for the probability of the path given the uh input sequence we have to normalize by
01:16:21.000 --> 01:16:31.000
the probability of this input sequence and uh yeah so actually what is um
01:16:31.000 --> 01:16:41.000
the probability of a transition well what
01:16:41.000 --> 01:16:51.000
we do is we count how often we saw it and we count how often we saw total outgoing transitions
01:16:51.000 --> 01:16:59.000
from state i which gives us this probability distribution over the outgoing edges of state i
01:17:00.000 --> 01:17:08.000
and now it becomes circular because for computing the count we have to compute the probability of
01:17:08.000 --> 01:17:13.000
the path normalized by the probability of the observation sequence times how often we saw it and the
01:17:15.000 --> 01:17:19.000
probability of the observation sequence we actually know right it's the forward
01:17:19.000 --> 01:17:26.000
procedure so we can sum over all possible state sequences and then do the multiplication of all
01:17:26.000 --> 01:17:32.000
the probabilities on the single time steps and for this we need the transition probabilities
01:17:32.000 --> 01:17:40.000
and the transition probabilities are actually defined as how often have we seen this one as
01:17:40.000 --> 01:17:47.000
opposed to all other outgoing ones and there it becomes circular right and this is why you have
01:17:47.000 --> 01:17:53.000
to kick it off somehow and this is why you need this kind of uh random initialization and the
01:17:53.000 --> 01:17:59.000
procedure that makes sure that you actually go from your random initialization to something
01:17:59.000 --> 01:18:08.000
more useful in some iterations and in this case uh the algorithm used for this is called expectation
01:18:08.000 --> 01:18:15.000
maximization who has heard of expectation maximization before okay very few people okay so
01:18:15.000 --> 01:18:22.000
the idea is basically a kind of bootstrapping it pulls you out of the swamp by your own
01:18:23.000 --> 01:18:30.000
bootstrap right so what you do is basically you kind of want to
01:18:32.000 --> 01:18:40.000
converge to something better in iterative steps and at the top level expectation
01:18:40.000 --> 01:18:48.000
maximization looks like this so basically what you want to optimize is cross entropy so basically
01:18:49.000 --> 01:18:55.000
a measure for how well your model actually models the sequence whether you call this cross
01:18:55.000 --> 01:19:04.000
entropy or likelihood it's conceptually the same thing and initially the old value is infinite and you
01:19:04.000 --> 01:19:11.000
want to lower this right so like lower is better for cross entropy and then we guess the parameters
01:19:11.000 --> 01:19:18.000
we basically roll the dice for a random initialization and then we have some re-estimate
01:19:18.000 --> 01:19:27.000
parameters function and as long as we still get
01:19:27.000 --> 01:19:33.000
improvements as long as the new cross entropy is not basically the same as the old one
01:19:34.000 --> 01:19:41.000
we basically memorize the new one as the old one and get a new one by
01:19:41.000 --> 01:19:51.000
re-estimating the parameters and computing the cross entropy and now it's not enough to say okay
01:19:51.000 --> 01:19:55.000
we have this problem and then we use expectation maximization what you actually have to do and
01:19:55.000 --> 01:20:04.000
this is what the big thing with baum-welch was is actually to prove that the method that
01:20:06.000 --> 01:20:15.000
you're using here is actually guaranteed to lower the cross entropy
01:20:16.000 --> 01:20:22.000
so if it's guaranteed that this re-estimate parameter step lowers the cross entropy then
01:20:23.000 --> 01:20:30.000
the algorithm converges to a local optimum minimum or maximum depending on whether you optimize
01:20:31.000 --> 01:20:38.000
towards lower cross entropy or higher likelihood it's interchangeable and it has been
01:20:38.000 --> 01:20:45.000
proven and this is why it works but i spare you the proof i just show you actually how it's done
01:20:46.000 --> 01:20:54.000
on an example so let's say this is actually the true HMM we're cheating now we know what
01:20:54.000 --> 01:21:01.000
actually the result should be because we used this one to generate a sequence and we generated a
01:21:01.000 --> 01:21:07.000
sequence a b a b b it's a very short sequence usually you train this on long sequences but
01:21:07.000 --> 01:21:13.000
whatever for the small example and we can also initialize the same architecture randomly so what
01:21:13.000 --> 01:21:19.000
we have to set is all the free parameters and actually there are three free parameters
01:21:20.000 --> 01:21:24.000
this one this one and this one this has to be one because it's the only outgoing arc
01:21:26.000 --> 01:21:32.000
so you don't have to like put random numbers everywhere you have to put random numbers
01:21:32.000 --> 01:21:39.000
wherever there are still free parameters and also these three guys actually are depending
01:21:39.000 --> 01:21:44.000
on each other right because the sum is one so you basically would have two random ones
01:21:44.000 --> 01:21:50.000
and the third one would basically follow from that okay and um let's say we have a training
01:21:50.000 --> 01:21:56.000
sequence a b a b b and for the sake of simplicity we're now enumerating all the possible paths
01:21:57.000 --> 01:22:01.000
which is not how it's done because of course enumerating all the possible paths for large
01:22:02.000 --> 01:22:08.000
HMMs and large numbers of input symbols will become intractable again but in this case we
01:22:08.000 --> 01:22:20.000
can because it's just four paths we can either go from n to v and back again or we can not go via v
01:22:21.000 --> 01:22:26.000
with this a b combination so every a b allows you basically to go to v and back
01:22:28.000 --> 01:22:35.000
but you could also stay in n right so this is why for an input sequence with two a b's you can
01:22:35.000 --> 01:22:44.000
either go to v once or twice or not and if you go once you can decide whether you go with the
01:22:44.000 --> 01:22:51.000
first one or with the second one and this is what's reflected here and uh yeah what we do is
01:22:51.000 --> 01:23:07.000
we basically carry out a procedure where we look at the current probabilities
01:23:08.000 --> 01:23:15.000
of these paths and do fractional counting of how often we visit things according to the
01:23:15.000 --> 01:23:22.000
current probability of these paths so basically we're using this idea of fractional counting
01:23:24.000 --> 01:23:29.000
and we start randomly and then we renormalize and this actually gives us
01:23:30.000 --> 01:23:35.000
an improved model and i spare you the proof that it actually really always improves but it does
01:23:36.000 --> 01:23:40.000
okay so what do we have to do here we have to look at all the paths
01:23:40.000 --> 01:23:48.000
and we have to compute the probability of the path which is simple we just basically
01:23:50.000 --> 01:23:55.000
go like okay this is our path n v n v n n and
01:23:55.000 --> 01:24:09.000
then this times this times this times this times this and the result is this number here okay
01:24:10.000 --> 01:24:20.000
and we do this for all paths and then we look at the probability for all the transitions
01:24:20.000 --> 01:24:30.000
and the probability for all the transitions is actually the probability of the path
01:24:30.000 --> 01:24:42.000
times the number of times we took this transition so um the probability of going to v under observation
01:24:42.000 --> 01:24:55.000
of a and being in state n is well for this path n v n v n n we do this two times right
01:24:55.000 --> 01:25:04.000
we always go from n to v when we see the a we do this here and we do this here so the training sequence
01:25:04.000 --> 01:25:16.000
was a b a b b remember so this is basically two times this one okay in this path we see it only once
01:25:16.000 --> 01:25:20.000
and in this one we see it zero times this is why zero times this one is zero okay and we do this
01:25:21.000 --> 01:25:31.000
for all the possible transitions and now we sum over all paths so the current probability of the
01:25:31.000 --> 01:25:41.000
sequence under the given model is the sum over the probabilities of all the paths which is the
01:25:41.000 --> 01:25:52.000
sum of this one and we can also sum this up for these so this is
01:25:52.000 --> 01:25:58.000
the entire probability mass that is actually assigned to each one of them and this is not
01:25:58.000 --> 01:26:06.000
a probability because we have to normalize this to make this a probability so we have to normalize
01:26:06.000 --> 01:26:13.000
so that all the outgoing edges and here we're talking about these arcs actually
01:26:13.000 --> 01:26:25.000
sum up to one so what we do is basically we sum up everything that goes out of n so this
01:26:25.000 --> 01:26:36.000
one this one and this one and we divide these by the respective sum and we also do this for
01:26:36.000 --> 01:26:41.000
the v state but this is trivial because there is only one arc and we divide it by itself so
01:26:41.000 --> 01:26:49.000
this results in one and this divided by the sum of everything that goes out of n is this
01:26:50.000 --> 01:26:56.000
this divided by this is this this divided by this is this so what we get is basically again
01:26:57.000 --> 01:27:07.000
a probability distribution outgoing edges from n sum up to one outgoing edges from v sum up to one
01:27:08.000 --> 01:27:15.000
and this is the new probability we're going to set for these transitions
01:27:16.000 --> 01:27:22.000
so we have 0.06, 0.36 and 0.58 and we started with
01:27:24.000 --> 01:27:36.000
0.04, 0.48 and 0.48 so it moved in a different direction and we can do this again and if you
01:27:36.000 --> 01:27:42.000
do this again then we get a bunch of numbers down here same uh computation and so this transition
01:27:42.000 --> 01:27:50.000
here actually becomes larger it goes to 0.1 and this one gets lower and lower and this one
01:27:50.000 --> 01:27:59.000
also gets a little lower so basically uh yeah it adjusts and what happens is that the probability
01:27:59.000 --> 01:28:08.000
of the sequence actually goes up so the probability of the sequence for our initial random estimate
01:28:08.000 --> 01:28:17.000
here the probability of the sequence was uh where do we have it yeah it's the sum of all paths so
01:28:17.000 --> 01:28:22.000
this is the probability of the sequence the probability of the sequence for the next iteration
01:28:22.000 --> 01:28:27.000
is already higher if you want to do this at home the probability for the next one should be this one
01:28:28.000 --> 01:28:34.000
and uh yeah it gradually converges to some local maximum
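one re-estimation step of the kind just walked through can be sketched by brute-force path enumeration, as in the slide example (the real algorithm computes the same counts with the forward-backward procedure instead of enumerating paths); the arc encoding `trans[(i, sym, j)]`, the function name, and any concrete numbers used with it are my own, not the slide's:

```python
from itertools import product

def baum_welch_step(trans, states, start, seq):
    """One Baum-Welch re-estimation step by enumerating all paths.

    Feasible only for toy HMMs: the number of paths is exponential.
    trans[(i, sym, j)] holds the current arc probabilities.
    Returns (new transition table, probability of seq under trans).
    """
    counts = {arc: 0.0 for arc in trans}
    total = 0.0                       # P(seq) = sum over all paths
    for path in product(states, repeat=len(seq)):
        p, state, used = 1.0, start, []
        for sym, nxt in zip(seq, path):
            p *= trans.get((state, sym, nxt), 0.0)
            used.append((state, sym, nxt))
            state = nxt
        total += p
        if p > 0.0:                   # fractional counting: each use of
            for arc in used:          # an arc is weighted by path prob
                counts[arc] += p
    # renormalize so the outgoing arcs of each state sum to one again
    out = {i: sum(c for (s, _, _), c in counts.items() if s == i)
           for i in states}
    new = {arc: (c / out[arc[0]] if out[arc[0]] > 0 else trans[arc])
           for arc, c in counts.items()}
    return new, total
```

iterating this, the returned sequence probability does not decrease from one step to the next, which is exactly the EM convergence property mentioned above.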
01:28:34.000 --> 01:28:42.000
but it's a problem it's only local it can't decide at critical points so there are
01:28:42.000 --> 01:28:48.000
situations where it doesn't know where to go and gets stuck small training sequences are a problem
01:28:48.000 --> 01:28:56.000
and it's a local greedy search not a global optimizer but i think in the interest of time
01:28:57.000 --> 01:29:04.000
i have to stop here and continue next week are there any more questions immediately for this one
01:29:06.000 --> 01:29:09.000
okay then thank you very much see you next week