The paper that I found enlightening with respect to expectation maximization is Bayesian K-Means as a "Maximization-Expectation" Algorithm (pdf) by Welling and Kurihara.
Suppose we have a probabilistic model p(x,z,θ) with observations x, hidden random variables z, and parameters θ. We are given a dataset D and are forced (by higher powers) to establish p(z,θ|D).
1. Gibbs sampling
We can approximate p(z,θ|D) by sampling. Gibbs sampling gives samples from p(z,θ|D) by alternating:
$$\theta \sim p(\theta \mid z, D)$$
$$z \sim p(z \mid \theta, D)$$
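To make the alternation concrete, here is a minimal sketch of a Gibbs sampler for a toy model of my own choosing (not from the paper): a 1-D mixture of two unit-variance Gaussians with uniform mixing weights, where z are the cluster assignments and θ are the component means with a broad Gaussian prior.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy dataset: two well-separated clusters (purely illustrative).
D = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, prior_var = 2, 10.0**2          # number of components, prior variance on the means
theta = rng.normal(0, 1, K)        # initial component means
z = rng.integers(0, K, D.size)     # initial assignments

for it in range(200):
    # z ~ p(z | theta, D): a categorical distribution per data point
    logp = -0.5 * (D[:, None] - theta[None, :]) ** 2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=row) for row in p])
    # theta ~ p(theta | z, D): conjugate Gaussian posterior per component
    for k in range(K):
        xk = D[z == k]
        post_var = 1.0 / (1.0 / prior_var + xk.size)
        theta[k] = rng.normal(post_var * xk.sum(), np.sqrt(post_var))

print("last sample of theta:", np.sort(theta))
```

Collecting the θ and z values over iterations (after a burn-in) gives the approximation to the joint posterior.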
2. Variational Bayes
Instead, we can try to establish distributions q(θ) and q(z) and minimize their difference from the distribution we are after, p(θ,z|D). The difference between distributions has a convenient fancy name, the KL divergence. To minimize KL[q(θ)q(z) || p(θ,z|D)] we update:
$$q(\theta) \propto \exp\left(\mathbb{E}_{q(z)}[\log p(\theta, z, D)]\right)$$
$$q(z) \propto \exp\left(\mathbb{E}_{q(\theta)}[\log p(\theta, z, D)]\right)$$
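For the same toy mixture as above (my illustrative assumption, not the paper's model), the two updates stay in closed form: q(θ_k) remains a Gaussian with mean m[k] and variance s2[k], and q(z_i) is a categorical with responsibilities r[i]. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, prior_var = 2, 10.0**2
m, s2 = rng.normal(0, 1, K), np.ones(K)    # q(theta_k) = N(m[k], s2[k])

for it in range(100):
    # q(z) ∝ exp(E[log p(theta, z, D)] taken over q(theta))
    logr = -0.5 * ((D[:, None] - m[None, :]) ** 2 + s2[None, :])
    r = np.exp(logr - logr.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # q(theta) ∝ exp(E[log p(theta, z, D)] taken over q(z)): a Gaussian again
    Nk = r.sum(axis=0)
    s2 = 1.0 / (1.0 / prior_var + Nk)
    m = s2 * (r * D[:, None]).sum(axis=0)

print("variational means:", np.sort(m))
```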
3. Expectation-Maximization
To come up with full-fledged probability distributions for both z and θ might be considered extreme. Why don't we instead use a point estimate for one of them and keep the other nice and nuanced? In EM, the parameter θ is deemed unworthy of a full distribution and is set to its MAP (Maximum A Posteriori) value, θ∗.
$$\theta^* = \operatorname*{argmax}_{\theta} \; \mathbb{E}_{q(z)}[\log p(\theta, z, D)]$$
$$q(z) = p(z \mid \theta^*, D)$$
Strictly speaking we should write θ∗ ∈ argmax, since the maximizer need not be unique. Also, because exp is monotonic, wrapping the expected log in exp (as in the variational update) doesn't change the result of the argmax, so the exp is not necessary anymore.
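For the same illustrative toy mixture, EM keeps a full distribution q(z) (the responsibilities) but only a MAP point estimate θ∗ for the means. A minimal sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, prior_var = 2, 10.0**2
theta = rng.normal(0, 1, K)        # point estimate of the means

for it in range(100):
    # E-step: q(z) = p(z | theta*, D), the responsibilities
    logr = -0.5 * (D[:, None] - theta[None, :]) ** 2
    r = np.exp(logr - logr.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: theta* = argmax_theta E[log p(theta, z, D)] over q(z) (a MAP update)
    theta = (r * D[:, None]).sum(axis=0) / (r.sum(axis=0) + 1.0 / prior_var)

print("MAP means:", np.sort(theta))
```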
4. Maximization-Expectation
Now we flip the roles: we pick a point estimate z∗ for our hidden variables and give the parameters θ the luxury of a full distribution.
$$z^* = \operatorname*{argmax}_{z} \; \mathbb{E}_{q(\theta)}[\log p(\theta, z, D)]$$
$$q(\theta) = p(\theta \mid z^*, D)$$
If our hidden variables z are indicator variables, we suddenly have a computationally cheap method to perform inference on the number of clusters. This is, in other words, model selection (or automatic relevance determination, or imagine another fancy name).
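For the same toy mixture, the flipped scheme looks as follows. This is only a hard-assignment sketch in the spirit of the paper's Bayesian k-means, not the algorithm from the paper itself: z∗ is a point estimate, while q(θ_k) stays a full Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, prior_var = 2, 10.0**2
m, s2 = rng.normal(0, 1, K), np.ones(K)    # q(theta_k) = N(m[k], s2[k])

for it in range(50):
    # z* = argmax_z E[log p(theta, z, D)] over q(theta): hard assignments
    score = -0.5 * ((D[:, None] - m[None, :]) ** 2 + s2[None, :])
    z = score.argmax(axis=1)
    # q(theta) = p(theta | z*, D): conjugate Gaussian posterior per component
    for k in range(K):
        xk = D[z == k]
        s2[k] = 1.0 / (1.0 / prior_var + xk.size)
        m[k] = s2[k] * xk.sum()

print("posterior means of theta:", np.sort(m))
```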
5. Iterated conditional modes
Of course, the poster child of approximate inference is to use point estimates for both the parameters θ and the hidden variables z.
$$\theta^* = \operatorname*{argmax}_{\theta} \; p(\theta, z^*, D)$$
$$z^* = \operatorname*{argmax}_{z} \; p(\theta^*, z, D)$$
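On the same illustrative toy mixture this degenerates into (lightly regularized) k-means: assign each point to its closest mean, then set each mean to its MAP value.

```python
import numpy as np

rng = np.random.default_rng(0)
D = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
K, prior_var = 2, 10.0**2
theta = rng.normal(0, 1, K)

for it in range(50):
    # z* = argmax_z p(theta*, z, D): the closest mean wins
    z = np.abs(D[:, None] - theta[None, :]).argmin(axis=1)
    # theta* = argmax_theta p(theta, z*, D): MAP mean of each cluster
    for k in range(K):
        xk = D[z == k]
        theta[k] = xk.sum() / (xk.size + 1.0 / prior_var)

print("ICM means:", np.sort(theta))
```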
To see how Maximization-Expectation plays out I highly recommend the article. In my opinion, however, the strength of the article is not the k-means alternative itself, but its lucid and concise exposition of these approximation schemes.