Expectation Maximization

Notations: We use $\mathbf{x}$ to denote a vector and $\mathbf{X}$ a matrix. $(\cdot)^T$ represents transposition. $\mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})$ denotes a Gaussian distribution with mean $\mathbf{a}$, covariance $\mathbf{A}$ and argument $\mathbf{x}$. $\mathbb{E}\left\{\cdot\right\}$ refers to the expectation operation.

Expectation Maximization and Maximum Likelihood

Case I: The likelihood function has a closed form. The maximum likelihood estimator (MLE) is an optimal estimator that requires no prior distribution and can approach the Cramer-Rao lower bound (CRLB); we introduced the CRLB in [1]. However, the likelihood function may be difficult to compute when hidden variables are present, as discussed in the second part of this post. We use $\mathbf{x}$ to denote the signal to be estimated and $\mathbf{y}$ to refer to the observed signal. When the likelihood function $p(\mathbf{y}|\mathbf{x})$ can be expressed explicitly, the MLE of $\mathbf{x}$ is written as
\begin{align}
\hat{\mathbf{x} }_{\text{ML} }
&=\underset{\mathbf{x} }{\arg \max}\ p(\mathbf{y}|\mathbf{x})\\
&=\underset{\mathbf{x} }{\arg \max}\ \log p(\mathbf{y}|\mathbf{x})
\end{align}
For simple cases such as the following example, $\hat{\mathbf{x} }_{\text{ML} }$ can be obtained by setting the derivative to zero.

Example: We consider the following system
\begin{align}
\mathbf{y}=\mathbf{Hx}+\mathbf{w}
\end{align}
where $\mathbf{x}\in \mathbb{R}^N$ is the signal to be estimated and $\mathbf{y}\in \mathbb{R}^M$ is the observed (received) signal. The linear transformation matrix $\mathbf{H}\in \mathbb{R}^{M\times N}$ is known, and the noise is $\mathbf{w}\sim \mathcal{N}(\mathbf{w}|\mathbf{0},\sigma^2\mathbf{I})$. The likelihood function of this model is given by
\begin{align}
p(\mathbf{y}|\mathbf{x})=\mathcal{N}(\mathbf{y}|\mathbf{H}\mathbf{x},\sigma^2\mathbf{I})
\end{align}
Then the MLE of $\mathbf{x}$ is
\begin{align}
\hat{\mathbf{x} }_{\text{ML} }
&=\underset{\mathbf{x} }{\arg \max}\ \log p(\mathbf{y}|\mathbf{x})\\
&\overset{(a)}{=}\underset{\mathbf{x} }{\arg \min}\ ||\mathbf{y}-\mathbf{Hx}||^2\\
&=(\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}
\end{align}
where step $(a)$ shows that, in this case, the ML estimator coincides with the least squares (LS) estimator [2].
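As a quick numerical check, the following NumPy sketch (with hypothetical dimensions $M=8$, $N=4$ and noise level $\sigma=0.1$) draws a random instance of this model and verifies that the closed-form expression above agrees with a generic least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, sigma = 8, 4, 0.1                      # hypothetical dimensions and noise level

H = rng.standard_normal((M, N))              # known linear transformation
x = rng.standard_normal(N)                   # true signal
y = H @ x + sigma * rng.standard_normal(M)   # observation y = Hx + w

# Closed-form MLE / LS solution: (H^T H)^{-1} H^T y
x_ml = np.linalg.solve(H.T @ H, H.T @ y)

# Agrees with NumPy's generic least-squares routine
x_ls, *_ = np.linalg.lstsq(H, y, rcond=None)
print(np.allclose(x_ml, x_ls))               # True
```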

Case II: The likelihood function has no closed form because of hidden variables. We use $\boldsymbol{\xi}$ to denote the hidden variables. The log-likelihood function is then written as
\begin{align}
\log p(\mathbf{y}|\mathbf{x})=\log \int p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})\text{d}\boldsymbol{\xi}
\end{align}
This is difficult to compute directly because the logarithm acts on an integral, so the MLE of $\mathbf{x}$ cannot be obtained in closed form. An alternative is the expectation maximization (EM) algorithm, a special case of majorization-minimization (MM) [3], which approximates the ML solution. We now describe the derivation of the EM algorithm step by step. Detailed derivations of EM can also be found in [4] and [Chapter 9, 5].

$\underline{\text{Step }1}$: Introduce a postulated distribution $\hat{q}(\boldsymbol{\xi})$ over the hidden variables; as shown in Step 3, it will eventually be chosen to match the posterior $p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x})$. We then have
\begin{align}
\log \int_{\boldsymbol{\xi} } p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})\text{d}\boldsymbol{\xi}
&=\log \int_{\boldsymbol{\xi} }\hat{q}(\boldsymbol{\xi})\frac{p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})}{\hat{q}(\boldsymbol{\xi})}\text{d}\boldsymbol{\xi}\\
&=\log \left(\mathbb{E}_{\hat{q}(\boldsymbol{\xi})}\left\{\frac{p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})}{\hat{q}(\boldsymbol{\xi})}\right\}\right)
\end{align}

$\underline{\text{Step } 2}$: Find a lower bound on $\log p(\mathbf{y}|\mathbf{x})$ using Jensen's inequality [11]. Since $\log(\cdot)$ is concave, Jensen's inequality gives $\log \mathbb{E}\{Z\}\geq \mathbb{E}\{\log Z\}$, so the last equation can be written as
\begin{align}
\log \int_{\boldsymbol{\xi} } p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})\text{d}\boldsymbol{\xi}
&\geq \mathbb{E}_{\hat{q}(\boldsymbol{\xi})} \left\{\log \frac{p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})}{\hat{q}(\boldsymbol{\xi})}\right\}\\
&=\mathbb{E}_{\hat{q}(\boldsymbol{\xi})} \left\{\log p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})\right\}-\mathbb{E}_{\hat{q}(\boldsymbol{\xi})}\left\{ {\log \hat{q}(\boldsymbol{\xi})}\right\}
\end{align}
Since the distribution $\hat{q}(\boldsymbol{\xi})$ does not depend on $\mathbf{x}$, maximizing this lower bound over $\mathbf{x}$ only requires maximizing the first term
\begin{align}
\text{M-Step:}\quad \hat{\mathbf{x} }=\underset{\mathbf{x} }{\arg \max} \ \mathbb{E}_{\hat{q}(\boldsymbol{\xi})}\left\{\log p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})\right\}
\end{align}
This is the M-step of the EM algorithm.

$\underline{\text{Step } 3}$: Find $\hat{q}(\boldsymbol{\xi})$ by minimizing the Kullback-Leibler (KL) divergence [6], also known as relative entropy. Starting from the lower bound in Step 2,
\begin{align}
\mathbb{E}_{\hat{q}(\boldsymbol{\xi})} \left\{\log \frac{p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})}{\hat{q}(\boldsymbol{\xi})}\right\}
&=\mathbb{E}_{\hat{q}(\boldsymbol{\xi})} \left\{\log \frac{p(\mathbf{y}|\mathbf{x})p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x})}{\hat{q}(\boldsymbol{\xi})}\right\}\\
&=\mathbb{E}_{\hat{q}(\boldsymbol{\xi})} \left\{\log p(\mathbf{y}|\mathbf{x})\right\}
+\mathbb{E}_{\hat{q}(\boldsymbol{\xi})}\left\{\log \frac{p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x})}{\hat{q}(\boldsymbol{\xi})}\right\}\\
&=\log p(\mathbf{y}|\mathbf{x})-\mathbb{E}_{\hat{q}(\boldsymbol{\xi})}\left\{\log \frac{\hat{q}(\boldsymbol{\xi})}{p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x})}\right\}\\
&\overset{(b)}{=}\log p(\mathbf{y}|\mathbf{x})-\text{KL}(\hat{q}(\boldsymbol{\xi})\,\|\,p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x}))
\end{align}
In $(b)$, since $\log p(\mathbf{y}|\mathbf{x})$ does not depend on $\hat{q}(\boldsymbol{\xi})$, maximizing $\mathbb{E}_{\hat{q}(\boldsymbol{\xi})} \left\{\log \frac{p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})}{\hat{q}(\boldsymbol{\xi})}\right\}$ over $\hat{q}(\boldsymbol{\xi})$ is equivalent to minimizing $\text{KL}(\hat{q}(\boldsymbol{\xi})\,\|\,p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x}))$, which is zero when the two distributions coincide. Thus
\begin{align}
\text{E-Step:}\quad \hat{q}(\boldsymbol{\xi})=p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x})
\end{align}
That is, the postulated distribution $\hat{q}(\boldsymbol{\xi})$ is replaced by the posterior distribution $p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x})$.
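As a sanity check, substituting $\hat{q}(\boldsymbol{\xi})=p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x})$ back into the decomposition above makes the KL term vanish, so the lower bound becomes tight:
\begin{align}
\mathbb{E}_{\hat{q}(\boldsymbol{\xi})} \left\{\log \frac{p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})}{\hat{q}(\boldsymbol{\xi})}\right\}
=\log p(\mathbf{y}|\mathbf{x})-\underbrace{\text{KL}(p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x})\,\|\,p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x}))}_{=0}
=\log p(\mathbf{y}|\mathbf{x})
\end{align}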

The EM algorithm is summarized as follows.


For $t=1,\cdots,T$
\begin{align}
\text{E-Step:}& \quad \hat{q}^{(t)}(\boldsymbol{\xi})=p(\boldsymbol{\xi}|\mathbf{y},\mathbf{x}^{(t-1)})\\
\text{M-Step:}& \quad \mathbf{x}^{(t)}=\underset{\mathbf{x} }{\arg \max} \ \mathbb{E}_{\hat{q}^{(t)}(\boldsymbol{\xi})}\left\{\log p(\mathbf{y},\boldsymbol{\xi}|\mathbf{x})\right\}
\end{align}
end
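To make the E-step/M-step loop concrete, here is a self-contained NumPy sketch on a toy problem (not from the references): a 1-D Gaussian mixture with two equally likely unit-variance components, where the hidden variables $\boldsymbol{\xi}$ are the component labels and the unknowns are the two means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two equally likely unit-variance components with means -2 and +3.
labels = rng.integers(0, 2, size=500)                # hidden variables, unknown to EM
y = np.where(labels == 0, -2.0, 3.0) + rng.standard_normal(500)

mu = np.array([-1.0, 1.0])                           # initial guess of the two means
for t in range(50):
    # E-step: posterior responsibilities q(label) under the current means
    # (equal mixing weights of 0.5 are assumed, so they cancel out).
    log_r = -0.5 * (y[:, None] - mu[None, :]) ** 2   # log N(y | mu_k, 1) up to a constant
    r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)

    # M-step: maximize E_q{log p(y, labels | mu)}, which gives responsibility-weighted means.
    mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)

print(mu)   # close to (-2, 3), up to label permutation
```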


EM-based Channel Estimator

System Model

We consider the following channel model
\begin{align}
\mathbf{Y}=\mathbf{HS}+\mathbf{W}
\end{align}
where $\mathbf{Y}\in \mathbb{R}^{M\times T}$ is the received signal, $\mathbf{S}\in \mathbb{R}^{N\times T}$ is the pilot signal with $T$ being the pilot length, $\mathbf{H}\in \mathbb{R}^{M\times N}$ is the channel to be estimated, and the noise $\mathbf{W}$ is additive Gaussian. Using $\tilde{\mathbf{Y} }$, $\tilde{\mathbf{S} }$, $\tilde{\mathbf{H} }$ and $\tilde{\mathbf{W} }$ to represent $\mathbf{Y}^T$, $\mathbf{S}^T$, $\mathbf{H}^T$, and $\mathbf{W}^T$, respectively, yields
\begin{align}
\tilde{\mathbf{Y} }=\tilde{\mathbf{S} }\tilde{\mathbf{H} }+\tilde{\mathbf{W} }
\end{align}
The channel $\tilde{\mathbf{H} }$ is estimated column by column, so the model can also be written as
\begin{align}
\tilde{\mathbf{y} }_m=\tilde{\mathbf{S} }\tilde{\mathbf{h} }_m+\tilde{\mathbf{w} }_m, \ m=1,\cdots,M
\end{align}
Since each column $\tilde{\mathbf{h} }_m$ is estimated in the same way, we omit the subscript and the tilde and obtain the following model
\begin{align}
\mathbf{y}=\mathbf{S}\mathbf{h}+\mathbf{w}
\end{align}
where $\mathbf{y}\in \mathbb{R}^{T\times 1}$, $\mathbf{S}\in \mathbb{R}^{T\times N}$ and $\mathbf{h}\in \mathbb{R}^{N\times 1}$. In addition, we assume that the noise $\mathbf{w}$ is Gaussian with $\mathbf{w}\sim \mathcal{N}\left(\mathbf{w}|\mathbf{0},\triangle \mathbf{I}\right)$. In [7], the noise variance $\triangle$ is unknown.
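The reduction from the matrix model to the column-wise vector model can be checked numerically. The sketch below uses hypothetical dimensions ($M=4$, $N=3$, $T=6$) and Gaussian entries purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, T, noise_var = 4, 3, 6, 0.01       # hypothetical dimensions and noise variance

H = rng.standard_normal((M, N))          # channel to be estimated
S = rng.standard_normal((N, T))          # pilot matrix
W = np.sqrt(noise_var) * rng.standard_normal((M, T))
Y = H @ S + W                            # matrix model Y = HS + W

# Transposed model: Y^T = S^T H^T + W^T
Y_t, S_t, H_t, W_t = Y.T, S.T, H.T, W.T

# Column-wise model: the m-th column of H^T satisfies y_m = S^T h_m + w_m
for m in range(M):
    assert np.allclose(Y_t[:, m], S_t @ H_t[:, m] + W_t[:, m])
```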

EM-based Channel Estimator

In [7], the authors adopt the following Gaussian mixture prior to account for pilot contamination
\begin{align}
p(\mathbf{h};\boldsymbol{\lambda},\boldsymbol{\sigma})=\sum_{n=1}^N \lambda_n\mathcal{N}\left(h_{n}|0,\sigma_n^2\right)
\end{align}
Note that the parameters $\boldsymbol{\lambda}=\left\{\lambda_n\right\}_{n=1}^N$ and $\boldsymbol{\sigma}=\left\{\sigma_n\right\}_{n=1}^N$ are unknown, as is $\triangle$, so a Bayesian estimator cannot be applied directly to estimate $\mathbf{h}$. A simple alternative is the least squares (LS) estimator $\hat{\mathbf{h} }_{\text{LS} }=(\mathbf{S}^T\mathbf{S})^{-1}\mathbf{S}^T\mathbf{y}$, but its performance is easily degraded by noise. Another option is the MLE: even though the variance $\triangle$ is unknown, the EM algorithm can be used to approximate the ML solution.

As mentioned in the previous section, the EM algorithm consists of two parts, the E-step and the M-step. With $\boldsymbol{\eta}=\left\{\boldsymbol{\lambda},\boldsymbol{\sigma}\right\}$ collecting the unknown prior parameters, they are written as
\begin{align}
\text{E-Step:}\quad &\hat{q}(\mathbf{h})=p(\mathbf{h}|\mathbf{y};\boldsymbol{\eta}^{(t)},\triangle^{(t)})\\
\text{M-Step:}\quad &\boldsymbol{\eta}^{(t+1)}=\underset{\boldsymbol{\eta} }{\arg\max} \ \mathbb{E}\left\{\log p(\mathbf{y},\mathbf{h};\boldsymbol{\eta},\triangle^{(t)})\right\}\\
&\triangle^{(t+1)}=\underset{\triangle}{\arg\max} \ \mathbb{E}\left\{\log p(\mathbf{y},\mathbf{h};\boldsymbol{\eta}^{(t)},\triangle)\right\}
\end{align}
Thanks to the separable structure, this can also be written component-wise, for $n=1,\cdots, N$, as
\begin{align}
\text{E-Step:}\quad &\hat{q}(h_n)=p(h_n|\mathbf{y};\boldsymbol{\eta}^{(t)},\triangle^{(t)})\\
\text{M-Step:}\quad &\eta_n^{(t+1)}=\underset{\eta_n}{\arg\max} \ \mathbb{E}\left\{\log p(\mathbf{y},\mathbf{h};\boldsymbol{\eta},\triangle^{(t)})\right\}\\
&\triangle^{(t+1)}=\underset{\triangle}{\arg\max} \ \mathbb{E}\left\{\log p(\mathbf{y},\mathbf{h};\boldsymbol{\eta}^{(t)},\triangle)\right\}
\end{align}
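To illustrate what the $\triangle$ update looks like, the following sketch assumes the E-step returns an approximate Gaussian posterior on $\mathbf{h}$ with mean `h_hat` and per-component variances `v` (the kind of output an AMP-style E-step provides). It then maximizes the expected complete-data log-likelihood over $\triangle$ for the model $\mathbf{y}=\mathbf{S}\mathbf{h}+\mathbf{w}$; this is a generic EM update for the linear-Gaussian likelihood, not necessarily the exact expression used in [7].

```python
import numpy as np

def update_noise_variance(y, S, h_hat, v):
    """EM M-step for the noise variance in y = S h + w, w ~ N(0, delta * I).

    With q(h) Gaussian (mean h_hat, elementwise variances v), maximizing
    E_q{log p(y, h; delta)} over delta gives
        delta = ( ||y - S h_hat||^2 + sum_n v_n ||s_n||^2 ) / T
    """
    T = y.shape[0]
    residual = y - S @ h_hat
    col_energy = np.sum(S ** 2, axis=0)      # ||s_n||^2 for each column of S
    return (residual @ residual + col_energy @ v) / T
```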
In practice, the posterior distribution in the E-step is extremely difficult to calculate exactly. Fortunately, message passing on a factor graph [Chapter 2, 8] is an efficient algorithm for computing marginal distributions. Inspired by message passing, and ignoring high-order terms, approximate message passing (AMP) [9] was proposed for marginalizing the joint distribution. A concise derivation is given in [10].

In [7], the authors use AMP to carry out the E-step. This has two advantages: AMP has low complexity, and it approaches the Bayes-optimal performance as the system dimensions grow large.

The EM-based channel estimator is then summarized as follows.
$\underline{\text{Step }1}$ (E-Step): Use AMP to compute the (approximate) marginal posterior distribution of each $h_n$.
$\underline{\text{Step }2}$ (M-Step): Update the parameters as
\begin{align}
&\eta_n^{(t+1)}=\underset{\eta_n}{\arg\max} \ \mathbb{E}\left\{\log p(\mathbf{y},\mathbf{h};\boldsymbol{\eta},\triangle^{(t)})\right\}\\
&\triangle^{(t+1)}=\underset{\triangle}{\arg\max} \ \mathbb{E}\left\{\log p(\mathbf{y},\mathbf{h};\boldsymbol{\eta}^{(t)},\triangle)\right\}
\end{align}

The details of these two steps can be found in [7].
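For completeness, here is a minimal sketch of the AMP iteration in the spirit of [9], with the scalar denoiser left as a callback; the Gaussian-mixture posterior-mean denoiser of [7] would be plugged in as `denoiser`. The function names, the soft-threshold example and its parameter `alpha` are illustrative choices, and a suitably normalized (e.g., unit-norm column) measurement matrix is assumed; this is a sketch, not the exact algorithm of [7].

```python
import numpy as np

def amp(y, A, denoiser, num_iter=30):
    """Minimal AMP loop for y = A x + w (sketch in the spirit of [9]).

    `denoiser(r, tau2)` returns the elementwise estimate of x from the
    pseudo-observation r = x + noise of variance tau2, together with the
    average derivative of the denoiser (used in the Onsager correction).
    """
    M, N = A.shape
    x = np.zeros(N)
    z = y.copy()
    for _ in range(num_iter):
        tau2 = np.mean(z ** 2)                    # effective noise variance estimate
        r = x + A.T @ z                           # pseudo-observation of x
        x, avg_deriv = denoiser(r, tau2)          # scalar denoising (E-step style)
        z = y - A @ x + (N / M) * avg_deriv * z   # residual with Onsager term
    return x, tau2

# Example denoiser: soft thresholding, as used for compressed sensing in [9].
def soft_threshold(r, tau2, alpha=1.5):
    t = alpha * np.sqrt(tau2)
    x = np.sign(r) * np.maximum(np.abs(r) - t, 0.0)
    avg_deriv = np.mean(np.abs(r) > t)            # fraction of active coordinates
    return x, avg_deriv
```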

References

[1] https://www.qiuyun-blog.cn/2018/10/28/%E5%85%8B%E6%8B%89%E7%BE%8E-%E7%BD%97%E4%B8%8B%E7%95%8C/
[2] Kay S M. Fundamentals of Statistical Signal Processing (Chinese edition, translated by 罗鹏飞)[M]. Publishing House of Electronics Industry, 2014.
[3] Sun Y, Babu P, Palomar D P. Majorization-minimization algorithms in signal processing, communications, and machine learning[J]. IEEE Transactions on Signal Processing, 2017, 65(3): 794-816.
[4] https://people.csail.mit.edu/rameshvs/content/gmm-em.pdf
[5] Bishop C M. Pattern Recognition and Machine Learning[M]. Springer, 2006.
[6] https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
[7] Wen C K, Jin S, Wong K K, et al. Channel estimation for massive MIMO using Gaussian-mixture Bayesian learning[J]. IEEE Transactions on Wireless Communications, 2015, 14(3): 1356-1368.
[8] Richardson T, Urbanke R. Modern coding theory[M]. Cambridge university press, 2008.
[9] Donoho D L, Maleki A, Montanari A. Message-passing algorithms for compressed sensing[J]. Proceedings of the National Academy of Sciences, 2009, 106(45): 18914-18919.
[10] Meng X, Wu S, Kuang L, et al. An expectation propagation perspective on approximate message passing[J]. IEEE Signal Processing Letters, 2015, 22(8): 1194-1197.
[11] https://en.wikipedia.org/wiki/Jensen%27s_inequality