How does Factor Analysis explain covariance while PCA explains variance?



Here is a quote from Bishop's "Pattern Recognition and Machine Learning", section 12.2.4 "Factor analysis":

[Screenshot of the quoted passage from Bishop, section 12.2.4, with the sentence about the matrix W capturing the covariance between variables highlighted]

According to the highlighted part, factor analysis captures the covariances between variables in the matrix W. I wonder, HOW?

Here is how I understand it. Say x is the observed p-dimensional variable, W is the factor loading matrix, and z is the coefficient vector. Then we have
$$x = \mu + Wz + \epsilon,$$
that is,
$$\begin{pmatrix} x_1 \\ \vdots \\ x_p \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix} + \begin{pmatrix} | & & | \\ w_1 & \cdots & w_m \\ | & & | \end{pmatrix} \begin{pmatrix} z_1 \\ \vdots \\ z_m \end{pmatrix} + \epsilon,$$
and each column in W is a factor loading vector
$$w_i = \begin{pmatrix} w_{i1} \\ \vdots \\ w_{ip} \end{pmatrix}.$$
Here, as I wrote, W has m columns, meaning there are m factors under consideration.

Now here is the point: according to the highlighted part, I think the loadings in each column w_i explain the covariance in the observed data, right?

For example, let's take a look at the first loading vector $w_1$: for $1 \le i, j, k \le p$, if $w_{1i} = 10$, $w_{1j} = 11$ and $w_{1k} = 0.1$, then I'd say $x_i$ and $x_j$ are highly correlated, whereas $x_k$ seems uncorrelated with them, am I right?

And if this is how factor analysis explains the covariance between observed features, then I'd say PCA also explains the covariance, right?
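To make my setup concrete, here is a minimal simulation sketch (Python/NumPy; the particular W and Ψ values below are made up purely for illustration) checking that, under this model with a diagonal noise covariance Ψ as in Bishop, the covariance of x comes out as $WW^\top + \Psi$:

```python
import numpy as np

rng = np.random.default_rng(0)

p, m, n = 4, 2, 200_000               # p observed variables, m factors, n samples
mu = np.zeros(p)                       # mean vector (zero here, for simplicity)
W = rng.normal(size=(p, m))            # made-up p x m loading matrix
psi = np.array([0.5, 1.0, 0.3, 0.8])   # made-up unique variances (diagonal of Psi)

# x = mu + W z + eps, with z ~ N(0, I_m) and eps ~ N(0, diag(psi))
z = rng.normal(size=(n, m))
eps = rng.normal(size=(n, p)) * np.sqrt(psi)
x = mu + z @ W.T + eps

implied = W @ W.T + np.diag(psi)       # model-implied covariance of x
sample = np.cov(x, rowvar=False)       # sample covariance of the simulated x
print(np.round(implied, 2))
print(np.round(sample, 2))             # the two agree up to sampling noise
```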


As @ttnphns's plot refers to the subject space representation: I didn't know about subject space plots before; now I understand them. Here is a tutorial about variable space and subject space: amstat.org/publications/jse/v10n1/yu/biplot.html ;-)
avocado

I'd remark as well that a loading plot, which shows loadings, is actually in subject space. Showing both variable and subject spaces in one plot is a biplot. Some pictures demonstrating it: stats.stackexchange.com/a/50610/3277.
ttnphns

Here is a question about what "common variance" and "shared variance" mean, terminologically: stats.stackexchange.com/q/208175/3277.
ttnphns

Answers:



The distinction between principal component analysis and factor analysis is discussed in numerous textbooks and articles on multivariate techniques. You may find the full thread, a newer one, and assorted other answers on this site, too.

I'm not going to go into detail. I've already given a concise answer and a longer one, and would now like to clarify it with a pair of pictures.

Graphical representation

The picture below explains PCA. (It was borrowed from here, where PCA is compared with linear regression and canonical correlations. The picture is the vector representation of variables in subject space; to understand what that is, you may want to read the 2nd paragraph there.)

[Figure: PCA in subject space. Variables X1 and X2 and principal components P1 and P2 all lie in "plane X"; the a's are the loadings, the projections of the variables on the components.]

The PCA configuration in this picture was described there; I will repeat the main points. Principal components P1 and P2 lie in the same space that is spanned by the variables X1 and X2, "plane X". The squared length of each of the four vectors is its variance. The covariance between X1 and X2 is $\mathrm{cov}_{12} = |X_1|\,|X_2|\,r$, where $r$ equals the cosine of the angle between their vectors.

The projections (coordinates) of the variables on the components, the a's, are the loadings of the components on the variables: loadings are the regression coefficients in the linear combinations that model the variables by the standardized components. "Standardized" - because information about the components' variances is already absorbed in the loadings (remember, loadings are eigenvectors scaled by the square roots of the respective eigenvalues). Because of that, and because the components are uncorrelated, the loadings are the covariances between the variables and the components.

Using PCA for the aim of dimensionality/data reduction compels us to retain only P1 and to regard P2 as the remainder, or error. $a_{11}^2 + a_{21}^2 = |P_1|^2$ is the variance captured (explained) by P1.
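A quick numeric check of that identity, as a hedged side note (Python/NumPy on made-up data; the loadings are computed as eigenvectors scaled by the square roots of the eigenvalues): the squared loadings of P1 on X1 and X2 sum to P1's variance.

```python
import numpy as np

rng = np.random.default_rng(5)

# two made-up correlated variables X1 and X2
x1 = rng.normal(size=10_000)
x2 = 0.7 * x1 + 0.5 * rng.normal(size=10_000)
Sigma = np.cov(np.c_[x1, x2], rowvar=False)

eigval, eigvec = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
a = eigvec[:, -1] * np.sqrt(eigval[-1])     # loadings a_11, a_21 of the 1st component P1

print(round(a[0]**2 + a[1]**2, 4))          # sum of squared loadings of P1 ...
print(round(eigval[-1], 4))                 # ... equals P1's variance, |P1|^2
```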


The picture below demonstrates factor analysis performed on the same variables X1 and X2 with which we did PCA above. (I will speak of the common factor model, for there exist others: the alpha factor model, the image factor model.) The smiley sun helps with the lighting.

The common factor is F. It is the analogue of the main component P1 above. Can you see the difference between the two? Yes, clearly: the factor does not lie in the variables' space, "plane X".

How to get that factor with one finger, i.e. how to do factor analysis? Let's try. On the previous picture, hook the end of the P1 arrow with your nail tip and pull it away from "plane X", while visualizing how two new planes appear, "plane U1" and "plane U2", connecting the hooked vector and the two variable vectors. The two planes form a hood, X1 - F - X2, above "plane X".

[Figure: factor analysis in subject space. The common factor F is pulled out of "plane X"; "plane U1" (through F and X1) and "plane U2" (through F and X2) form a hood above it.]

Continue to pull while contemplating the hood, and stop when "plane U1" and "plane U2" form 90 degrees between them. Ready, factor analysis is done. Well, yes, but not yet optimally. To do it right, as packages do, repeat the whole exercise of pulling the arrow, now adding small left-right swings of your finger as you pull. In doing so, find the position of the arrow where the sum of squared projections of both variables onto it is maximized, while you maintain that 90-degree angle. Stop. You did factor analysis and found the position of the common factor F.

To remark again: unlike the principal component P1, the factor F does not belong to the variables' space "plane X". It therefore is not a function of the variables (the principal component is, and you can see from the two top pictures here that PCA is fundamentally two-directional: it predicts variables by components and vice versa). Factor analysis is thus not a description/simplification method like PCA; it is a modeling method whereby a latent factor steers the observed variables, one-directionally.

The loadings a's of the factor on the variables are like loadings in PCA; they are the covariances, and they are the coefficients that model the variables by the (standardized) factor. $a_1^2 + a_2^2 = |F|^2$ is the variance captured (explained) by F. The factor was found so as to maximize this quantity, as if it were a principal component. However, that explained variance is no longer the variables' gross variance; instead, it is the variance by which they co-vary (correlate). Why so?

Get back to the picture. We extracted F under two requirements. One was the just-mentioned maximized sum of squared loadings. The other was the creation of the two perpendicular planes, "plane U1" containing F and X1, and "plane U2" containing F and X2. This way each of the X variables appeared decomposed. X1 was decomposed into variables F and U1, mutually orthogonal; X2 was likewise decomposed into variables F and U2, also orthogonal. And U1 is orthogonal to U2. We know what F is - the common factor. The U's are called unique factors. Each variable has its unique factor. The meaning is as follows. U1 behind X1 and U2 behind X2 are the forces that hinder X1 and X2 from correlating. But F - the common factor - is the force behind both X1 and X2 that makes them correlate. And the variance being explained lies along that common factor. So it is pure collinearity variance. It is that variance that makes $\mathrm{cov}_{12} > 0$; the actual value of $\mathrm{cov}_{12}$ is determined by the inclinations of the variables towards the factor, by the a's.

A variable's variance (its vector's length squared) thus consists of two additive, disjoint parts: uniqueness $u^2$ and communality $a^2$. With two variables, as in our example, we can extract at most one common factor, so communality = the single loading squared. With many variables we might extract several common factors, and a variable's communality will be the sum of its squared loadings. On our picture the common factor space is unidimensional (just F itself); when m common factors exist, that space is m-dimensional, with communalities being the variables' projections onto that space, and loadings being the projections of the variables (and of those communality projections alike) onto the factors that span the space. Variance explained in factor analysis is variance within that common factor space, different from the variables' space in which components explain variance. The space of the variables is in the belly of the combined space: m common + p unique factors.
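As a hedged numeric aside (not part of the geometric argument; it uses scikit-learn's FactorAnalysis on made-up, standardized data): the communality of each variable is the sum of its squared loadings, the uniqueness is the fitted unique variance, and for standardized variables the two should add up to roughly 1.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# made-up data: 5 variables driven by 2 hidden sources plus noise, then standardized
n = 5_000
hidden = rng.normal(size=(n, 2))
X = hidden @ rng.normal(size=(2, 5)) + 0.7 * rng.normal(size=(n, 5))
X = StandardScaler().fit_transform(X)        # each variable now has variance ~ 1

fa = FactorAnalysis(n_components=2).fit(X)
loadings = fa.components_.T                  # shape (5 variables, 2 factors)

communality = (loadings ** 2).sum(axis=1)    # sum of squared loadings per variable
uniqueness = fa.noise_variance_              # fitted unique variances

print(np.round(communality, 3))
print(np.round(uniqueness, 3))
print(np.round(communality + uniqueness, 3)) # close to 1 for each standardized variable
```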

[Figure: the common factor space ("factor plane") spanned by F1 and F2. Variable X1 is decomposed into its communality C1, lying in the factor plane, and its unique factor U1, orthogonal to it; the communalities of X2 and X3 are also shown.]

Just glance at the current picture, please. Several variables (say X1, X2, X3) were factor-analyzed, extracting two common factors. The factors F1 and F2 span the common factor space, the "factor plane". Of the bunch of analysed variables only one (X1) is shown in the figure. The analysis decomposed it into two orthogonal parts, the communality C1 and the unique factor U1. The communality lies in the "factor plane" and its coordinates on the factors are the loadings by which the common factors load X1 (= the coordinates of X1 itself on the factors). On the picture, the communalities of the other two variables - the projections of X2 and of X3 - are also displayed. It is interesting to remark that the two common factors can, in a sense, be seen as the principal components of all those communality "variables". Whereas usual principal components summarize, by seniority, the multivariate total variance of the variables, the factors summarize likewise their multivariate common variance.¹

Why was all that verbiage needed? I just wanted to give evidence for the claim that when you decompose each of the correlated variables into two orthogonal latent parts, one (A) representing uncorrelatedness (orthogonality) between the variables and the other (B) representing their correlatedness (collinearity), and you extract factors from the combined B's only, you will find yourself explaining the pairwise covariances by those factors' loadings. In our factor model, $\mathrm{cov}_{12} \approx a_1 a_2$ - the factors restore the individual covariances by means of the loadings. In the PCA model it is not so, since PCA explains the undecomposed, mixed collinear+orthogonal native variance. Both the strong components that you retain and the subsequent ones that you drop are fusions of the (A) and (B) parts; hence PCA can tap covariances, by its loadings, only blindly and grossly.


Contrast list: PCA vs. FA

  • PCA: operates in the space of the variables. FA: transcends the space of the variables.
  • PCA: takes variability as is. FA: segments variability into common and unique parts.
  • PCA: explains unsegmented variance, i.e. the trace of the covariance matrix. FA: explains common variance only, hence explains (restores by loadings) correlations/covariances, the off-diagonal elements of the matrix. (PCA explains the off-diagonal elements too - but in passing, in an offhand manner - simply because variances are shared in the form of covariances.) A numeric sketch of this point follows the list.
  • PCA: components are theoretically linear functions of the variables, and the variables are theoretically linear functions of the components. FA: the variables are theoretically linear functions of the factors, only.
  • PCA: an empirical summarizing method; it retains m components. FA: a theoretical modeling method; it fits a fixed number m of factors to the data; FA can be tested (confirmatory FA).
  • PCA: is the simplest metric MDS; it aims to reduce dimensionality while indirectly preserving distances between data points as much as possible. FA: factors are essential latent traits behind the variables which make them correlate; the analysis aims to reduce the data to those essences only.
  • PCA: rotation/interpretation of components - sometimes (PCA is not realistic enough as a latent-traits model). FA: rotation/interpretation of factors - routinely.
  • PCA: a data reduction method only. FA: also a method to find clusters of coherent variables (this is because variables cannot correlate beyond a factor).
  • PCA: loadings and scores are independent of the number m of components "extracted". FA: loadings and scores depend on the number m of factors "extracted".
  • PCA: component scores are exact component values. FA: factor scores are approximations of the true factor values, and several computational methods exist. Factor scores do lie in the space of the variables (as components do) while true factors (as embodied by factor loadings) do not.
  • PCA: usually no assumptions. FA: an assumption of weak partial correlations; sometimes a multivariate normality assumption; some datasets may be "bad" for the analysis unless transformed.
  • PCA: a noniterative algorithm; always successful. FA: an iterative algorithm (typically); sometimes a nonconvergence problem; singularity may be a problem.
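Here is the numeric companion to the third bullet above (a hedged sketch on made-up data, using scikit-learn's FactorAnalysis and PCA; the exact numbers will vary with the data): FA's reconstruction $WW^\top + \Psi$ targets the off-diagonal covariances and matches the diagonal almost by construction, while a truncated PCA reconstruction spreads its effort over the whole matrix, trace included.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis, PCA

rng = np.random.default_rng(2)

# made-up data: 6 correlated variables driven by 2 common sources plus noise
n = 10_000
source = rng.normal(size=(n, 2))
X = source @ rng.normal(size=(2, 6)) + rng.normal(size=(n, 6))
S = np.cov(X, rowvar=False)                      # sample covariance matrix

# FA reconstruction of S: common part W W^T plus diagonal uniquenesses Psi
fa = FactorAnalysis(n_components=2).fit(X)
W = fa.components_.T
S_fa = W @ W.T + np.diag(fa.noise_variance_)

# PCA reconstruction from 2 components: loadings = eigenvectors * sqrt(eigenvalues)
pca = PCA(n_components=2).fit(X)
L = pca.components_.T * np.sqrt(pca.explained_variance_)
S_pca = L @ L.T

off = ~np.eye(6, dtype=bool)                     # mask for off-diagonal elements
print("mean abs error, off-diagonal, FA :", np.abs(S_fa - S)[off].mean())
print("mean abs error, off-diagonal, PCA:", np.abs(S_pca - S)[off].mean())
print("mean abs error, diagonal,     FA :", np.abs(np.diag(S_fa) - np.diag(S)).mean())
print("mean abs error, diagonal,     PCA:", np.abs(np.diag(S_pca) - np.diag(S)).mean())
```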

¹ For the meticulous. One might ask where the variables X2 and X3 themselves are on the picture, and why they were not drawn. The answer is that we can't draw them, even theoretically. The space on the picture is 3-dimensional (defined by the "factor plane" and the unique vector U1, with X1 lying on their mutual complement, the plane shaded grey - that's what corresponds to one slope of the "hood" on picture No. 2), so our graphic resources are exhausted. The three-dimensional space spanned by the three variables X1, X2, X3 together is another space. Neither the "factor plane" nor U1 is a subspace of it. That is what differs from PCA: factors do not belong to the variables' space. Each variable separately lies in its own grey plane orthogonal to the "factor plane" - just like X1 shown on our picture, and that is all: if we were to add, say, X2 to the plot, we would have to invent a 4th dimension. (Just recall that all the U's have to be mutually orthogonal; so, to add another U, you must expand the dimensionality further.)

Just as in regression the coefficients are the coordinates, on the predictors, both of the dependent variable(s) and of the prediction(s) (see the picture under "Multiple Regression", and here too), in FA the loadings are the coordinates, on the factors, both of the observed variables and of their latent parts - the communalities. And exactly as in regression that fact does not make the dependent(s) and the predictors subspaces of each other, in FA the similar fact does not make the observed variables and the latent factors subspaces of each other. A factor is "alien" to a variable in much the same sense as a predictor is "alien" to a dependent response. But in PCA it is the other way around: the principal components are derived from the observed variables and are confined to their space.

So, once again: the m common factors of FA are not a subspace of the p input variables. On the contrary: the variables form a subspace in the m+p (m common factors + p unique factors) union hyperspace. When seen from this perspective (i.e. with the unique factors included too) it becomes clear that classic FA is not a dimensionality shrinkage technique, like classic PCA, but a dimensionality expansion technique. Nevertheless, we give our attention only to a small (m-dimensional common) part of that bloat, since this part alone explains the correlations.


Thanks, and nice plot. Your answer (stats.stackexchange.com/a/94104/30540) helps a lot.
avocado

(+1) Great answer and nice illustrations! (I have to wait two more days before offering the bounty.)
chl

@chl, I'm so moved.
ttnphns

@ttnphns: The "subject space" (your plane X) is a space with as many coordinates as there are data points in the dataset, right? So if a dataset (with two variables X1 and X2) has 100 data points, then your plane X is 100-dimensional? But then how can the factor F lie outside of it? Shouldn't all 100 data points have some values along the factor? And as there are no other data points, it would seem that the factor F has to lie in the same 100-dimensional "subject space", i.e. in the plane X? What am I missing?
amoeba says Reinstate Monica

@amoeba, your question is legitimate, and yes, you are missing a thing. See the 1st paragraph: stats.stackexchange.com/a/51471/3277. Redundant dimensions are dropped. Subject space has as many actual, non-redundant dimensions as the corresponding variable space has. So "space X" is a plane. If we add +1 dimension (to accommodate F), the whole configuration becomes singular, unsolvable. F always extends out of the variable space.
ttnphns


"Explaining covariance" vs. explaining variance

Bishop actually means a very simple thing. Under the factor analysis model (eq. 12.64)

$$p(x \mid z) = \mathcal{N}(x \mid Wz + \mu, \Psi),$$
the covariance matrix of x is going to be (eq. 12.65)
$$C = WW^\top + \Psi.$$
This is essentially what factor analysis does: it finds a matrix of loadings W and a diagonal matrix of uniquenesses Ψ such that the actually observed covariance matrix Σ is approximated as well as possible by C:
$$\Sigma \approx WW^\top + \Psi.$$
Notice that the diagonal elements of C will be exactly equal to the diagonal elements of Σ, because we can always choose the diagonal matrix Ψ such that the reconstruction error on the diagonal is zero. The real challenge is then to find loadings W that approximate the off-diagonal part of Σ well.

The off-diagonal part of Σ consists of covariances between variables; hence Bishop's claim that factor loadings are capturing the covariances. The important bit here is that factor loadings do not care at all about individual variances (diagonal of Σ).
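A minimal numeric illustration of this point (a hedged sketch, not from Bishop; it assumes scikit-learn's FactorAnalysis, which fits this model by maximum likelihood, and made-up data): the reconstructed $C = WW^\top + \Psi$ follows the diagonal of Σ almost exactly, while the loadings do the work of approximating the off-diagonal part.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)

# made-up data with a genuine one-factor structure plus unique noise
n, p = 20_000, 4
z = rng.normal(size=(n, 1))                      # the single common factor
true_W = np.array([[0.9], [0.8], [0.7], [0.6]])  # hypothetical true loadings
X = z @ true_W.T + rng.normal(size=(n, p)) * np.array([0.4, 0.5, 0.6, 0.7])

Sigma = np.cov(X, rowvar=False)                  # observed covariance matrix

fa = FactorAnalysis(n_components=1).fit(X)
W = fa.components_.T                             # fitted loadings, shape (p, 1)
Psi = np.diag(fa.noise_variance_)                # fitted diagonal uniquenesses
C = W @ W.T + Psi                                # model covariance C = W W^T + Psi

print(np.round(Sigma, 3))
print(np.round(C, 3))   # the diagonals agree very closely;
                        # the off-diagonal part is what the loadings W must capture
```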

In contrast, PCA loadings $\tilde W$ are eigenvectors of the covariance matrix Σ scaled up by the square roots of their eigenvalues. If only $m < k$ principal components are retained, then
$$\Sigma \approx \tilde W \tilde W^\top,$$
meaning that PCA loadings try to reproduce the whole covariance matrix (and not only its off-diagonal part, as FA does). This is the main difference between PCA and FA.
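To see the contrast numerically (a hedged sketch on made-up data; the PCA loadings are computed directly as eigenvectors of the sample covariance matrix scaled by the square roots of their eigenvalues): with m = 1 retained component, $\tilde W \tilde W^\top$ is a rank-1 approximation of all of Σ, the diagonal included, since there is no Ψ term to absorb the unique variances.

```python
import numpy as np

rng = np.random.default_rng(4)

# the same kind of made-up data: k = 4 variables, one common source plus noise
n, k, m = 20_000, 4, 1
X = rng.normal(size=(n, 1)) @ np.array([[0.9, 0.8, 0.7, 0.6]]) \
    + 0.5 * rng.normal(size=(n, k))
Sigma = np.cov(X, rowvar=False)

# PCA loadings: eigenvectors scaled by sqrt of eigenvalues, keeping the top m
eigval, eigvec = np.linalg.eigh(Sigma)           # eigenvalues in ascending order
top = np.argsort(eigval)[::-1][:m]
W_tilde = eigvec[:, top] * np.sqrt(eigval[top])

print(np.round(Sigma, 3))
print(np.round(W_tilde @ W_tilde.T, 3))  # a rank-m approximation of the *whole* matrix:
                                         # with no Psi term, the diagonal must also be
                                         # explained by the loadings, so it falls short
```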

Further comments

I love the drawings in @ttnphns's answer (+1), but I would like to stress that they deal with the very special situation of two variables. If there are only two variables under consideration, the covariance matrix is 2×2, has only one off-diagonal element, and so one factor is always enough to reproduce it 100% (whereas PCA would need two components). However, in general, if there are many variables (say, a dozen or more), then neither PCA nor FA with a small number of components will be able to fully reproduce the covariance matrix; moreover, they will usually (though not necessarily!) produce similar results. See my answer here for some simulations supporting this claim and for further explanations:

So even though @ttnphns's drawings can give the impression that PCA and FA are very different, my opinion is that this is not the case, except with very few variables or in some other special situations.

See also:

Finally:

For example, let's take a look at the first loading vector $w_1$: for $1 \le i, j, k \le p$, if $w_{1i} = 10$, $w_{1j} = 11$ and $w_{1k} = 0.1$, then I'd say $x_i$ and $x_j$ are highly correlated, whereas $x_k$ seems uncorrelated with them, am I right?

This is not necessarily correct. Yes, in this example $x_i$ and $x_j$ are likely to be correlated, but you are forgetting about the other factors. Perhaps the loading vector $w_2$ of the second factor has large values for $x_i$ and $x_k$; this would mean that they are likely to be well correlated as well. You need to take all factors into account before drawing such conclusions.
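A tiny sketch of that last point (the loading values below are made up, and the unique variances are ignored): under the model, the implied covariance between two variables is the sum over factors of the products of their loadings, so a small loading on the first factor alone settles nothing.

```python
import numpy as np

# made-up loading matrices; rows are variables x_i, x_j, x_k, columns are factors 1 and 2
W_a = np.array([
    [10.0, 0.2],   # x_i: large loading on factor 1 only
    [11.0, 0.1],   # x_j: large loading on factor 1 only
    [ 0.1, 9.0],   # x_k: tiny loading on factor 1, large on factor 2
])
W_b = np.array([
    [10.0, 8.0],   # same x_i, but now also loading strongly on factor 2
    [11.0, 0.1],
    [ 0.1, 9.0],
])

# model-implied covariances (unique variances ignored): cov = W W^T,
# so cov(x_i, x_k) = w_1i * w_1k + w_2i * w_2k
for name, W in [("factor 2 not shared", W_a), ("factor 2 shared by x_i and x_k", W_b)]:
    C = W @ W.T
    print(name, "-> cov(x_i, x_k) =", round(float(C[0, 2]), 2))
# in the second case x_i and x_k covary strongly even though
# their loadings on the first factor alone would suggest otherwise
```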


Acknowledging your algebraic expertise and certainly greeting your answer, I nevertheless wouldn't be so sharp as to label somebody's previous geometric answer (mine in this instance) as "potentially misleading". Words so hugely different are yours, not mine. Second, "it is in fact not the case, except with very few variables" is itself a claim which has to be tested more deeply than you once did.
ttnphns

Hi @ttnphns, thanks for the comment. I have absolutely nothing against geometric answers, and in fact I prefer them when possible! I honestly like your answer very much and it has my +1. But I do think that considering only a case with two variables makes PCA-vs-FA differences appear stronger than they otherwise are and that this can be potentially (!) misleading. However, you are right in that I should not have used such words in my answer. I apologize, and I have edited it right now. Just to be completely clear: any hostility (if you felt any!) was purely unintentional.
amoeba says Reinstate Monica

@amoeba Why do some people say that FA preserves covariance and PCA preserves variance? From your post, I understand that FA indeed preserves covariance, but PCA tries to preserve both variance and covariance. Does the statement that PCA preserves variance come from its objective function rather than from the explanations in your post?
user_anon