What would be a robust Bayesian model for estimating the scale of a roughly normal distribution?



There exist a number of robust estimators of scale. A notable example is the median absolute deviation (MAD), which relates to the standard deviation as σ = 1.4826 · MAD. In a Bayesian framework there are a number of ways to estimate the location of a roughly normal distribution (say, a normal contaminated by outliers) in a robust manner, for example, one could assume the data are distributed as a t distribution or as a Laplace distribution. Now my question:

What would be a Bayesian model that measures the scale of a roughly normal distribution in a robust way, robust in the same sense as the MAD or similar robust estimators?

As with the MAD, it would be nice if the Bayesian model could approach the SD of a normal distribution in the case where the data actually are normally distributed.
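A quick simulation of the consistency property being asked for (my sketch in Python rather than the R used later, using only NumPy): for normal data, 1.4826 · MAD approaches the SD, and unlike the SD it barely moves under contamination.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.normal(loc=0.0, scale=2.0, size=100_000)

mad = np.median(np.abs(y - np.median(y)))  # raw median absolute deviation
sd_hat = 1.4826 * mad                      # consistency factor for the normal

# Replace 5% of the points with gross outliers.
y_bad = y.copy()
y_bad[:5_000] = rng.normal(loc=100.0, scale=50.0, size=5_000)
sd_bad = np.std(y_bad)                     # classical SD: ruined
mad_bad = 1.4826 * np.median(np.abs(y_bad - np.median(y_bad)))  # barely moves
```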

Edit 1:

A typical example of a model that is robust to contamination/outliers, under the assumption that the data yi are roughly normal, uses a t distribution, as in:

yi ~ t(m, s, ν)

where m is the mean, s the scale, and ν the degrees of freedom. With suitable priors on m, s and ν, m will be an estimate of the mean of the yi that is robust against outliers. However, s will not be a consistent estimator of the SD of the yi, as s depends on ν. For example, if ν were fixed at 4.0 and the model above were fitted to a huge number of samples from a Norm(μ=0, σ=1) distribution, then s would be around 0.82. What I am looking for is a model that is robust like the t model, but for the SD instead of (or in addition to) the mean.

Edit 2:

Here follows a coded example in R and JAGS of how the t model mentioned above is more robust with respect to the mean.

# generating some contaminated data
y <- c( rnorm(100, mean=10, sd=10), 
        rnorm(10, mean=100, sd= 100))

#### A "standard" normal model ####
model_string <- "model{
  for(i in 1:length(y)) {
    y[i] ~ dnorm(mu, inv_sigma2)
  }

  mu ~ dnorm(0, 0.00001)
  inv_sigma2 ~ dgamma(0.0001, 0.0001)
  sigma <- 1 / sqrt(inv_sigma2)
}"

library(rjags)
model <- jags.model(textConnection(model_string), list(y = y))
mcmc_samples <- coda.samples(model, "mu", n.iter=10000)
summary(mcmc_samples)

### The quantiles of the posterior of mu
##  2.5%   25%   50%   75% 97.5% 
##   9.8  14.3  16.8  19.2  24.1 
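The pull seen in these quantiles can be checked without MCMC: under the vague priors above, the posterior for mu centres essentially on the sample mean, which the contaminants drag upward. A Python sketch of mine, with the sample sizes scaled up tenfold so the effect is stable across seeds:

```python
import numpy as np

rng = np.random.default_rng(0)
# Same contamination pattern as the R code above, at 10x the sample size.
y = np.concatenate([rng.normal(10, 10, size=1000),
                    rng.normal(100, 100, size=100)])

mean_y = np.mean(y)      # the ~9% contamination drags this far above 10
median_y = np.median(y)  # stays near the clean component's centre
```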

#### A (more) robust t-model ####
library(rjags)
model_string <- "model{
  for(i in 1:length(y)) {
    y[i] ~ dt(mu, inv_s2, nu)
  }

  mu ~ dnorm(0, 0.00001)
  inv_s2 ~ dgamma(0.0001,0.0001)
  s <- 1 / sqrt(inv_s2)
  nu ~ dexp(1/30) 
}"

model <- jags.model(textConnection(model_string), list(y = y))
mcmc_samples <- coda.samples(model, "mu", n.iter=1000)
summary(mcmc_samples)

### The quantiles of the posterior of mu
##  2.5%   25%   50%   75% 97.5% 
##  8.03  9.35  9.99 10.71 12.14 

Maybe it's not robust enough, but the chi-squared distribution is the usual conjugate choice of prior for the inverse of the variance.
Mike Dunlavey

You might want to see whether the first answer to this question, stats.stackexchange.com/questions/6493/…, is enough for you; it may not be, but perhaps it is.
jbowman

What is your prior on the amount of contamination? Will the contamination be systematic? Random? Will it be generated by one distribution or by several? Do we know anything about the noise distribution? If at least some of the above are known, then we could fit some kind of mixture model. Otherwise I'm not sure what your beliefs about this problem actually are, and if you have none, this seems very vague. You need to fix something, otherwise you could pick a point at random and declare it the single Gaussian-generated point.
means-to-meaning

But in general, you could either fit a t-distribution which is more resistant against outliers, or a mixture of t-distributions. I'm sure there are many papers, here is one by Bishop research.microsoft.com/en-us/um/people/cmbishop/downloads/… and here is an R-package to fit mixtures: maths.uq.edu.au/~gjm/mix_soft/EMMIX_R/EMMIX-manual.pdf
means-to-meaning

1
Your σ = 1.4826 · MAD is true for a normally distributed population, but not for most other distributions
Henry

Answers:



Bayesian inference in a T noise model with an appropriate prior will give a robust estimate of location and scale. The precise conditions that the likelihood and prior need to satisfy are given in the paper Bayesian robustness modelling of location and scale parameters by Andrade and O'Hagan (2011). The estimates are robust in the sense that a single observation cannot make the estimates arbitrarily large, as demonstrated in figure 2 of the paper.

When the data is normally distributed, the SD of the fitted T distribution (for fixed ν) does not match the SD of the generating distribution. But this is easy to fix. Let σ be the standard deviation of the generating distribution and let s be the standard deviation of the fitted T distribution. If the data is scaled by 2, then from the form of the likelihood we know that s must scale by 2. This implies that s=σf(ν) for some fixed function f. This function can be computed numerically by simulation from a standard normal. Here is the code to do this:

library(stats)
library(stats4)
y = rnorm(100000, mean=0,sd=1)
nu = 4
nLL = function(s) -sum(stats::dt(y/s,nu,log=TRUE)-log(s))
fit = mle(nLL, start=list(s=1), method="Brent", lower=0.5, upper=2)
# the variance of a standard T is nu/(nu-2)
print(coef(fit)*sqrt(nu/(nu-2)))

For example, at ν=4 I get f(ν)=1.18. The desired estimator is then σ^=s/f(ν).
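The same calibration can be reproduced in Python (a SciPy sketch of mine, not the answer's code). The fitted scale comes out near 0.83, matching the ≈0.82 figure in edit 1, and multiplying by sqrt(ν/(ν−2)) gives f(4) ≈ 1.18:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(123)
y = rng.normal(size=100_000)  # standard normal data, so sigma = 1
nu = 4.0

# Negative log-likelihood of a location-0 t with scale c.
def nll(c):
    return -np.sum(stats.t.logpdf(y / c, df=nu) - np.log(c))

c_hat = optimize.minimize_scalar(nll, bounds=(0.5, 2.0), method="bounded").x

# f(nu) = SD of the fitted t divided by the generating sigma (= 1 here).
f_nu = c_hat * np.sqrt(nu / (nu - 2))
```

The corrected estimator σ̂ = s/f(ν) then recovers σ for normal data while keeping the t likelihood's resistance to outliers.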


Nice answer (+1). 'in the sense that a single observation cannot make the estimates arbitrarily large,' so the breakdown point is 2/n (I was wondering about this). As a point of comparison, for the procedure illustrated in my answer it is n/2.
user603

Wow, thanks! Fuzzy follow up question. Would it then actually make sense to "correct" the scale so it's consistent with the SD in the Normal case? The use case I'm thinking of is when reporting a measure of spread. I would have no problem with reporting scale, but it would be nice to report something that would be consistent with SD as it is the most common measure of spread (at least in psychology). Do you see a situation where this correction would lead to strange and inconsistent estimates?
Rasmus Bååth


As you are asking a question about a very precise problem (robust estimation), I will offer you an equally precise answer. First, however, I will begin by trying to dispel an unwarranted assumption. It is not true that there is a robust bayesian estimate of location (there are bayesian estimators of location, but as I illustrate below they are not robust and, apparently, even the simplest robust estimator of location is not bayesian). In my opinion, the reasons for the absence of overlap between the 'bayesian' and 'robust' paradigms in the location case go a long way in explaining why there are also no estimators of scatter that are both robust and bayesian.

With suitable priors on m,s and ν, m will be an estimate of the mean of yi that will be robust against outliers.

Actually, no. The resulting estimates will only be robust in a very weak sense of the word robust. However, when we say that the median is robust to outliers we mean the word robust in a much stronger sense. That is, in robust statistics, the robustness of the median refers to the property that if you compute the median on a data-set of observations drawn from a uni-modal, continuous model and then replace less than half of these observations by arbitrary values, the value of the median computed on the contaminated data is close to the value you would have had had you computed it on the original (uncontaminated) data-set. Then, it is easy to show that the estimation strategy you propose in the paragraph I quoted above is definitely not robust in the sense of how the word is typically understood for the median.
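A toy illustration of this stronger sense (mine, not the answer's): replace just under half of a clean sample with arbitrary values, and the median stays within the range of the clean data while the mean is carried off without bound.

```python
import numpy as np

rng = np.random.default_rng(7)
clean = rng.normal(0, 1, size=101)

dirty = clean.copy()
dirty[:40] = 1e9   # 40 of 101 observations set to an arbitrary huge value

# The median of the contaminated data is still one of the clean points;
# the mean is dominated by the contamination.
med_dirty = np.median(dirty)
mean_dirty = np.mean(dirty)
```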

I'm wholly unfamiliar with Bayesian analysis. However, I was wondering what is wrong with the following strategy as it seems simple, effective and yet has not been considered in the other answers. The prior is that the good part of the data is drawn from a symmetric distribution F and that the rate of contamination is less than half. Then, a simple strategy would be to:

  1. Compute the median/MAD of your dataset. Then compute:
     z_i = |x_i − med(x)| / mad(x)
  2. Exclude the observations for which z_i > q_α(z | x ∼ F) (this is the α quantile of the distribution of z when x ∼ F). This quantity is available for many choices of F and can be bootstrapped for the others.
  3. Run a (usual, non-robust) Bayesian analysis on the non-rejected observations.
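A sketch of these three steps in Python (the contamination setup and the α = 0.99 cutoff are my choices, with F taken as the normal):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# 5% of the sample is contaminated; the clean part is N(0, 2^2).
x = np.concatenate([rng.normal(0, 2, size=950),
                    rng.normal(50, 5, size=50)])

# Step 1: median/MAD z-scores.
mad = stats.median_abs_deviation(x, scale="normal")  # 1.4826 * raw MAD
z = np.abs(x - np.median(x)) / mad

# Step 2: reject observations beyond the alpha-quantile of |Z|, Z ~ N(0,1).
alpha = 0.99
cutoff = stats.norm.ppf(0.5 + alpha / 2)
kept = x[z <= cutoff]

# Step 3: a usual, non-robust analysis on the survivors (here just mean/SD).
mu_hat, sd_hat = kept.mean(), kept.std()
```

Note that the truncation in step 2 leaves sd_hat slightly below the true σ = 2, which is the consistency concern the OP raises in the comments.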

EDIT:

Thanks to the OP for providing self-contained R code to conduct a bona fide bayesian analysis of the problem.

The code below compares the bayesian approach suggested by the O.P. to its alternative from the robust statistics literature (e.g. the fitting method proposed by Gauss for the case where the data may contain as many as n/2−2 outliers and the distribution of the good part of the data is Gaussian).

central part of the data is N(1000,1):

n<-100
set.seed(123)
y<-rnorm(n,1000,1)

Add some amount of contaminants:

y[1:30]<-y[1:30]/100-1000 
w<-rep(0,n)
w[1:30]<-1

the index w takes value 1 for the outliers. I begin with the approach suggested by the O.P.:

library("rjags")
model_string<-"model{
  for(i in 1:length(y)){
    y[i]~dt(mu,inv_s2,nu)
  }
  mu~dnorm(0,0.00001)
  inv_s2~dgamma(0.0001,0.0001)
  s<-1/sqrt(inv_s2)
  nu~dexp(1/30) 
}"

model<-jags.model(textConnection(model_string),list(y=y))
mcmc_samples<-coda.samples(model,"mu",n.iter=1000)
print(summary(mcmc_samples)$statistics[1:2])
summary(mcmc_samples)

I get:

     Mean        SD 
384.2283  97.0445 

and:

2. Quantiles for each variable:

 2.5%   25%   50%   75% 97.5% 
184.6 324.3 384.7 448.4 577.7 

(quite far, then, from the target values)

For the robust method,

z<-abs(y-median(y))/mad(y)
th<-max(abs(rnorm(length(y))))
print(c(mean(y[which(z<=th)]),sd(y[which(z<=th)])))

one gets:

 1000.149 0.8827613

(very close to the target values)

The second result is much closer to the real values. But it gets worse. If we classify as outliers those observations for which the estimated z-score is larger than th (remember that the prior is that F is Gaussian) then the bayesian approach finds that all the observations are outliers (the robust procedure, in contrast, flags all and only the outliers as such). This also implies that if you were to run a usual (non-robust) bayesian analysis on the data not classified as outliers by the robust procedure, you should do fine (e.g. fulfil the objectives stated in your question).
This is just an example, but it's actually fairly straightforward to show (and it can be done formally; see, for example, chapter 2 of [1]) that the parameters of a Student t distribution fitted to contaminated data cannot be depended upon to reveal the outliers.

  • [1] Maronna, R. A., Martin, R. D., & Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics.
  • Huber, P. J. (1981). Robust Statistics. New York: John Wiley and Sons.

Well, the t is often proposed as a robust alternative to the normal distribution. I don't know if this is in the weak sense or not. See for example: Lange, K. L., Little, R. J., & Taylor, J. M. (1989). Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408), 881-896. pdf
Rasmus Bååth

This is the weak sense. If you have R code that implements the procedure you suggest, I'll be happy to illustrate my answer with an example. Otherwise you can get more explanation in chapter 2 of this textbook.
user603

The procedure I suggest is basically described here: indiana.edu/~kruschke/BEST including R code. I will have to think about your solution! It does not, however, seem Bayesian in the sense that it does not model all the data, just the subset that "survives" step 2.
Rasmus Bååth

I thank you for your interesting discussion! Your answer is not what I seek, however, because (1) you don't describe a Bayesian procedure, you describe more of a data preparation step for how to remove outliers, and (2) your procedure does not result in a consistent estimator of the SD; that is, if you sample from a normal distribution and increase the number of data points, you will not approach the "true" SD, rather your estimate will be a bit low. I also don't completely buy your definition of robust (your definition is not how I have seen it in most Bayesian literature I've come across).
Rasmus Bååth

I have now done so!
Rasmus Bååth


In bayesian analysis, a gamma prior on the precision (the inverse of the variance), equivalently an inverse-gamma prior on the variance, is a common choice; or the inverse-Wishart distribution for multivariate models. Adding a prior on the variance improves robustness against outliers.

There is a nice paper by Andrew Gelman: "Prior distributions for variance parameters in hierarchical models" where he discusses what good choices for the priors on the variances can be.


I'm sorry but I fail to see how this answers the question. I did not ask for a robust prior, but rather for a robust model.
Rasmus Bååth


A robust estimator for the location parameter μ of some dataset of size N is obtained when one assigns a Jeffreys prior to the variance σ² of the normal distribution and computes the marginal for μ, yielding a t distribution with N−1 degrees of freedom.

Similarly, if you want a robust estimator for the standard deviation σ of some data D, we can do the following:

First, we suppose that the data are normally distributed given their mean and standard deviation. Therefore,

D | μ, σ ~ N(μ, σ²)

and if D = (d_1, …, d_N) then

p(D | μ, σ²) = (2πσ²)^(−N/2) · exp( −(N / 2σ²) · ((m − μ)² + s²) )

where the sufficient statistics m and s² are

m = (1/N) Σ d_i,    s² = (1/N) Σ d_i² − m²

In addition, using Bayes' theorem, we have

p(μ, σ² | D) ∝ p(D | μ, σ²) · p(μ, σ²)

A convenient prior for (μ, σ²) is the Normal-inverse-gamma family, which covers a wide range of shapes and is conjugate to this likelihood. This means that the posterior distribution p(μ, σ² | D) still belongs to the normal-inverse-gamma family, and its marginal p(σ² | D) is an inverse-gamma distribution parameterized as

σ² | D ~ IG(α + N/2, β + N·s²/2),    α, β > 0
From this distribution, we can take the mode, which will give us an estimator for σ2. This estimator will be more or less tolerant to small excursions from misspecifications on the model by varying α and/or β. The variance of this distribution will then provide some indication on the fault-tolerance of the estimate. Since the tails of the inverse gamma are semi-heavy, you get the kind of behaviour you would expect from the t distribution estimate for μ that you mention.
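A sketch of the point estimate this answer describes (the code and the weakly informative choice α = β = 1 are mine; the update uses the inverse-gamma marginal with shape α + N/2 and rate β + N·s²/2):

```python
import numpy as np

rng = np.random.default_rng(11)
d = rng.normal(5.0, 3.0, size=10_000)  # true sigma^2 = 9
N = d.size

# Sufficient statistics as defined in the answer.
m = d.mean()
s2 = np.mean(d**2) - m**2

alpha, beta = 1.0, 1.0                 # weakly informative hyperparameters
a = alpha + N / 2
b = beta + N * s2 / 2

sigma2_mode = b / (a + 1)              # mode of an inverse-gamma(a, b)
```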

"A robust estimator for the location parameter μ of some dataset of size N is obtained when one assigns a Jeffreys prior to the variance σ² of the normal distribution." Isn't the Normal model you describe a typical example of a non-robust model? That is, a single value that is off can have great influence on the parameters of the model. There is a big difference between the posterior over the mean being a t distribution (as in your case) and the distribution for the data being a t distribution (as is a common example of a robust Bayesian model for estimating the mean).
Rasmus Bååth

It all depends on what you mean by robust. What you are saying right now is that you would like robustness wrt data. What I was proposing was robustness wrt model mis-specification. They are both different types of robustness.
yannick

I would say that the examples I gave, MAD and using a t distribution as the distribution for the data are examples of robustness with respect to data.
Rasmus Bååth

I would say Rasmus is right, and so would Gelman et al. in BDA3, as would a basic understanding that the t distribution has fatter tails than the normal for the same location parameter
Brash Equilibrium


I have followed the discussion from the original question. Rasmus, when you say robustness I am sure you mean in the data (outliers, not mis-specification of distributions). I will take the distribution of the data to be the Laplace distribution instead of a t distribution; then, as in normal regression where we model the mean, here we will model the median (very robust), aka median regression (which we all know). Let the model be:

Y = βX + ε, where ε ~ Laplace(0, σ²).

Of course our goal is to estimate the model parameters. We expect our priors to be vague in order to have an objective model. The model at hand has a posterior of the form f(β, σ | Y, X). Giving β a normal prior with large variance makes that prior vague, and a chi-squared prior with small degrees of freedom, mimicking a Jeffreys (vague) prior, is given to σ². With a Gibbs sampler, what happens? Normal prior + Laplace likelihood: we do not know the resulting distribution. Chi-squared prior + Laplace likelihood: we do not know the distribution either. Fortunately for us, there is a theorem in (Aslan, 2010) that transforms a Laplace likelihood into a scale mixture of normal distributions, which then enables us to enjoy the conjugacy properties of our priors. I think the whole process described is fully robust in terms of outliers. In a multivariate setting the chi-squared becomes a Wishart distribution, and we use multivariate Laplace and normal distributions.
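The scale-mixture fact invoked here can be checked by simulation (a sketch of mine illustrating the representation, not the Gibbs sampler itself): if W ~ Exp(1) and X | W ~ N(0, 2b²W), then X is Laplace(0, b), with variance 2b² and excess kurtosis 3.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
b = 1.5
n = 200_000

# Draw the mixing variable, then a normal whose variance is scaled by it.
w = rng.exponential(scale=1.0, size=n)
x = rng.normal(0.0, np.sqrt(2 * b**2 * w))

var_x = x.var()             # should approach 2 * b^2 = 4.5
kurt_x = stats.kurtosis(x)  # excess kurtosis; 3 for a Laplace, 0 for a normal
```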


Your solution seems to be focused on robust estimation of the location (mean/median). My question was rather about estimation of scale, with the property of consistency with respect to retrieving the SD when the data-generating distribution actually is normal.
Rasmus Bååth

With a robust estimate of the location, the scale as function of the location immediately benefits from the robustness of the location. There is no other way of making the scale robust.
Chamberlain Foncha

Anyway I must say I am eagerly waiting to see how this problem will be tackled most especially with a normal distribution as you emphasized.
Chamberlain Foncha


Suppose that you have K groups and you want to model the distribution of their sample variances, perhaps in relation to some covariates x. That is, suppose that your data point for group k ∈ 1, …, K is Var(y_k) ∈ [0, ∞). The question here is, "What is a robust model for the likelihood of the sample variance?" One way to approach this is to model the transformed data ln[Var(y_k)] as coming from a t distribution, which as you have already mentioned is a robust version of the normal distribution. If you don't feel like assuming that the transformed variance is approximately normal as n → ∞, then you could choose a probability distribution with positive real support that is known to have heavy tails compared to another distribution with the same location. For example, there is a recent answer to a question on Cross Validated about whether the lognormal or gamma distribution has heavier tails, and it turns out that the lognormal distribution does (thanks to @Glen_b for that contribution). In addition, you could explore the half-Cauchy family.
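A small sketch of this setup (the group structure and sizes are my invention for illustration): simulate K roughly normal groups, take the log of each sample variance, and fit a location-scale t to those transformed points with SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
K, n, sigma = 200, 100, 2.0

# K groups of n observations each, all roughly N(0, sigma^2).
groups = rng.normal(0.0, sigma, size=(K, n))
log_vars = np.log(groups.var(axis=1, ddof=1))

# Heavy-tailed model for the transformed data ln[Var(y_k)].
df_hat, loc_hat, scale_hat = stats.t.fit(log_vars)
```

With no contamination, the fitted location sits near ln(σ²), as one would hope.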

Similar reasoning applies if instead you are assigning a prior distribution over a scale parameter for a normal distribution. Tangentially, the lognormal and inverse-gamma distributions are not advisable if you want to form a boundary avoiding prior for the purposes of posterior mode approximation because they peak sharply if you parameterize them so that the mode is near zero. See BDA3 chapter 13 for discussion. So in addition to identifying a robust model in terms of tail thickness, keep in mind that kurtosis may matter to your inference, too.

I hope this helps you as much as your answer to one of my recent questions helped me.


My question was about the situation when you have one group and how to robustly estimate the scale of that group. In the case of outliers I don't believe the sample variance is considered robust.
Rasmus Bååth

If you have one group, and you are estimating its normal distribution, then your question applies to the form of the prior over its scale parameter. As my answer implies, you can use a t distribution over its log transformation or choose a fat-tailed distribution with positive real support, being careful about other aspects of that distribution such as its kurtosis. Bottom line, if you want a robust model for a scale parameter, use a t distribution over its log transform or some other fat-tailed distribution.
Brash Equilibrium
Licensed under cc by-sa 3.0 with attribution required.