How do I calculate a weighted standard deviation? In Excel?


29

So, I have a data set of percentages like so:

100   /   10000   = 1% (0.01)
2     /     5     = 40% (0.4)
4     /     3     = 133% (1.33)
1000  /   2000    = 50% (0.5)

I want to find the standard deviation of the percentages, but weighted by their data volume, i.e. the first and last data points should dominate the calculation.

How do I do that? And is there a simple way to do it in Excel?


The formula with $(M-1)/M$ is correct. If you have doubts, check it by setting all weights equal to 1: you will get the classical formula for the unbiased estimate of the standard deviation, with $(N-1)$ in the denominator. And unusual does not mean wrong.

1
The formula with $(M-1)/M$ IS NOT CORRECT. Imagine you added a million points with weights of one trillionth. You don't change your answer at all, regardless of what those weights are, but your $(M-1)/M$ term becomes 1? Absolutely not! If you care that $(M-1)/M \neq 1$, then you also care that this is just wrong.
Rex Kerr

The highest-voted answer is correct. Please check itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf
Bo Wang

I wonder why you want a standard deviation here? You only have 4 numbers! How is that too many numbers? Especially when percentages are easier to explain and understand.
probabilityislogic

@probabilityislogic - this is a simplified example, to keep the question brief.
Yahel

Answers:


35

The formula for the weighted standard deviation is:

$$\sqrt{\frac{\sum_{i=1}^{N} w_i (x_i - \bar{x})^2}{\frac{(M-1)}{M} \sum_{i=1}^{N} w_i}},$$

where

$N$ is the number of observations.

$M$ is the number of nonzero weights.

$w_i$ are the weights.

$x_i$ are the observations.

$\bar{x}$ is the weighted mean.

Remember that the formula for weighted mean is:

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}.$$

Use the appropriate weights to get the desired result. In your case I would suggest using $\frac{\text{Number of cases in segment}}{\text{Total number of cases}}$.

To do this in Excel, you need to calculate the weighted mean first. Then calculate the $(x_i - \bar{x})^2$ terms in a separate column. The rest should be easy.
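A minimal Python sketch of the formula above, for readers who want to check a spreadsheet result. It assumes the weights are the denominators from the question (the "data volume"); the function name `weighted_sd` is made up for this illustration. Because the weights appear in both numerator and denominator, rescaling them (for example to fractions of the total) does not change the result.

```python
import math

def weighted_sd(values, weights):
    """Weighted SD with the (M-1)/M factor, where M counts the nonzero weights."""
    total_w = sum(weights)
    m = sum(1 for w in weights if w != 0)
    if m < 2 or total_w == 0:
        raise ValueError("need at least two nonzero weights")
    # Weighted mean
    mean = sum(w * x for x, w in zip(values, weights)) / total_w
    # Weighted sum of squared deviations from the weighted mean
    ss = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    return math.sqrt(ss / ((m - 1) / m * total_w))

# Data from the question; weights are assumed to be the denominators.
values = [100 / 10000, 2 / 5, 4 / 3, 1000 / 2000]
weights = [10000, 5, 3, 2000]
print(weighted_sd(values, weights))
```

With all weights equal to 1, this reduces to the usual sample standard deviation with $N-1$ in the denominator.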


2
@Gilles, you're right. deps_stats, the fraction $(M-1)/M$ in the SD is unusual. Do you have a citation for this formula or can you at least explain the reason for including that term?
whuber

4
@Aaron Weights are not always defined to sum to unity, as exemplified by the weights given in this question!
whuber

2
(-1) I am downvoting this answer because no justification or reference for the $(M-1)/M$ term has been provided (and I'm pretty sure it does not make the estimate of the variance unbiased, which would be its apparent motivation).
whuber

1
In light of the added reference (which is not authoritative, but it is a reference) I am removing the downvote. I am not upvoting this answer, though, because calculations show the proposed weighting does not produce an unbiased estimate of anything at all (except when all weights equal 1). The real difficulty here--which is the fault of the question, not the answer--is that it's not clear what this "weighted standard deviation" is attempting to estimate. Without a definite estimand, there is no justification to introduce an $(M-1)/M$ factor to "reduce bias" (or for any other reason).
whuber

1
@Mikhail You are correct that "unusual" and "right" have little to do with one another. However, unusual results do implicitly demand a little more justification because being unusual is one indicator that an error may have been made. Your argument is invalid: although the formula indeed reduces to one for an unbiased estimator when all weights are equal, that does not imply the estimator remains unbiased when unequal weights are used. I am not asserting your conclusion is wrong, but only that so far no valid justification has been offered.
whuber

18

The formulae are available in various places, including Wikipedia.

The key is to notice that it depends on what the weights mean. In particular, you will get different answers if the weights are frequencies (i.e. you are just trying to avoid adding up your whole sum), if the weights are in fact the variance of each measurement, or if they're just some external values you impose on your data.

In your case, it superficially looks like the weights are frequencies but they're not. You generate your data from frequencies, but it's not a simple matter of having 45 records of 3 and 15 records of 4 in your data set. Instead, you need to use the last method. (Actually, all of this is rubbish--you really need to use a more sophisticated model of the process that is generating these numbers! You apparently do not have something that spits out Normally-distributed numbers, so characterizing the system with the standard deviation is not the right thing to do.)

In any case, the formula for variance (from which you calculate standard deviation in the normal way) with "reliability" weights is

$$\frac{\sum w_i (x_i - x^*)^2}{\sum w_i - \frac{\sum w_i^2}{\sum w_i}}$$

where $x^* = \sum w_i x_i / \sum w_i$ is the weighted mean.

You don't have an estimate for the weights, which I'm assuming you want to take to be proportional to reliability. Taking percentages the way you are is going to make analysis tricky even if they're generated by a Bernoulli process, because if you get a score of 20 and 0, you have infinite percentage. Weighting by the inverse of the SEM is a common and sometimes optimal thing to do. You should perhaps use a Bayesian estimate or Wilson score interval.
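A short Python sketch of the "reliability" weights variance given above, under the assumption that the weights are arbitrary positive numbers proportional to reliability. The function name is hypothetical. The second call illustrates a property of this form: a large number of points with negligible weight barely moves the result.

```python
def reliability_weighted_variance(values, weights):
    """Variance with reliability weights: sum(w*(x - mean)^2) / (V1 - V2/V1)."""
    v1 = sum(weights)                  # sum of weights
    v2 = sum(w * w for w in weights)   # sum of squared weights
    mean = sum(w * x for x, w in zip(values, weights)) / v1
    ss = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    denom = v1 - v2 / v1
    if denom <= 0:
        raise ValueError("need at least two points with nonzero weight")
    return ss / denom

base_values = [1.0, 2.0, 3.0, 4.0]
base_weights = [1.0, 1.0, 1.0, 1.0]
print(reliability_weighted_variance(base_values, base_weights))

# Add 1000 extra points with weight 1e-12 each; the variance is almost unchanged.
extra_values = base_values + [100.0] * 1000
extra_weights = base_weights + [1e-12] * 1000
print(reliability_weighted_variance(extra_values, extra_weights))
```

Take the square root of the result to get the standard deviation.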


2
+1. The discussion of the different meanings of weights was what I was looking for in this thread all along. It is an important contribution to all of this site's questions about weighted statistics. (I am a little concerned about the parenthetical remarks concerning normal distributions and standard deviations, though, because they incorrectly suggest that SDs have no use outside a model based on normality.)
whuber

@whuber - Well, central limit theorem to the rescue, of course! But for what the OP was doing, trying to characterize that set of numbers with a mean and standard deviation seems exceedingly inadvisable. And in general, for many uses the standard deviation ends up luring one into a false feeling of understanding. For instance, if the distribution is anything but normal (or a good approximation thereof), relying on the standard deviation will give you a bad idea of the shape of the tails, when it is exactly those tails that you probably most care about in statistical testing.
Rex Kerr

@RexKerr We can hardly blame standard deviation if people place interpretations on it that are undeserved. But let's move away from normality and consider the much broader class of continuous, symmetric unimodal distributions with finite variance (for example). Then between 89 and 100 percent of the distribution lies within two standard deviations. That's often pretty useful to know (and 95% lies pretty much in the middle, so it's never more than about 7% off); with many common distributions, dropping the symmetry aspect doesn't change much (e.g. look at the exponential).... ctd
Glen_b -Reinstate Monica

ctd... -- or if we don't make any of those assumptions, there's always the ordinary Chebyshev bounds which do at least say something about the tails and standard deviation..
Glen_b -Reinstate Monica

1
@Gabriel - Yes, sorry, I was being sloppy. (I figure people can tell which is which by glancing.) I've corrected my description.
Rex Kerr

5
=SQRT(SUM(G7:G16*(H7:H16-(SUMPRODUCT(G7:G16,H7:H16)/SUM(G7:G16)))^2)/
     ((COUNTIFS(G7:G16,"<>0")-1)/COUNTIFS(G7:G16,"<>0")*SUM(G7:G16)))

Column G holds the weights, column H the values.


Using Ctrl+Shift+Enter was a gotcha for me, but this seems to work otherwise.
philipkd

1

If we treat weights like probabilities, then we build them as follows:

$$p_i = \frac{v_i}{\sum_i v_i},$$
where $v_i$ is the data volume.

Next, the weighted mean is

$$\hat{\mu} = \sum_i p_i x_i,$$
and the variance is
$$\hat{\sigma}^2 = \sum_i p_i (x_i - \hat{\mu})^2$$
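A brief Python sketch of this probability-weight version, assuming the data volumes are the denominators from the question; the function name is invented for illustration. Note that this variance has no small-sample correction factor.

```python
def probability_weighted_moments(values, volumes):
    """Return (mean, variance) using p_i = v_i / sum(v) as probabilities."""
    total = sum(volumes)
    p = [v / total for v in volumes]
    mu = sum(pi * xi for pi, xi in zip(p, values))
    var = sum(pi * (xi - mu) ** 2 for pi, xi in zip(p, values))
    return mu, var

# Data from the question; volumes are assumed to be the denominators.
values = [100 / 10000, 2 / 5, 4 / 3, 1000 / 2000]
volumes = [10000, 5, 3, 2000]
mu, var = probability_weighted_moments(values, volumes)
print(mu, var ** 0.5)
```

With these weights the mean equals the pooled ratio, that is, the sum of the numerators divided by the sum of the denominators.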

0
Option Explicit

' Weighted standard deviation of vals, using wates as the weights.
' Both ranges must contain the same number of cells.
Function wsdv(vals As Range, wates As Range) As Double
    Dim i As Long, N As Long
    Dim wi As Double, xi As Double, WgtAvg As Double
    Dim sumProd As Double, SUMwi As Double

    sumProd = 0
    SUMwi = 0
    N = vals.Count  ' number of values to determine W Standard Deviation

    ' Weighted mean of the values
    WgtAvg = WorksheetFunction.SumProduct(vals, wates) / WorksheetFunction.Sum(wates)

    For i = 1 To N  ' accumulate the sum of weights and the weighted sum of squared deviations
        wi = wates.Cells(i).Value  ' i-th weight, read from the range itself rather than ActiveSheet
        SUMwi = SUMwi + wi
        xi = vals.Cells(i).Value  ' i-th value
        sumProd = sumProd + wi * (xi - WgtAvg) ^ 2
    Next i

    ' N / (N - 1) uses the count of all values, including any with zero weight
    wsdv = (sumProd / SUMwi * N / (N - 1)) ^ (1 / 2)  ' output of weighted standard deviation

End Function

2
Welcome to the site, @uswer71015. This seems to be only code. Can you add some text / explanation of how the code works & how it answers the question?
gung - Reinstate Monica
Licensed under cc by-sa 3.0 with attribution required.