If I wanted the probability of 9 successes in 16 trials, each with success probability 0.6, I could use a binomial distribution. What can I use if each of the 16 trials has a different probability of success?
Answers:
This is the sum of 16 (presumably independent) Binomial trials. The assumption of independence allows us to multiply probabilities. Hence, after two trials with probabilities $p_1$ and $p_2$ of success, the chance of success on both trials is $p_1 p_2$, the chance of no successes is $(1-p_1)(1-p_2)$, and the chance of exactly one success is $p_1(1-p_2) + (1-p_1)p_2$. That last expression owes its validity to the fact that the two ways of getting exactly one success are mutually exclusive: at most one of them can actually happen. That means their probabilities add.
By means of these two rules--independent probabilities multiply and mutually exclusive ones add--you can work out the answers for, say, 16 trials with probabilities $p_1, \ldots, p_{16}$. To do so, you need to account for all the ways of obtaining each given number of successes (such as 9). There are $\binom{16}{9} = 11440$ ways to achieve 9 successes. One of them, for example, occurs when trials 1, 2, 4, 5, 6, 11, 12, 14, and 15 are successes and the others are failures. The successes had probabilities $p_1, p_2, p_4, p_5, p_6, p_{11}, p_{12}, p_{14},$ and $p_{15}$ and the failures had probabilities $1-p_3, 1-p_7, 1-p_8, 1-p_9, 1-p_{10}, 1-p_{13}, 1-p_{16}$. Multiplying these 16 numbers gives the chance of this particular sequence of outcomes. Summing this number along with the 11,439 remaining such numbers gives the answer.
Of course you would use a computer.
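As a minimal sketch of that brute-force sum (not the answerer's own code), the following Python enumerates every 9-element subset of the 16 trials, multiplies within a subset, and adds across subsets. The function name `prob_exact_successes` is hypothetical, and the probabilities $p_i = i/17$ are an illustrative choice borrowed from another answer in this thread.

```python
from itertools import combinations
from math import prod


def prob_exact_successes(p, k):
    """P(exactly k successes) by summing over every k-subset of the trials."""
    n = len(p)
    total = 0.0
    for successes in combinations(range(n), k):
        chosen = set(successes)
        # Independent trials: multiply the success and failure probabilities.
        term = prod(p[i] if i in chosen else 1.0 - p[i] for i in range(n))
        # Distinct subsets are mutually exclusive outcomes: add them up.
        total += term
    return total


# Illustrative probabilities p_i = i/17 (an assumption for this example).
p = [i / 17 for i in range(1, 17)]
print(prob_exact_successes(p, 9))
```

The loop visits all 11,440 subsets, so it is fine for 16 trials but grows quickly with more.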
With many more than 16 trials, there is a need to approximate the distribution. Provided none of the probabilities $p_i$ and $1-p_i$ get too small, a Normal approximation tends to work well. With this method you note that the expectation of the sum of $n$ trials is $\mu = p_1 + p_2 + \cdots + p_n$ and (because the trials are independent) the variance is $\sigma^2 = p_1(1-p_1) + p_2(1-p_2) + \cdots + p_n(1-p_n)$. You then pretend the distribution of sums is Normal with mean $\mu$ and standard deviation $\sigma$. The answers tend to be good for computing probabilities corresponding to a number of successes that differs from $\mu$ by no more than a few multiples of $\sigma$. As $n$ grows large this approximation gets ever more accurate and works for even larger multiples of $\sigma$ away from $\mu$.
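A rough sketch of this Normal approximation in Python follows. The mean is the sum of the trial probabilities and the variance is the sum of $p_i(1-p_i)$. The half-unit continuity correction is an added assumption (the answer does not mention it), the function names are hypothetical, and the probabilities $p_i = i/17$ are only illustrative.

```python
from math import erf, sqrt


def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def normal_approx_pmf(p, k):
    """Approximate P(S = k) for a sum of independent Bernoulli(p_i) trials."""
    mu = sum(p)                                  # expectation of the sum
    sigma = sqrt(sum(q * (1.0 - q) for q in p))  # independence: variances add
    # Continuity correction (an assumption): treat S = k as k - 0.5 < X < k + 0.5.
    return normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)


# Illustrative probabilities p_i = i/17 (an assumption for this example).
p = [i / 17 for i in range(1, 17)]
print(normal_approx_pmf(p, 9))
```

Even with only 16 trials the result should land near the exact value for these particular probabilities, since none of them is extreme.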
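One alternative to @whuber's normal approximation is to use "mixing" probabilities, or a hierarchical model. This would apply when the $p_i$ are similar in some way, and you can model this by a probability distribution $p_i \sim Dist(\theta)$ with a density function $g(p|\theta)$ indexed by some parameter $\theta$. You get an integral equation:
$$\Pr(s=9|n=16,\theta) = \binom{16}{9}\int_0^1 p^9 (1-p)^7 g(p|\theta)\,dp$$
The binomial probability comes from setting $g(p|\theta) = \delta(p-\theta)$, the normal approximation comes from (I think) setting $g(p|\theta) = g(p|\mu,\sigma) = \frac{1}{\sigma}\phi\left(\frac{p-\mu}{\sigma}\right)$ (with $\mu$ and $\sigma$ as defined in @whuber's answer) and then noting the "tails" of this PDF fall off sharply around the peak.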
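You could also use a beta distribution, which would lead to a simple analytic form, and which need not suffer from the "small p" problem that the normal approximation does, as the beta is quite flexible. Use a $Beta(\alpha,\beta)$ distribution with $\alpha, \beta$ set by the solutions to the following equations (these are the "minimum KL divergence" estimates):
$$\psi(\alpha) - \psi(\alpha+\beta) = \frac{1}{n}\sum_{i=1}^{n}\log[p_i]$$
$$\psi(\beta) - \psi(\alpha+\beta) = \frac{1}{n}\sum_{i=1}^{n}\log[1-p_i]$$
where $\psi(\cdot)$ is the digamma function, closely related to the harmonic series.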
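We get the "beta-binomial" compound distribution:
$$\binom{16}{9}\frac{1}{B(\alpha,\beta)}\int_0^1 p^{9+\alpha-1}(1-p)^{7+\beta-1}\,dp = \binom{16}{9}\frac{B(\alpha+9,\beta+7)}{B(\alpha,\beta)}$$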
This distribution converges towards a normal distribution in the case that @whuber points out, and should give reasonable answers for small $n$ and skewed $p_i$, but not for multimodal $p_i$, as the beta distribution only has one peak. But you can easily fix this by using $M$ beta distributions for the $M$ modes. You break up the integral over $0<p<1$ into $M$ pieces so that each piece has a unique mode (and enough data to estimate parameters), and fit a beta distribution within each piece. Then add up the results, noting that with the change of variables $p = \frac{x-L}{U-L}$ for $L<x<U$ the beta integral transforms to:
$$B(\alpha,\beta) = \int_L^U \frac{(x-L)^{\alpha-1}(U-x)^{\beta-1}}{(U-L)^{\alpha+\beta-1}}\,dx$$
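For reference, the beta-binomial probability $\binom{n}{s} B(\alpha+s, \beta+n-s) / B(\alpha,\beta)$ is straightforward to evaluate once $\alpha$ and $\beta$ are known. The sketch below assumes they have already been fitted; it does not solve the digamma equations, and the values passed in are placeholders rather than fitted estimates. The function names are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(s, n, alpha, beta):
    """C(n, s) * B(alpha + s, beta + n - s) / B(alpha, beta)."""
    log_ratio = log_beta(alpha + s, beta + n - s) - log_beta(alpha, beta)
    return comb(n, s) * exp(log_ratio)


# Placeholder parameters (an assumption): alpha = beta = 1 is the uniform
# mixing density, for which every count s = 0..n has probability 1/(n + 1).
print(beta_binomial_pmf(9, 16, 1.0, 1.0))
```

With $\alpha = \beta = 1$ the result is exactly $\frac{1}{17}$ for every $s$, which is a handy sanity check on the formula.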
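Let $X_i \sim Bernoulli(p_i)$ with probability generating function (pgf):
$$\text{pgf} = E[t^{X_i}] = 1 - p_i(1-t)$$
Let $S = \sum_{i=1}^{n} X_i$ denote the sum of $n$ such independent random variables. Then, the pgf for the sum $S$ of $n = 16$ such variables is:
$$\text{pgfS} = E[t^S] = E[t^{X_1}]\,E[t^{X_2}] \cdots E[t^{X_{16}}] = \prod_{i=1}^{16}\left(1 - p_i(1-t)\right)$$
where the middle equality follows by independence. We seek $P(S=9)$, which is:
$$\frac{1}{9!}\left.\frac{d^9\,\text{pgfS}}{dt^9}\right|_{t=0}$$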
ALL DONE. This produces the exact symbolic solution as a function of the $p_i$. The answer is rather long to print on screen, but it is entirely tractable, and takes only a small fraction of a second to evaluate using Mathematica on my computer.
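The same idea can be sketched outside Mathematica. The Python below is an illustration only, not the answerer's code: it expands the product $\prod_i (1 - p_i + p_i t)$ one factor at a time and reads off the coefficient of $t^9$, which equals the ninth derivative at $t = 0$ divided by $9!$. Exact rational arithmetic via `fractions.Fraction` stands in for the symbolic computation, and the name `pgf_coefficients` is hypothetical.

```python
from fractions import Fraction


def pgf_coefficients(p):
    """Coefficients of prod_i (1 - p_i + p_i * t), lowest power of t first."""
    coeffs = [Fraction(1)]
    for q in p:
        nxt = [Fraction(0)] * (len(coeffs) + 1)
        for s, c in enumerate(coeffs):
            nxt[s] += c * (1 - q)    # this trial fails: power of t unchanged
            nxt[s + 1] += c * q      # this trial succeeds: power rises by one
        coeffs = nxt
    return coeffs


# Probabilities p_i = i/17, matching the Mathematica example in this answer.
p = [Fraction(i, 17) for i in range(1, 17)]
pmf = pgf_coefficients(p)            # pmf[s] is exactly P(S = s)
print(float(pmf[9]))
```

Because the coefficients are exact fractions, they sum to exactly 1, and the whole pmf comes out of a single pass over the trials.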
Examples
If $p_i = \frac{i}{17}$, then:
$$P(S=9) \approx 0.198268$$
If $p_i = \frac{\sqrt{i}}{17}$, then the same calculation applies with these $p_i$ substituted.
More than 16 trials?
With more than 16 trials, there is no need to approximate the distribution. The above exact method works just as easily for considerably larger $n$: it takes only a fraction of a second to evaluate the entire pmf (i.e. at every value $s = 0, 1, \dots, n$) using the code below.
Mathematica code
Given a vector of $p_i$ values, say:
n = 16; pvals = Table[Subscript[p, i] -> i/(n+1), {i, n}];
... here is some Mathematica code to do everything required:
pgfS = Expand[ Product[1-(1-t)Subscript[p,i], {i, n}] /. pvals];
D[pgfS, {t, 9}]/9! /. t -> 0 // N
0.198268
To derive the entire pmf:
Table[D[pgfS, {t,s}]/s! /. t -> 0 // N, {s, 0, n}]
... or use the even neater and faster (thanks to a suggestion from Ray Koopman below):
CoefficientList[pgfS, t] // N
For a much larger example, it takes just 1 second to calculate pgfS, and then 0.002 seconds to derive the entire pmf using CoefficientList, so it is extremely efficient.
With[{p = Range@16/17}, N@Coefficient[Times@@(1-p+p*t),t,9]]
gives the probability of 9 successes, and
With[{p = Range@16/17}, N@CoefficientList[Times@@(1-p+p*t),t]]
gives the probabilities of 0,...,16 successes.
The use of Table for the $p_i$-values is intentional, to allow for more general forms not suitable with Range. Your use of CoefficientList is very nice! I've added an Expand to the code above, which speeds the direct approach up enormously. Even so, CoefficientList is even faster than a ParallelTable. It does not make much difference for $n$ under 50 (both approaches take just a tiny fraction of a second either way to generate the entire pmf), but your CoefficientList will also be a real practical advantage when $n$ is really large.
@wolfies' comment, and my attempt at a response to it, revealed an important problem with my other answer, which I will discuss later.
Specific Case (n=16)
There is a fairly efficient way to code up the full distribution by using the "trick" of using base 2 (binary) numbers in the calculation. It only requires 4 lines of R code to get the full distribution of $Y = \sum_{i=1}^{n} Z_i$ where $\Pr(Z_i = 1) = p_i$. Basically, there are a total of $2^n$ choices of the vector $z = (z_1, \dots, z_n)$ that the binary variables $Z_i$ could take. Now suppose we number each distinct choice from $1$ up to $2^n$. This on its own is nothing special, but now suppose that we represent the "choice number" using base 2 arithmetic. Now take $n = 3$ so I can write down all the choices, so there are $2^3 = 8$ choices. Then $1, 2, 3, 4, 5, 6, 7, 8$ in "ordinary numbers" becomes $1, 10, 11, 100, 101, 110, 111, 1000$ in "binary numbers". Now suppose we write these as four digit numbers; then we have $0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000$. Now look at the last $3$ digits of each number: $001$ can be thought of as $(Z_1 = 0, Z_2 = 0, Z_3 = 1) \implies Y = 1$, etc. Counting in binary form provides an efficient way to organise the summation. Fortunately, there is an R function which can do this binary conversion for us, called intToBits(x). If we convert the raw binary form into a numeric via as.numeric(intToBits(x)), then we will get a vector with $32$ elements, each element being a digit of the base 2 version of our number (read from right to left, not left to right). Using this trick combined with some other R vectorisations, we can calculate the probability that $y = 9$ in 4 lines of R code:
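exact_calc <- function(y, p) {
  n <- length(p)
  # Row i of z holds the lowest n binary digits of i (columns n+1,...,32 are always 0)
  z <- t(matrix(as.numeric(intToBits(1:2^n)), ncol = 2^n))[, 1:n]
  # Log-probability of each outcome: sum of z*log(p) + (1-z)*log(1-p)
  pz <- z %*% log(p / (1 - p)) + sum(log(1 - p))
  # Add up the outcome probabilities that share the same number of successes
  ydist <- rowsum(exp(pz), rowSums(z))
  return(ydist[y + 1])
}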
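To make the binary-counting idea concrete outside R, here is a minimal Python sketch of the same enumeration (an illustration, not a translation of the vectorised R code). Each integer from 0 to $2^n - 1$ is read as a bit pattern, bit $i$ saying whether trial $i$ succeeded. The function name `exact_distribution` is hypothetical and the $p_i = i/17$ values are the uniform case discussed in this thread.

```python
def exact_distribution(p):
    """Full pmf of Y = sum of Bernoulli(p_i), enumerating all 2^n outcomes."""
    n = len(p)
    ydist = [0.0] * (n + 1)
    for mask in range(2 ** n):
        prob = 1.0
        successes = 0
        for i in range(n):
            if (mask >> i) & 1:      # bit i is set: trial i is a success
                prob *= p[i]
                successes += 1
            else:                    # bit i is clear: trial i is a failure
                prob *= 1.0 - p[i]
        ydist[successes] += prob     # group outcomes by number of successes
    return ydist


# Uniform case p_i = i/17 with n = 16, i.e. 65,536 outcomes.
p = [i / 17 for i in range(1, 17)]
ydist = exact_distribution(p)
print(ydist[9])
```

Unlike the R version, this loop keeps only the running totals in memory, but its running time still doubles with every extra trial.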
Plugging in the uniform case $p_i = \frac{i}{17}$ and the square-root case $p_i = \frac{\sqrt{i}}{17}$ gives the full distribution for $y$ in each case.
So for the specific problem of $y$ successes in $n$ trials, the exact calculations are straightforward. This also works for somewhat larger $n$, up to the point where the matrix of $2^n$ outcomes no longer fits in memory; beyond that, different computing tricks are needed.
Note that by applying my suggested "beta distribution" to the uniform case we get parameter estimates $\alpha, \beta$ that give a probability estimate that is nearly uniform in $y$, so the approximate value of $\Pr(y=9)$ is close to $\frac{1}{17}$ rather than the exact value of about $0.198$. This seems strange given that the density of a beta distribution with these parameters closely approximates the histogram of the $p_i$ values. What went wrong?
General Case
I will now discuss the more general case, and why my simple beta approximation failed. Basically, writing $y|p \sim Binom(n,p)$ and then mixing over $p$ with another distribution $p \sim f(\theta)$ is actually making an important assumption: that we can approximate the actual probability with a single binomial probability, and the only problem that remains is which value of $p$ to use. One way to see this is to use the mixing density which is discrete uniform over the actual $p_i$. So we replace the beta distribution $p \sim Beta(a,b)$ with a discrete density $p \sim \sum_{i=1}^{16} w_i\,\delta(p - p_i)$. Then the mixing approximation can be expressed in words as: choose a $p_i$ value with probability $w_i$, and assume all Bernoulli trials have this probability. Clearly, for such an approximation to work well, most of the $p_i$ values should be similar to each other. This basically means that @wolfies' uniform distribution of values, $p_i = \frac{i}{17}$, results in a woefully bad approximation when using the beta mixing distribution. This also explains why the approximation is much better for $p_i = \frac{\sqrt{i}}{17}$: they are less spread out.
The mixing then uses the observed $p_i$ to average over all possible choices of a single $p$. Now because "mixing" is like a weighted average, it cannot possibly do any better than using the single best $p$. So if the $p_i$ are sufficiently spread out, there can be no single $p$ that could provide a good approximation to all $p_i$.
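A small numerical sketch of this point, under the assumption of equal weights $w_i = \frac{1}{n}$ (the discrete uniform mixing density): pick one $p_i$ at random, pretend every trial uses it, and average the resulting binomial probabilities. The function names are hypothetical and $p_i = i/17$ is the uniform case from this thread.

```python
from math import comb


def binom_pmf(s, n, q):
    """Ordinary binomial probability of s successes in n trials."""
    return comb(n, s) * q ** s * (1.0 - q) ** (n - s)


def mixture_pmf(s, p, weights=None):
    """Choose one p_i with weight w_i and assume all trials share it."""
    n = len(p)
    if weights is None:
        weights = [1.0 / n] * n      # discrete uniform over the observed p_i
    return sum(w * binom_pmf(s, n, q) for w, q in zip(weights, p))


p = [i / 17 for i in range(1, 17)]
mixed = mixture_pmf(9, p)
best_single = max(binom_pmf(9, len(p), q) for q in p)
print(mixed, best_single)
```

The mixture is a weighted average of single-$p$ binomials, so it can never exceed the best of them, and for these widely spread $p_i$ it falls well short of the exact value of about 0.198.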
One thing I did say in my other answer was that it may be better to use a mixture of beta distributions over a restricted range, but this still won't help here because this is still mixing over a single $p$. What makes more sense is to split the interval $(0,1)$ up into pieces and have a binomial within each piece. For example, we could choose a set of cut points as our splits and fit nine binomials, one within each range of probability. Basically, within each split, we would fit a simple approximation, such as using a binomial with probability equal to the average of the $p_i$ in that range. If we make the intervals small enough, the approximation becomes arbitrarily good. But note that all this does is leave us with having to deal with a sum of independent binomial trials with different probabilities, instead of Bernoulli trials. However, the previous part of this answer showed that we can do the exact calculations provided that the number of binomials is sufficiently small, say 10-15 or so.
To extend the Bernoulli-based answer to a binomial-based one, we simply "re-interpret" what the $Z_i$ variables are. We simply state that $Z_i = I(X_i > 0)$: this reduces to the original Bernoulli-based $Z_i$ but now says which binomials the successes are coming from. So the case $(Z_1 = 0, Z_2 = 0, Z_3 = 1)$ now means that all the "successes" come from the third binomial, and none from the first two.
Note that this is still "exponential" in that the number of calculations is something like $k^g$, where $g$ is the number of binomials and $k$ is the group size, so you have $Y \approx \sum_{j=1}^{g} X_j$ where $X_j \sim Bin(k, p_j)$. But this is better than the $2^{gk}$ that you'd be dealing with by using Bernoulli random variables. For example, suppose we split the $n$ probabilities into $g$ groups with $k$ probabilities in each group. This gives on the order of $k^g$ calculations, compared to $2^{gk} = 2^n$.
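The grouping approximation itself can be sketched as follows. This is an illustration under stated assumptions, not the answerer's method: the probabilities are sorted and cut into $g$ equal-sized groups, each group is replaced by a binomial with the group's average probability, and the group distributions are combined by convolution rather than by the enumeration described above. All function names are hypothetical, and $g$ is assumed to divide $n$.

```python
from math import comb


def binom_pmf_vector(k, q):
    """pmf of Binomial(k, q) as a list indexed by the number of successes."""
    return [comb(k, s) * q ** s * (1.0 - q) ** (k - s) for s in range(k + 1)]


def convolve(a, b):
    """Distribution of the sum of two independent counts with pmfs a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def grouped_binomial_pmf(p, g):
    """Approximate pmf of the sum: one binomial per group of similar p_i."""
    p = sorted(p)
    k = len(p) // g                  # group size; assumes g divides len(p)
    pmf = [1.0]
    for j in range(g):
        group = p[j * k:(j + 1) * k]
        pmf = convolve(pmf, binom_pmf_vector(k, sum(group) / k))
    return pmf


p = [i / 17 for i in range(1, 17)]
for g in (1, 4, 16):
    print(g, grouped_binomial_pmf(p, g)[9])
```

With $g = 1$ this collapses to a single binomial, and with $g = n$ every group holds one trial, so the result is exact; intermediate values of $g$ trade accuracy for fewer distinct probabilities.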
By choosing the number of groups $g$ suitably, and keeping the total number of cells within the memory limit noted above, we can effectively use this method to increase the maximum feasible $n$.
If we make a cruder approximation, by lowering $g$, we will increase the "feasible" size for $n$: a smaller $g$ means that you can have a larger effective $n$. Beyond this the normal approximation should be extremely accurate.
For R code that is extremely efficient and handles much, much larger values of $n$, please see stats.stackexchange.com/a/41263. For instance, it solved a far larger instance of this problem, giving the full distribution, in under three seconds. (A comparable Mathematica 9 solution--see @wolfies' answer--also performs well for smaller $n$ but could not complete the execution with such a large value of $n$.)
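The (in general intractable) pmf is
$$\Pr(S = k) = \sum_{\substack{A \subset \{1,\dots,n\} \\ |A| = k}} \left(\prod_{i \in A} p_i\right)\left(\prod_{j \in \{1,\dots,n\} \setminus A} (1 - p_j)\right).$$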
p <- seq(1, 16) / 17
cat(p, "\n")
n <- length(p)
k <- 9
S <- seq(1, n)
A <- combn(S, k)
pr <- 0
for (i in 1:choose(n, k)) {
  pr <- pr + exp(sum(log(p[A[,i]])) + sum(log(1 - p[setdiff(S, A[,i])])))
}
cat("Pr(S = ", k, ") = ", pr, "\n", sep = "")
For the $p_i$'s used in wolfies' answer, we have:
Pr(S = 9) = 0.1982677
When $n$ grows, use a convolution.
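One way to read that advice, written as a recurrence: let $f_j(s)$ be the probability of $s$ successes among the first $j$ trials. The notation $f_j$ is introduced here for illustration and is not from the answer.

```latex
f_0(0) = 1, \qquad
f_j(s) = (1 - p_j)\, f_{j-1}(s) + p_j\, f_{j-1}(s - 1), \qquad
\Pr(S = k) = f_n(k)
```

Here $f_{j-1}(-1)$ and $f_{j-1}(j)$ are taken to be $0$. Each step convolves the running distribution with the two-point distribution of trial $j$, so the full pmf costs on the order of $n^2$ operations instead of a sum over $\binom{n}{k}$ subsets.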
Compare the R code in the solution to the same problem (with different values of the $p_i$) at stats.stackexchange.com/a/41263. The problem here is solved in 0.00012 seconds total computation time (estimated by solving it 1000 times) compared to 0.53 seconds (estimated by solving it once) for this R code and 0.00058 seconds using Wolfies' Mathematica code (estimated by solving it 1000 times).