Интервал прогнозирования для биномиальной случайной величины

Какова формула (приблизительная или точная) для интервала предсказания для биномиальной случайной величины?

Предположим, что $Y \sim \mathsf{Binom}(n, p)$ , и мы наблюдаем $y$ (взятый из $Y$ ). $n$ известно.

Наша цель состоит в том, чтобы получить интервал прогнозирования на 95% для нового розыгрыша от $Y$ .

Оценка точка $n\hat{p}$ , где $\hat{p}=\frac{y}{n}$ . Доверительный интервал для является простым, но я не могу найти формулу для интервала прогнозирования для. Если бы мы знали(а не то интервал прогнозирования 95% просто предполагает нахождение квантилей бинома. Есть ли что-то очевидное, что я пропускаю? $\hat{p}$ $Y$ $p$ $\hat{p}$

confidence-interval binomial prediction-interval

— Statseeker
источник

См. Какие не байесовские методы существуют для прогнозного вывода? , В этом случае метод с использованием сводок недоступен (я не думаю), но вы можете использовать одну из предиктивных вероятностей. Или, конечно, байесовский подход.

— Scortchi - Восстановить Монику

Привет, ребята, я хотел бы воспользоваться моментом для решения проблем, которые были подняты. - относительно доверия для p: я не заинтересован в этом. - в отношении прогнозирования, составляющего 95% распределения: да, это именно то, что интервалы прогнозирования являются независимыми от контекста (в регрессии вы должны допускать нормальные ошибки, где в качестве доверительных интервалов используется CLT - да, пример прогнозирования количества голов в бросок монеты - это правильно. Что делает эту проблему трудной, так это то, что у нас нет «р», а есть оценка.

— Statseeker

@ Аддисон. Прочитайте книгу «Статистические интервалы» Дж. Хана и В. Микера. Они объясняют разницу между доверительными интервалами, интервалами прогнозирования, интервалами допуска и байесовскими достоверными интервалами. Интервал прогнозирования 95% не содержит 95% распределения. Это делает то, что делают самые частые интервалы. Если вы неоднократно производите выборку из B (n, p) и каждый раз используете один и тот же метод для получения 95% интервала прогнозирования для p, то в 95% интервалов прогнозирования вы будете содержать истинное значение p. Если вы хотите покрыть 95% распределения, создайте интервал допуска.

— Майкл Р. Черник

Интервалы допусков покрывают процент от распределения. Для интервала допуска 95% для 90% распределения вы снова повторяете процесс много раз и используете один и тот же метод для генерации интервала каждый раз, тогда примерно в 95% случаев по меньшей мере 90% распределения попадет в интервал и в 5% случаев в интервале будет содержаться менее 90% распределения.

— Майкл Р. Черник

Lawless & Fredette (2005), «Интервалы прогнозирования для частоты и прогнозирующие распределения», Biometrika , 92 , 3 - еще одна хорошая ссылка, в дополнение к ссылкам, которые я дал.

— Scortchi - Восстановить Монику

Хорошо, давайте попробуем это. Я дам два ответа - байесовский, который, на мой взгляд, простой и естественный, и один из возможных частых.

Байесовское решение

Мы предполагаем , бета перед на , я, е., , так как модель Бета-биномиальное сопряжена, что означает , что задняя распределение также бета - распределение с параметрами , (я использую , чтобы обозначить число успехов в испытаниях, вместо того , ). Таким образом, вывод значительно упрощается. Теперь, если у вас есть некоторые предварительные знания о вероятных значениях $p$ $p \sim Beta(\alpha,\beta)$ $\hat{\alpha}=\alpha+k,\hat{\beta}=\beta+n-k$ $k$ $n$ $y$ $p$ , you could use it to set the values of $\alpha$ and $\beta$ , i.e., to define your Beta prior, otherwise you could assume a uniform (noninformative) prior, with $\alpha=\beta=1$ , or other noninformative priors (see for example here). In any case, your posterior is

$Pr(p|n,k)=Beta(\alpha+k,\beta+n-k)$

В байесовском умозаключении все, что имеет значение, это апостериорная вероятность, означающая, что, как только вы это знаете, вы можете делать выводы для всех других величин в вашей модели. Вы хотите сделать вывод о наблюдаемых : в частности, о векторе новых результатов , где не обязательно равно . В частности, для каждого мы хотим вычислить вероятность достижения именно успехов в следующих испытаниях, учитывая, что мы получили $y$ $\mathbf{y}=y_1,\dots,y_m$ $m$ $n$ $j=0,\dots,m$ $j$ $m$ $k$ успехи в предыдущих испытаниях; задняя предиктивная функция массы: $n$

However, our Binomial model for $Y$ means that, conditionally on $p$ having a certain value, the probability of having $j$ successes in $m$ trials doesn't depend on past results: it's simply

$f(j|m,p)=\binom{j}{m} p^j(1-p)^j$

Thus the expression becomes

$Pr(j|m,n,k)=\int_0^1 \binom{j}{m} p^j(1-p)^j Pr(p|n,k)dp=\int_0^1 \binom{j}{m} p^j(1-p)^j Beta(\alpha+k,\beta+n-k)dp$

The result of this integral is a well-known distribution called the Beta-Binomial distribution: skipping the passages, we get the horrible expression

$Pr(j|m,n,k)=\frac{m!}{j!(m-j)!}\frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+k)\Gamma(\beta+n-k)}\frac{\Gamma(\alpha+k+j)\Gamma(\beta+n+m-k-j)}{\Gamma(\alpha+\beta+n+m)}$

Our point estimate for $j$ , given quadratic loss, is of course the mean of this distribution, i.e.,

$\mu=\frac{m(\alpha+k)}{(\alpha+\beta+n)}$

Now, let's look for a prediction interval. Since this is a discrete distribution, we don't have a closed form expression for $[j_1,j_2]$ , such that $Pr(j_1\leq j \leq j_2)= 0.95$ . The reason is that, depending on how you define a quantile, for a discrete distribution the quantile function is either not a function or is a discontinuous function. But this is not a big problem: for small $m$ , you can just write down the $m$ probabilities $Pr(j=0|m,n,k),Pr(j\leq 1|m,n,k),\dots,Pr(j \leq m-1|m,n,k)$ and from here find $j_1,j_2$ such that

$Pr(j_1\leq j \leq j_2)=Pr(j\leq j_2|m,n,k)-Pr(j < j_1|m,n,k)\geq 0.95$

Of course you would find more than one couple, so you would ideally look for the smallest $[j_1,j_2]$ such that the above is satisfied. Note that

$Pr(j=0|m,n,k)=p_0,Pr(j\leq 1|m,n,k)=p_1,\dots,Pr(j \leq m-1|m,n,k)=p_{m-1}$

are just the values of the CMF (Cumulative Mass Function) of the Beta-Binomial distribution, and as such there is a closed form expression, but this is in terms of the generalized hypergeometric function and thus is quite complicated. I'd rather just install the R package extraDistr and call pbbinom to compute the CMF of the Beta-Binomial distribution. Specifically, if you want to compute all the probabilities $p_0,\dots,p_{m-1}$ in one go, just write:

library(extraDistr)  
jvec <- seq(0, m-1, by = 1) 
probs <- pbbinom(jvec, m, alpha = alpha + k, beta = beta + n - k)

where alpha and beta are the values of the parameters of your Beta prior, i.e., $\alpha$ and $\beta$ (thus 1 if you're using a uniform prior over $p$ ). Of course it would all be much simpler if R provided a quantile function for the Beta-Binomial distribution, but unfortunately it doesn't.

Practical example with the Bayesian solution

Let $n=100$ , $k=70$ (thus we initially observed 70 successes in 100 trials). We want a point estimate and a 95%-prediction interval for the number of successes $j$ in the next $m=20$ trials. Then

n <- 100
k <- 70
m <- 20
alpha <- 1
beta  <- 1

where I assumed a uniform prior on $p$ : depending on the prior knowledge for your specific application, this may or may not be a good prior. Thus

bayesian_point_estimate <- m * (alpha + k)/(alpha + beta + n) #13.92157

Clearly a non-integer estimate for $j$ doesn't make sense, so we could just round to the nearest integer (14). Then, for the prediction interval:

jvec <- seq(0, m-1, by = 1)
library(extraDistr)
probabilities <- pbbinom(jvec, m, alpha = alpha + k, beta = beta + n - k)

The probabilities are

> probabilities
 [1] 1.335244e-09 3.925617e-08 5.686014e-07 5.398876e-06
 [5] 3.772061e-05 2.063557e-04 9.183707e-04 3.410423e-03
 [9] 1.075618e-02 2.917888e-02 6.872028e-02 1.415124e-01
[13] 2.563000e-01 4.105894e-01 5.857286e-01 7.511380e-01
[17] 8.781487e-01 9.546188e-01 9.886056e-01 9.985556e-01

For an equal-tail probabilities interval, we want the smallest $j_2$ such that $Pr(j\leq j_2|m,n,k)\ge 0.975$ and the largest $j_1$ such that $Pr(j < j_1|m,n,k)=Pr(j \le j_1-1|m,n,k)\le 0.025$ . This way, we will have

$Pr(j_1\leq j \leq j_2|m,n,k)=Pr(j\leq j_2|m,n,k)-Pr(j < j_1|m,n,k)\ge 0.975-0.025=0.95$

Thus, by looking at the above probabilities, we see that $j_2=18$ and $j_1=9$ . The probability of this Bayesian prediction interval is 0.9778494, which is larger than 0.95. We could find shorter intervals such that $Pr(j_1\leq j \leq j_2|m,n,k)\ge 0.95$ , but in that case at least one of the two inequalities for the tail probabilities wouldn't be satisfied.

Frequentist solution

I'll follow the treatment of Krishnamoorthy and Peng, 2011. Let $Y\sim Binom(m,p)$ and $X\sim Binom(n,p)$ be independently Binominally distributed. We want a $1-2\alpha-$ prediction interval for $Y$ , based on a observation of $X$ . In other words we look for $I=[L(X;n,m,\alpha),U(X;n,m,\alpha)]$ such that:

$Pr_{X,Y}(Y\in I)=Pr_{X,Y}(L(X;n,m,\alpha)\leq Y\leq U(X;n,m,\alpha)]\geq 1-2\alpha$

The " $\geq 1-2\alpha$ " is due to the fact that we are dealing with a discrete random variable, and thus we cannot expect to get exact coverage...but we can look for an interval which has always at least the nominal coverage, thus a conservative interval. Now, it can be proved that the conditional distribution of $X$ given $X+Y=k+j=s$ is hypergeometric with sample size $s$ , number of successes in the population $n$ and population size $n+m$ . Thus the conditional pmf is

$Pr(X=k|X+Y=s,n,n+m)=\frac{\binom{n}{k}\binom{m}{s-k}}{\binom{m+n}{s}}$

The conditional CDF of $X$ given $X+Y=s$ is thus

$Pr(X\leq k|s,n,n+m)=H(k;s,n,n+m)=\sum_{i=0}^k\frac{\binom{n}{i}\binom{m}{s-i}}{\binom{m+n}{s}}$

The first great thing about this CDF is that it doesn't depend on $p$ , which we don't know. The second great thing is that it allows to easily find our PI: as a matter of fact, if we observed a value $k$ of X, then the $1-\alpha$ lower prediction limit is the smallest integer $L$ such that

$Pr(X\geq k|k+L,n,n+m)=1-H(k-1;k+L,n,n+m)>\alpha$

correspondingly, the the $1-\alpha$ upper prediction limit is the largest integer such that

$Pr(X\leq k|k+U,n,n+m)=H(k;k+U,n,n+m)>\alpha$

Thus, $[L,U]$ is a prediction interval for $Y$ of coverage at least $1-2\alpha$ . Note that when $p$ is close to 0 or 1, this interval is conservative even for large $n$ , $m$ , i.e., its coverage is quite larger than $1-2\alpha$ .

Practical example with the Frequentist solution

Same setting as before, but we don't need to specify $\alpha$ and $\beta$ (there are no priors in the Frequentist framework):

n <- 100
k <- 70
m <- 20

The point estimate is now obtained using the MLE estimate for the probability of successes, $\hat{p}=\frac{k}{n}$ , which in turns leads to the following estimate for the number of successes in $m$ trials:

frequentist_point_estimate <- m * k/n #14

For the prediction interval, the procedure is a bit different. We look for the largest $U$ such that $Pr(X\leq k|k+U,n,n+m)=H(k;k+U,n,n+m)>\alpha$ , thus let's compute the above expression for all $U$ in $[0,m]$ :

jvec <- seq(0, m, by = 1)
probabilities <- phyper(k,n,m,k+jvec)

We can see that the largest $U$ such that the probability is still larger than 0.025 is

jvec[which.min(probabilities > 0.025) - 1] # 18

Same as for the Bayesian approach. The lower prediction bound $L$ is the smallest integer such that $Pr(X\geq k|k+L,n,n+m)=1-H(k-1;k+L,n,n+m)>\alpha$ , thus

probabilities <- 1-phyper(k-1,n,m,k+jvec)
jvec[which.max(probabilities > 0.025) - 1] # 8

Thus our frequentist "exact" prediction interval is $[L,U]=[8,18]$ .

— DeltaIV
источник