How to find values not given in (interpolate in) statistical tables?

People often use programs to obtain p-values, but sometimes, for whatever reason, it may be necessary to get a critical value from a set of tables.

Given a statistical table with a limited number of significance levels and a limited number of degrees of freedom, how do I obtain approximate critical values at other significance levels or degrees of freedom (e.g. with $t$, chi-square or $F$ tables)?

That is, how do I find the values "in between" the values in a table?

— Glen_b -Reinstate Monica

This answer is in two main parts: first, using linear interpolation, and second, using transformations for more accurate interpolation. The approaches discussed here are suitable for hand calculation when you have limited tables available, but if you're implementing a computer routine to produce p-values, there are much better approaches (albeit tedious when done by hand) that should be used instead.

If you knew that the 10% (one-tailed) critical value for a z-test was 1.28 and the 20% critical value was 0.84, a rough guess at the 15% critical value would be halfway between them: (1.28 + 0.84) / 2 = 1.06 (the actual value is 1.0364). The 12.5% value could then be guessed as halfway between that and the 10% value: (1.28 + 1.06) / 2 = 1.17 (actual value 1.15+). This is exactly what linear interpolation does, except that instead of "halfway between" it looks at any fraction of the way between two values.

Univariate linear interpolation

Let's look at the case of simple linear interpolation.

So we have some function (say of $x$) that we think is approximately linear near the value we're trying to approximate, and we have a value of the function on either side of the value we want, for example, like so:

$\begin{array}{ c c } x & y\\ 8 & 9.3\\ 16 & y_{16}\\ 20 & 15.6\\ \end{array}$

The two $x$ values we know are 12 (20-8) apart. See how the $x$-value of 16 (the one we want an approximate $y$-value for) divides that difference of 12 in the ratio 8:4 (16-8 and 20-16)? That is, it's 2/3 of the distance from the first $x$-value to the last. If the relationship were linear, the corresponding range of $y$-values would be divided in the same ratio.

[Figure: linear interpolation]

So $\frac{y_{16} - 9.3}{15.6 - 9.3}$ should be about the same as $\frac{16-8}{20-8}$.

That is, $\frac{y_{16} - 9.3}{15.6 - 9.3} \approx \frac{16-8}{20-8}$

Rearranging:

$y_{16} \approx 9.3 + (15.6 - 9.3) \frac{16-8}{20-8} = 13.5$
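The rearranged formula translates directly into a small helper. The following is a minimal sketch, not part of the original answer; the function name `lerp` and its argument order are assumptions chosen for illustration.

```python
def lerp(x, x0, y0, x1, y1):
    """Linearly interpolate y at x between the points (x0, y0) and (x1, y1)."""
    fraction = (x - x0) / (x1 - x0)  # how far x lies along the x-range
    return y0 + (y1 - y0) * fraction  # move the same fraction along the y-range


# The worked example above: x = 16 lies 2/3 of the way from 8 to 20.
y16 = lerp(16, 8, 9.3, 20, 15.6)
print(f"{y16:.1f}")  # 13.5
```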

An example with statistical tables: suppose we have a t-table with the following critical values for 12 df:

$\begin{array}{ c c } (2\text{-tail})& \\ α & t\\ 0.01 & 3.05\\ 0.02 & 2.68\\ 0.05 & 2.18\\ 0.10 & 1.78 \end{array}$

We want the critical value of t with 12 df and a two-tailed alpha of 0.025. That is, we interpolate between the 0.02 and 0.05 rows of that table:

$\begin{array}{ c c } α & t\\ 0.02 & 2.68\\ 0.025 & \text{?}\\ 0.05 & 2.18\\ \end{array}$

The value at "$\text{?}$" is the $t_{0.025}$ value that we wish to approximate by linear interpolation. (By $t_{0.025}$ I actually mean the $1-0.025/2$ point of the inverse cdf of a $t_{12}$ distribution.)

As before, $0.025$ divides the interval from $0.02$ to $0.05$ in the ratio $(0.025-0.02)$ to $(0.05-0.025)$ (i.e. $1:5$), and the unknown $t$-value should divide the $t$ range from $2.68$ to $2.18$ in the same ratio. Equivalently, $0.025$ occurs $(0.025-0.02)/(0.05-0.02) = 1/6$ of the way along the $x$-range, so the unknown $t$-value should occur $1/6$ of the way along the $t$-range.

That is, $\frac{t_{0.025}-2.68}{2.18-2.68} \approx \frac{0.025-0.02}{0.05-0.02}$, or equivalently

$t_{0.025} \approx 2.68 + (2.18-2.68) \frac{0.025-0.02}{0.05-0.02} = 2.68 - 0.5 \frac{1}{6} \approx 2.60$

The actual answer is $2.56$, so this is not particularly close, because the function we're approximating isn't very close to linear in that range (nearer $\alpha = 0.5$ it is).
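As a quick check of the arithmetic, here is a minimal sketch of the same linear interpolation between the 0.02 and 0.05 rows of the 12 df table. The helper `lerp` is a hypothetical name introduced for illustration; only the table values come from the text.

```python
def lerp(x, x0, y0, x1, y1):
    """Linearly interpolate y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Two-tailed critical values for 12 df, taken from the table above.
alpha_lo, t_lo = 0.02, 2.68
alpha_hi, t_hi = 0.05, 2.18

# alpha = 0.025 lies 1/6 of the way from 0.02 to 0.05.
t_linear = lerp(0.025, alpha_lo, t_lo, alpha_hi, t_hi)
print(f"{t_linear:.2f}")  # 2.60
```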

[Figure: linear interpolation of critical value in t-tables]

Better approximations via transformation

We can replace linear interpolation by other functional forms; in effect, we transform to a scale where linear interpolation works better. In this case, in the tail, many tabulated critical values are more nearly linear in the $\log$ of the significance level. After we take $\log$s, we simply apply linear interpolation as before. Let's try that on the example above:

$\begin{array}{ccc} α & \log(α) & t\\ 0.02 & -3.912 & 2.68\\ 0.025 & -3.689 & t_{0.025}\\ 0.05 & -2.996 & 2.18\\ \end{array}$

Now

$\begin{aligned} \frac{t_{0.025}-2.68}{2.18-2.68} &\approx \frac{\log(0.025)-\log(0.02)}{\log(0.05)-\log(0.02)} \\ &= \frac{-3.689 - (-3.912)}{-2.996 - (-3.912)} \end{aligned}$

or equivalently

$\begin{aligned} t_{0.025} &\approx 2.68 + (2.18-2.68) \frac{-3.689 - (-3.912)}{-2.996 - (-3.912)}\\ &= 2.68 - 0.5 \cdot 0.243 \approx 2.56 \end{aligned}$

Which is correct to the quoted number of figures. This is because - when we transform the x-scale logarithmically - the relationship is almost linear:

[Figure: linear interpolation in log alpha]
Indeed, visually the curve (grey) lies neatly on top of the straight line (blue).
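The log-scale calculation can be sketched by transforming the significance levels before interpolating. This is an illustrative sketch only; the function name `interp_transformed` and its `transform` parameter are assumptions, not something from the original answer.

```python
import math


def interp_transformed(x, x0, y0, x1, y1, transform=math.log):
    """Interpolate linearly in transform(x) rather than in x itself."""
    u, u0, u1 = transform(x), transform(x0), transform(x1)
    return y0 + (y1 - y0) * (u - u0) / (u1 - u0)


# Same 12 df table rows as before: alpha = 0.02 and alpha = 0.05.
t_log = interp_transformed(0.025, 0.02, 2.68, 0.05, 2.18)
print(f"{t_log:.2f}")  # 2.56
```

Passing a different `transform` makes it easy to compare scales for the same pair of table rows.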

In some cases, the logit of the significance level ( $\text{logit}(\alpha)=\log(\frac{α}{1-α})=\log(\frac{1}{1-α}-1)$ ) may work well over a wider range but is usually not necessary (we usually only care about accurate critical values when $\alpha$ is small enough that $\log$ works quite well).
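For comparison, the same rows can be interpolated on the logit scale. This is a minimal sketch under the same assumptions as the hand calculations above; the names `logit` and `interp_logit` are hypothetical helpers introduced here.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))


def interp_logit(alpha, alpha0, t0, alpha1, t1):
    """Interpolate a critical value linearly in logit(alpha)."""
    u, u0, u1 = logit(alpha), logit(alpha0), logit(alpha1)
    return t0 + (t1 - t0) * (u - u0) / (u1 - u0)


# 12 df, two-tailed: rows alpha = 0.02 (t = 2.68) and alpha = 0.05 (t = 2.18).
t_logit = interp_logit(0.025, 0.02, 2.68, 0.05, 2.18)
print(f"{t_logit:.2f}")  # 2.56
```

At these small significance levels the logit and log results agree to the quoted two decimals, which is consistent with the remark that the logit is usually not necessary.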

Interpolation across different degrees of freedom

$t$ , chi-square and $F$ tables also have degrees of freedom, where not every df ( $\nu$ -) value is tabulated. The critical values mostly $^\dagger$ aren't accurately represented by linear interpolation in the df. Indeed, often it's more nearly the case that the tabulated values are linear in the reciprocal of df, $1/\nu$ .

(In old tables you'd often see a recommendation to work with $120/\nu$ - the constant on the numerator makes no difference, but was more convenient in pre-calculator days because 120 has a lot of factors, so $120/\nu$ is often an integer, making the calculation a bit simpler.)

Here's how inverse interpolation performs on 5% critical values of $F_{4,\nu}$ between $\nu = 60$ and $120$ . That is, only the endpoints participate in the interpolation in $1/\nu$ . For example, to compute the critical value for $\nu=80$ , we take (and note that here $F$ represents the inverse of the cdf):

$F_{4,80,.95} \approx F_{4,60,.95} + \frac{1/80 - 1/60}{1/120 - 1/60} \cdot (F_{4,120,.95}-F_{4,60,.95})$

[Figure: inverse interpolation in df]

(Compare with diagram here)
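Interpolation in $1/\nu$ can be sketched as follows. This is an illustrative sketch, not the author's code: the function names are assumptions, and the endpoint critical values are left as arguments to be read from your own table rather than supplied here.

```python
def reciprocal_weight(nu, nu_lo, nu_hi):
    """Fraction of the way from nu_lo to nu_hi, measured on the 1/nu scale."""
    return (1 / nu - 1 / nu_lo) / (1 / nu_hi - 1 / nu_lo)


def interp_reciprocal_df(nu, nu_lo, crit_lo, nu_hi, crit_hi):
    """Interpolate a critical value linearly in 1/nu between two tabulated df."""
    w = reciprocal_weight(nu, nu_lo, nu_hi)
    return crit_lo + w * (crit_hi - crit_lo)


# Weight for nu = 80 between the tabulated df 60 and 120.
w_recip = reciprocal_weight(80, 60, 120)
w_linear = (80 - 60) / (120 - 60)  # plain linear interpolation in df, for contrast
print(f"{w_recip:.3f} {w_linear:.3f}")  # 0.500 0.333
```

For $\nu = 80$ the weight on the $1/\nu$ scale is exactly one half, so the interpolated value is simply the average of the tabulated values at 60 and 120 df, whereas linear interpolation in df would only move one third of the way.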

$^\dagger$ Mostly but not always. Here's an example where linear interpolation in df is better, and an explanation of how to tell from the table that linear interpolation is going to be accurate.

Here's a piece of a chi-squared table

            Probability less than the critical value
 df           0.90      0.95     0.975      0.99     0.999
______   __________________________________________________

 40         51.805    55.758    59.342    63.691    73.402
 50         63.167    67.505    71.420    76.154    86.661
 60         74.397    79.082    83.298    88.379    99.607
 70         85.527    90.531    95.023   100.425   112.317

Imagine we wish to find the 5% critical value (95th percentiles) for 57 degrees of freedom.

Looking closely, we see that the 5% critical values in the table progress almost linearly here:

(the green line joins the values for 50 and 60 df; you can see it touches the dots for 40 and 70)

So linear interpolation will do very well. But of course we don't have time to draw the graph; how to decide when to use linear interpolation and when to try something more complicated?

As well as the values either side of the one we seek, take the next nearest value (70 in this case). If the middle tabulated value (the one for df=60) is close to linear between the end values (50 and 70), then linear interpolation will be suitable. In this case the values are equispaced so it's especially easy: is $(x_{50,0.95}+x_{70,0.95})/2$ close to $x_{60,0.95}$ ?

We find that $(67.505+90.531)/2 = 79.018$ , which when compared to the actual value for 60 df, 79.082, we can see is accurate to almost three full figures, which is usually pretty good for interpolation, so in this case, you'd stick with linear interpolation; with the finer step for the value we need we would now expect to have effectively 3 figure accuracy.

So we get: $\frac{x-67.505}{79.082-67.505} \approx \frac{57-50}{60-50}$ or

$x\approx 67.505+(79.082-67.505)\cdot \frac{57-50}{60-50}\approx 75.61$ .

The actual value is 75.62375, so we indeed got 3 figures of accuracy and were only out by 1 in the fourth figure.
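The linearity check and the interpolation for 57 df can be sketched together. This is a minimal sketch; the dictionary `chi2_95` just holds the 0.95 column of the table above, and the variable names are assumptions chosen for illustration.

```python
# 95th percentiles of chi-squared, copied from the 0.95 column of the table above.
chi2_95 = {40: 55.758, 50: 67.505, 60: 79.082, 70: 90.531}

# Step 1: is the middle tabulated value close to the average of its neighbours?
midpoint = (chi2_95[50] + chi2_95[70]) / 2
rel_error = abs(midpoint - chi2_95[60]) / chi2_95[60]
print(f"{midpoint:.3f}")  # 79.018

# Step 2: the check looks fine, so interpolate linearly in df between 50 and 60.
fraction = (57 - 50) / (60 - 50)
x57 = chi2_95[50] + (chi2_95[60] - chi2_95[50]) * fraction
print(f"{x57:.2f}")  # 75.61
```

Here `rel_error` is below 0.1%, which matches the "almost three full figures" judgement made by eye above.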

More accurate interpolation still may be had by using methods of finite differences (in particular, via divided differences), but this is probably overkill for most hypothesis testing problems.

If your degrees of freedom go past the ends of your table, this question discusses that problem.

— Glen_b -Reinstate Monica