Why is Poisson regression used for count data?

33

I understand that for certain datasets, such as voting data, it performs better. Why is Poisson regression used over ordinary linear regression or logistic regression? What is the mathematical motivation for it?

count-data poisson-regression

— zaxtax

See my answer to this post for another perspective: stats.stackexchange.com/questions/142338/…

— kjetil b halvorsen

51

Poisson distributed data is intrinsically integer-valued, which makes sense for count data. Ordinary Least Squares (OLS, which you call "linear regression") assumes that true values are normally distributed around the expected value and can take any real value, positive or negative, integer or fractional, whatever. Finally, logistic regression only works for data that is 0-1-valued (TRUE-FALSE-valued), like "has a disease" versus "doesn't have the disease". Thus, the Poisson distribution makes the most sense for count data.

That said, a normal distribution is often a rather good approximation to a Poisson one for data with a mean above 30 or so. And in a regression framework, where you have predictors influencing the count, an OLS with its normal distribution may be easier to fit and would actually be more general, since the Poisson distribution and regression assume that the mean and the variance are equal, while OLS can deal with unequal means and variances - for a count data model with different means and variances, one could use a negative binomial distribution, for instance.
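To get a feel for the "mean above 30 or so" rule of thumb, here is a rough check using only the Python standard library. It compares a Poisson distribution with a normal distribution of the same mean and variance, binned to integers with a continuity correction. The function names and the choice of total variation distance are my own assumptions for illustration, not anything taken from the answer.

```python
import math
from statistics import NormalDist


def poisson_pmf(k, lam):
    # Computed on the log scale to avoid overflow for large k.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def normal_mass_below_zero(lam):
    # Probability that the N(lam, lam) approximation assigns to
    # impossible negative counts (everything below -0.5).
    return NormalDist(mu=lam, sigma=math.sqrt(lam)).cdf(-0.5)


def total_variation(lam):
    # Distance between Poisson(lam) and N(lam, lam) binned to integers.
    normal = NormalDist(mu=lam, sigma=math.sqrt(lam))
    upper = int(lam + 20 * math.sqrt(lam)) + 50  # far into the upper tail
    diff = normal.cdf(-0.5)  # normal mass with no Poisson counterpart
    for k in range(upper + 1):
        bin_mass = normal.cdf(k + 0.5) - normal.cdf(k - 0.5)
        diff += abs(poisson_pmf(k, lam) - bin_mass)
    return 0.5 * diff


for lam in (2, 30, 100):
    print(lam, total_variation(lam), normal_mass_below_zero(lam))
```

For a small mean the normal approximation puts a noticeable share of its probability on negative counts; for a mean around 30 that share is negligible.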

— S. Kolassa - Reinstate Monica

17

Note that just fitting using OLS doesn't require normality; it's when you do inference on the parameters that you need the normal distribution assumption.

— Dason

1

@Dason: I stand corrected.

— S. Kolassa - Reinstate Monica

3

If you use the Huber/White/Sandwich estimator of variance, you can relax the mean-variance assumption
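A minimal sketch of what that looks like, assuming a Poisson regression with one predictor and a log link, fitted by Newton-Raphson in plain Python. The data are made up and deliberately overdispersed, all function names are hypothetical, and the sandwich shown is the basic (HC0) form with no small-sample correction.

```python
import math

# Made-up, deliberately overdispersed counts (an assumption for illustration):
# zeros alternate with large values, so the variance far exceeds the mean.
x = list(range(20))
y = [0 if i % 2 == 0 else 10 + i for i in x]


def fit_poisson(x, y, iterations=25):
    """Newton-Raphson for log(mu) = b0 + b1 * x; returns coefficients and fitted means."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        u0 = sum(yi - mi for yi, mi in zip(y, mu))            # score, intercept
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))  # score, slope
        i00 = sum(mu)                                         # Fisher information
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    return (b0, b1), mu


def weighted_cross_products(x, weights):
    """Return X' diag(weights) X for the design matrix with columns (1, x)."""
    s0 = sum(weights)
    s1 = sum(xi * w for xi, w in zip(x, weights))
    s2 = sum(xi * xi * w for xi, w in zip(x, weights))
    return [[s0, s1], [s1, s2]]


def inverse2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


(b0, b1), mu = fit_poisson(x, y)
# Model-based covariance: trusts the Poisson assumption variance = mean.
bread = inverse2(weighted_cross_products(x, mu))
# "Meat": uses the observed squared residuals instead of the assumed variance.
meat = weighted_cross_products(x, [(yi - mi) ** 2 for yi, mi in zip(y, mu)])
sandwich = matmul2(matmul2(bread, meat), bread)

model_se = [math.sqrt(bread[j][j]) for j in range(2)]
robust_se = [math.sqrt(sandwich[j][j]) for j in range(2)]
print("model-based SE:", model_se)
print("sandwich SE:   ", robust_se)
```

The coefficient estimates are the same either way; only the standard errors change. On data like these the model-based ones come out too small.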

— Dimitriy V. Masterov

@Dason Although it isn't required, using the correct model form for what you're fitting almost always gives a better estimate, and you can see this in the residual plots.

— Joe

24

Essentially, it's because linear and logistic regression make the wrong kinds of assumptions about what count outcomes look like. Imagine your model as a very stupid robot that will relentlessly follow your orders, no matter how nonsensical those orders are; it completely lacks the ability to evaluate what you tell it. If you tell your robot that something like votes is distributed continuously from negative infinity to infinity, that's what it believes votes are like, and it may give you nonsensical predictions (Ross Perot will receive -10,469 votes in the upcoming election).

Conversely, the Poisson distribution is discrete and positive (or zero... zero counts as positive, yes?). At a very minimum, this will force your robot to give you answers that could actually happen in real life. They may or may not be good answers, but they will at least be drawn from the possible set of "number of votes cast".
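Here is a small sketch of the robot problem, using made-up counts (my assumption, not real vote data) and plain Python. It fits a straight line by least squares and a Poisson regression with a log link by Newton-Raphson, then compares the predictions. The function names are hypothetical.

```python
import math

# Made-up toy data (an assumption for illustration): counts that grow with x.
x = list(range(10))
y = [0, 1, 0, 2, 1, 3, 5, 8, 13, 21]


def fit_ols(x, y):
    """Ordinary least squares for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_poisson(x, y, iterations=25):
    """Newton-Raphson for the Poisson model log(mu) = b0 + b1 * x."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        u0 = sum(yi - mi for yi, mi in zip(y, mu))            # score, intercept
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))  # score, slope
        i00 = sum(mu)                                         # Fisher information
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1


a, b = fit_ols(x, y)
b0, b1 = fit_poisson(x, y)
ols_pred = [a + b * xi for xi in x]
poisson_pred = [math.exp(b0 + b1 * xi) for xi in x]

for xi, yi, po, pp in zip(x, y, ols_pred, poisson_pred):
    print(xi, yi, round(po, 2), round(pp, 2))
```

On these numbers the straight line predicts a negative count at the low end of x, while the Poisson fit, being an exponential of a linear predictor, can only produce positive expected counts.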

Of course, the Poisson has its own problems: it assumes that the mean of the vote count variable will also be equal to its variance. I don't know if I've ever actually seen a non-contrived example where this was true. Fortunately, bright people have come up with other distributions that are also positive and discrete, but that add parameters to allow the variance to, er, vary (e.g., negative binomial regression).
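To make the extra parameter concrete, here is a sketch comparing a Poisson and a negative binomial distribution with the same mean. I am assuming the common "NB2" parameterisation, in which the variance is mean + mean²/size. The parameter name `size` and the numbers are illustrative choices, not something from the answer.

```python
import math


def poisson_pmf(k, mean):
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def negbin_pmf(k, mean, size):
    # NB2 parameterisation (an assumption): variance = mean + mean**2 / size.
    p = size / (size + mean)
    log_coef = math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
    return math.exp(log_coef + size * math.log(p) + k * math.log(1 - p))


def mean_and_variance(pmf, upper=500):
    # Numerically sum over k = 0..upper; the tail beyond that is negligible here.
    probs = [pmf(k) for k in range(upper + 1)]
    m = sum(k * p for k, p in enumerate(probs))
    v = sum((k - m) ** 2 * p for k, p in enumerate(probs))
    return m, v


MEAN, SIZE = 4.0, 2.0  # illustrative values

pois_mean, pois_var = mean_and_variance(lambda k: poisson_pmf(k, MEAN))
nb_mean, nb_var = mean_and_variance(lambda k: negbin_pmf(k, MEAN, SIZE))

print("Poisson:           mean", pois_mean, "variance", pois_var)
print("Negative binomial: mean", nb_mean, "variance", nb_var)
```

Both distributions live on the non-negative integers and share the same mean, but the negative binomial's variance is larger by mean²/size, which is the extra freedom the Poisson lacks.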

— Matt Parker

5

Mathematically, if you start with the simple assumption that the probability of an event occurring in a defined interval $T = 1$ is $\lambda$, you can show that the expected number of events in the interval $T = t$ is $\lambda t$, the variance is also $\lambda t$, and the probability distribution is

$p(N=n) = \frac{(\lambda t)^{n}e^{-\lambda t}}{n!}$

Via this and the maximum likelihood method & generalised linear models (or some other method) you arrive at Poisson regression.
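To spell out that step a little, here is a sketch under the usual assumptions: a log link, independent observations $i = 1, \dots, m$, predictor vectors $x_i$, coefficients $\beta$, and exposures $t_i$. The symbols $x_i$, $\beta$, $\mu_i$ and the index $i$ are my notation, not the answer's. The rate is allowed to depend on the predictors:

$$\lambda_i = \exp(x_i^\top \beta), \qquad \mu_i = \lambda_i t_i$$

Taking logs of the probability above and summing over observations gives the log-likelihood:

$$\ell(\beta) = \sum_{i=1}^{m} \left[ n_i \log \mu_i - \mu_i - \log n_i! \right]$$

Since $\partial \mu_i / \partial \beta = \mu_i x_i$, setting the derivative to zero gives the estimating equations:

$$\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} (n_i - \mu_i)\, x_i = 0$$

These have no closed-form solution in general and are solved iteratively, which is what generalised linear model software does.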

In simple terms Poisson Regression is the model that fits the assumptions of the underlying random process generating a small number of events at a rate (i.e. number per unit time) determined by other variables in the model.

— Thylacoleo

3

Others have basically said the same thing I'm going to, but I thought I'd add my take on it. It depends on what exactly you're doing, but a lot of the time we like to conceptualize the problem/data at hand. This is a slightly different approach from just building a model that predicts pretty well. If we are trying to conceptualize what's going on, it makes sense to model count data using a non-negative distribution that only puts mass at integer values. We also have many results that essentially boil down to saying that, under certain conditions, count data really is distributed as a Poisson. So if our goal is to conceptualize the problem, it really makes sense to use a Poisson for the response variable. Others have pointed out other reasons why it's a good idea, but if you're really trying to understand how the data you see could have been generated, then using a Poisson regression makes a lot of sense in some situations.

— Dason

2

My understanding is that it's primarily because counts are always non-negative and discrete, and the Poisson can summarize such data with one parameter. The main catch is that the variance equals the mean.