In Naive Bayes, why bother with Laplace smoothing when we have unknown words in the test set?

28

I was reading about Naive Bayes classification today. Under the heading Parameter Estimation with add 1 smoothing, I read:

Let $c$ refer to a class (such as Positive or Negative), and let $w$ refer to a token or word.

The maximum likelihood estimator for $P(w|c)$ is
$$\frac{\text{count}(w,c)}{\text{count}(c)} = \frac{\text{count of } w \text{ in class } c}{\text{count of words in class } c}.$$

This estimate of $P(w|c)$ can be problematic, since it gives probability $0$ to documents with unknown words. A common way of solving this problem is to use Laplace smoothing.

Let $V$ be the set of words in the training set, and add a new element $UNK$ (for unknown) to the set of words.

Define
$$P(w|c)=\frac{\text{count}(w,c) +1}{\text{count}(c) + |V| + 1},$$

where $V$ refers to the vocabulary (the words in the training set).

In particular, any unknown word will have probability
$$\frac{1}{\text{count}(c) + |V| + 1}.$$
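As a rough illustration of the smoothed estimator defined above, here is a minimal Python sketch. The function name `laplace_estimator` and the toy tokens are hypothetical and not from the quoted text; the only thing taken from it is the formula with denominator $\text{count}(c) + |V| + 1$.

```python
from collections import Counter
from fractions import Fraction


def laplace_estimator(class_tokens, vocabulary):
    """Return a function word -> P(word | c) using add-one smoothing with an UNK slot."""
    counts = Counter(class_tokens)
    total = len(class_tokens)             # count(c)
    denom = total + len(vocabulary) + 1   # count(c) + |V| + 1

    def prob(word):
        # Any word outside V falls into the single UNK slot, whose count is 0.
        count = counts[word] if word in vocabulary else 0
        return Fraction(count + 1, denom)

    return prob


# Hypothetical toy data: tokens of one class and the full training vocabulary.
positive_tokens = ["good", "good", "fun", "bad"]
vocabulary = {"good", "fun", "bad", "dull"}

p_pos = laplace_estimator(positive_tokens, vocabulary)
print(p_pos("good"))    # (2 + 1) / (4 + 4 + 1)
print(p_pos("dull"))    # in V but unseen in this class: (0 + 1) / 9
print(p_pos("zebra"))   # unknown word, treated as UNK: 1 / 9
```

With this denominator, the probabilities of the words in $V$ plus the single $UNK$ slot sum to 1, which is why the extra $+1$ appears.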

My question is this: why do we bother with Laplace smoothing at all? If the unknown words we encounter in the test set have a probability that is almost zero, namely $\frac{1}{\text{count}(c) + |V| + 1}$, what is the point of including them in the model? Why not just disregard them and delete them?

— Matt O'Brien

3

If you don't, then any statement containing a previously unseen word will have $p=0$. This means that an impossible event has come to pass, which means your model was an incredibly bad fit. Also, in a proper Bayesian model this could never happen, as the unknown word probability would have a numerator given by the prior (possibly not 1). So I don't know why this requires the fancy name "Laplace smoothing".

— conjectures

1

What text did the reading come from?

— словами:

17

You always need this "fail-safe" probability.

To see why, consider the worst case, where none of the words in the training sample appear in the test sentence. In this case, under your model we would conclude that the sentence is impossible, but it clearly exists, which is a contradiction.

Another extreme example is the test sentence "Alex met Steve." where "met" appears several times in the training sample but "Alex" and "Steve" don't. Your model would conclude that this statement is very likely, which is not true.
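The "Alex met Steve" case can be sketched numerically. The counts below (10 training tokens, "met" seen 3 times, a 5-word vocabulary) are made up for illustration; the smoothed estimate follows the add-one formula from the question.

```python
from fractions import Fraction

# Hypothetical training statistics for a single class.
count_c = 10          # total tokens seen in the class
vocab_size = 5        # |V|
counts = {"met": 3}   # "Alex" and "Steve" were never seen


def smoothed(word):
    """Add-one estimate with an UNK slot: (count + 1) / (count(c) + |V| + 1)."""
    return Fraction(counts.get(word, 0) + 1, count_c + vocab_size + 1)


sentence = ["Alex", "met", "Steve"]

# Option 1: silently drop unknown words, so only "met" contributes.
dropped = Fraction(1)
for word in sentence:
    if word in counts:
        dropped *= smoothed(word)

# Option 2: keep every word, so each unknown word adds a small factor.
kept = Fraction(1)
for word in sentence:
    kept *= smoothed(word)

print(dropped)  # 1/4
print(kept)     # 1/1024
```

Dropping the unknown words scores the sentence exactly as if it were just "met", while keeping them makes the sentence much less probable under the model.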

— Sid

I hate to sound like a complete moron, but would you mind elaborating? How does removing "Alex" and "Steve" change the likelihood of the statement occurring?

— Matt O'Brien

2

If we assume independence of the words, $P(\text{Alex})\,P(\text{Steve})\,P(\text{met}) \ll P(\text{met})$

— Sid

1

We could build a vocabulary when training the model on the training set, so why not just remove all new words that are not in the vocabulary when making predictions on the test set?

— avocado

15

Let's say you've trained your Naive Bayes classifier on 2 classes, "Ham" and "Spam" (i.e. it classifies emails). For the sake of simplicity, we'll assume the prior probabilities to be 50/50.

Now let's say you have an email $(w_1, w_2,...,w_n)$ which your classifier rates very highly as "Ham", say

$$P(Ham|w_1,w_2,...w_n) = .90$$

and

$$P(Spam|w_1,w_2,...w_n) = .10$$

So far so good.

Now let's say you have another email $(w_1, w_2, ...,w_n,w_{n+1})$ which is exactly the same as the email above, except that it contains one word that isn't in the vocabulary. Since this word's count is 0,

$$P(Ham|w_{n+1}) = P(Spam|w_{n+1}) = 0$$

Suddenly,

$$P(Ham|w_1,w_2,...w_n,w_{n+1}) = P(Ham|w_1,w_2,...w_n) * P(Ham|w_{n+1}) = 0$$

and

$$P(Spam|w_1,w_2,...w_n,w_{n+1}) = P(Spam|w_1,w_2,...w_n) * P(Spam|w_{n+1}) = 0$$

Even though the 1st email is strongly classified into one class, this 2nd email may be classified differently because that last word has a probability of zero.

Laplace smoothing solves this by giving the last word a small non-zero probability for both classes, so that the posterior probabilities don't suddenly drop to zero.
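A small sketch of this ham/spam scenario follows. The word counts and the out-of-vocabulary word "zyzzyva" are invented for illustration, and the code multiplies per-word likelihoods $P(w|\text{class})$ with the 50/50 prior and then normalises, which is one common way to obtain the posterior.

```python
from fractions import Fraction

# Hypothetical training counts (10 tokens per class) and vocabulary.
VOCAB = ["meeting", "lunch", "offer"]
COUNTS = {
    "ham":  {"meeting": 5, "lunch": 4, "offer": 1},
    "spam": {"meeting": 1, "lunch": 1, "offer": 8},
}
PRIOR = {"ham": Fraction(1, 2), "spam": Fraction(1, 2)}


def word_prob(word, label, smooth):
    """P(word | label): plain maximum likelihood, or add-one with an UNK slot."""
    total = sum(COUNTS[label].values())
    count = COUNTS[label].get(word, 0)
    if smooth:
        return Fraction(count + 1, total + len(VOCAB) + 1)
    return Fraction(count, total)


def joint_scores(words, smooth):
    """Prior times the product of per-word likelihoods, for each class."""
    scores = {}
    for label in COUNTS:
        score = PRIOR[label]
        for word in words:
            score *= word_prob(word, label, smooth)
        scores[label] = score
    return scores


def posterior_ham(words, smooth):
    """Normalised posterior for ham, or None when every class scores zero."""
    scores = joint_scores(words, smooth)
    total = sum(scores.values())
    return None if total == 0 else scores["ham"] / total


first = ["meeting", "lunch"]
second = first + ["zyzzyva"]   # one extra word that is not in the vocabulary

print(posterior_ham(first, smooth=False))   # 20/21
print(joint_scores(second, smooth=False))   # both classes score 0
print(posterior_ham(second, smooth=True))   # 15/17
```

In this toy setup both classes have the same number of training tokens, so the unknown word contributes the same factor to each class and the smoothed posterior is unchanged by it. When the class sizes differ, that factor differs between classes.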

— RVC

Why would we keep a word that doesn't exist in the vocabulary at all? Why not just remove it?

— avocado

4

If your classifier rates an email as likely to be ham, then it is p(ham | w1, ..., wn) that equals 0.9, not p(w1, ..., wn | ham).

— braaterAfrikaaner

5

This question is rather simple if you are familiar with Bayes estimators, since it is a direct consequence of the Bayes estimator.

In the Bayesian approach, parameters are treated as quantities whose variation can be described by a probability distribution (the prior distribution).

So, if we view the procedure of drawing words as sampling from a multinomial distribution, we can answer the question in a few steps.

First, define

$$m = |V|, \quad n = \sum_i n_i,$$

where $n_i$ is the observed count of the $i$-th word.

If we assume a uniform prior on the $p_i$, the posterior distribution given the counts is

$$p(p_1,p_2,...,p_m|n_1,n_2,...,n_m) = \frac{\Gamma(n+m)}{\prod\limits_{i=1}^{m}\Gamma(n_i+1)}\prod\limits_{i=1}^{m}p_i^{n_i}$$

This is a Dirichlet distribution, and the expectation of $p_i$ is

$$E[p_i] = \frac{n_i+1}{n+m}$$

A natural estimate of $p_i$ is the mean of the posterior distribution, so the Bayes estimator of $p_i$ is

$$\hat p_i = E[p_i]$$

You can see that we arrive at the same result as Laplace smoothing.
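The claim that the posterior mean equals the add-one estimate can be checked numerically. The sketch below is only a Monte Carlo sanity check under assumed counts `[3, 1, 0]`; it samples from a Dirichlet with parameters $n_i + 1$ by normalising independent Gamma draws, which is a standard construction.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


counts = [3, 1, 0]                  # hypothetical word counts n_i
m, n = len(counts), sum(counts)     # m = |V|, n = sum of n_i
alphas = [c + 1 for c in counts]    # posterior parameters under a uniform prior

rng = random.Random(0)              # fixed seed so the run is repeatable
num_samples = 20000
sums = [0.0] * m
for _ in range(num_samples):
    for i, p in enumerate(sample_dirichlet(alphas, rng)):
        sums[i] += p

posterior_means = [s / num_samples for s in sums]
laplace = [(c + 1) / (n + m) for c in counts]   # add-one estimates

for mean, estimate in zip(posterior_means, laplace):
    print(f"posterior mean ~ {mean:.3f}, add-one estimate = {estimate:.3f}")
```

The sampled means should agree with $(n_i + 1)/(n + m)$ up to Monte Carlo noise, including for the word with zero count.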

— Response777

4

Disregarding those words is another way to handle it. It corresponds to averaging (integrating out) over all missing variables. So the result is different. How?

Assuming the notation

$$C^{*} = \arg\max_{C} P(C|d) = \arg\max_{C} \frac{\prod_{i}p(t_{i}|C)P(C)}{P(d)} = \arg\max_{C} \prod_{i}p(t_{i}|C)P(C)$$

where $t_{i}$ are the tokens in the vocabulary and $d$ is a document. The last step holds because $P(d)$ does not depend on $C$.

Let's say token $t_{k}$ does not appear. Instead of using Laplace smoothing (which comes from imposing a Dirichlet prior on the multinomial Bayes), you sum out $t_{k}$, which corresponds to saying: I take a weighted vote over all possibilities for the unknown tokens (having them or not).

$$C^{*} = \arg\max_{C} \sum_{t_{k}} \prod_{i}p(t_{i}|C)P(C) = \arg\max_{C} P(C)\prod_{i \neq k}p(t_{i}|C) \sum_{t_{k}} p(t_{k}|C) = \arg\max_{C} P(C)\prod_{i \neq k}p(t_{i}|C)$$

The last equality uses $\sum_{t_{k}} p(t_{k}|C) = 1$.

But in practice one prefers the smoothing approach. Instead of ignoring those tokens, you assign them a low probability, which is like thinking: if I have unknown tokens, it is less likely that this is the kind of document I'd otherwise think it is.
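The "summing out" step can be verified on a toy model. The two classes and their token probabilities below are hypothetical; the point is only that summing the score over every possible value of the missing token gives the same score as leaving that token out.

```python
from fractions import Fraction

# Hypothetical class-conditional token distributions; each row sums to 1.
P_TOKEN = {
    "C1": {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)},
    "C2": {"a": Fraction(1, 6), "b": Fraction(1, 3), "c": Fraction(1, 2)},
}
PRIOR = {"C1": Fraction(1, 2), "C2": Fraction(1, 2)}


def score_ignoring(observed, label):
    """P(C) times the product over observed tokens only (unknown token dropped)."""
    score = PRIOR[label]
    for token in observed:
        score *= P_TOKEN[label][token]
    return score


def score_summed_out(observed, label):
    """Same score, but summed over every value the unknown token could take."""
    base = score_ignoring(observed, label)
    return sum(base * p for p in P_TOKEN[label].values())


observed = ["a", "a", "b"]   # the known part of the document
for label in P_TOKEN:
    print(label, score_ignoring(observed, label), score_summed_out(observed, label))
```

Both functions return identical scores for each class, so under this view ignoring an unknown token leaves the class ranking exactly as it was, whereas smoothing multiplies each class score by an extra small factor.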

— jpmuc

2

You want to know why we bother with smoothing at all in a Naive Bayes classifier (when we can throw away the unknown features instead).

The answer to your question is: not all words have to be unknown in all classes.

Say there are two classes M and N with features A, B and C, as follows:

M: A=3, B=1, C=0

(In the class M, A appears 3 times and B only once)

N: A=0, B=1, C=3

(In the class N, C appears 3 times and B only once)

Let's see what happens when you throw away features that appear zero times.

A) Throw Away Features That Appear Zero Times In Any Class

If you throw away features A and C because they appear zero times in any of the classes, then you are only left with feature B to classify documents with.

And losing that information is a bad thing as you will see below!

If you're presented with a test document as follows:

B=1, C=3

(It contains B once and C three times)

Now, since you've discarded the features A and C, you won't be able to tell whether the above document belongs to class M or class N.

So, losing any feature information is a bad thing!

B) Throw Away Features That Appear Zero Times In All Classes

Is it possible to get around this problem by discarding only those features that appear zero times in all of the classes?

No, because that would create its own problems!

The following test document illustrates what would happen if we did that:

A=3, B=1, C=1

The probability of M and N would both become zero (because we did not throw away the zero probability of A in class N and the zero probability of C in class M).

C) Don't Throw Anything Away - Use Smoothing Instead

Smoothing allows you to classify both the above documents correctly because:

You do not lose count information in classes where such information is available and
You do not have to contend with zero counts.
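The M/N example can be worked through in a few lines. This is a sketch assuming equal class priors and add-`alpha` smoothing over the three known features (there is no unknown feature here, so no extra UNK slot is used); it is not the NLTK implementation.

```python
from fractions import Fraction

# Training counts from the example above.
TRAIN = {
    "M": {"A": 3, "B": 1, "C": 0},
    "N": {"A": 0, "B": 1, "C": 3},
}
FEATURES = ["A", "B", "C"]


def likelihood(doc, label, alpha):
    """Product of P(feature | label) ** occurrences, with add-alpha smoothing."""
    total = sum(TRAIN[label].values())
    score = Fraction(1)
    for feature, times in doc.items():
        p = Fraction(TRAIN[label][feature] + alpha, total + alpha * len(FEATURES))
        score *= p ** times
    return score


def classify(doc, alpha):
    """Return (best label, or None if every class scores zero) and the raw scores."""
    scores = {label: likelihood(doc, label, alpha) for label in TRAIN}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


doc1 = {"B": 1, "C": 3}            # first test document
doc2 = {"A": 3, "B": 1, "C": 1}    # second test document

print(classify(doc2, alpha=0))     # no smoothing: both classes score zero
print(classify(doc1, alpha=1))     # smoothing: class N wins
print(classify(doc2, alpha=1))     # smoothing: class M wins
```

With `alpha=0` the second document gets a zero score in both classes, as described in case B, while `alpha=1` keeps all the count information and separates the two documents.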

Naive Bayes Classifiers In Practice

The Naive Bayes classifier in NLTK used to throw away features that had zero counts in any of the classes.

This used to make it perform poorly when trained using a hard EM procedure (where the classifier is bootstrapped up from very little training data).

— Aiaioo Labs

2

@Aiaioo Labs You failed to realize that he was referring to words that did not appear in the training set at all. In your example, that would be, say, a feature D appearing. The issue isn't with Laplace smoothing on the calculations from the training set but on the test set. Using Laplace smoothing on unknown words from the TEST set skews the probability towards whichever class had the fewest tokens, because (0 + 1)/(2 + 3) is bigger than (0 + 1)/(3 + 3) (if one of the classes had 3 tokens and the other had 2). ...

2

This can actually turn a correct classification into an incorrect one if enough unknown words are smoothed into the equation. Laplace smoothing is fine for training set calculations, but detrimental to test set analysis. Also imagine you have a test set with all unknown words: it should be classified immediately into the class with the highest prior probability, but in fact it can be, and usually is, classified as the class with the fewest tokens instead.
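The skew described in this comment can be made concrete with its own numbers: one class with 2 training tokens, one with 3, and a vocabulary of size 3. The 2:1 evidence from the known words is a made-up assumption, and the sketch uses the comment's formula $(0 + 1)/(\text{count}(c) + |V|)$ for an unknown word.

```python
from fractions import Fraction

VOCAB_SIZE = 3
CLASS_TOKENS = {"small": 2, "large": 3}   # count(c) per class, as in the comment


def unknown_word_prob(label):
    """Smoothed probability of a word never seen in training: (0 + 1) / (count(c) + |V|)."""
    return Fraction(0 + 1, CLASS_TOKENS[label] + VOCAB_SIZE)


# Hypothetical evidence from the known words: they favour "large" two to one.
known_evidence = {"small": Fraction(1), "large": Fraction(2)}


def winner(num_unknown):
    """Class with the highest score after adding num_unknown unseen words."""
    scores = {
        label: known_evidence[label] * unknown_word_prob(label) ** num_unknown
        for label in CLASS_TOKENS
    }
    return max(scores, key=scores.get)


for k in range(6):
    print(k, winner(k))
```

Each unknown word multiplies the odds in favour of the smaller class by $6/5$, so after four unknown words the prediction flips from "large" to "small" even though the known words pointed the other way.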

@DrakeThatcher, I strongly agree with you. Yes, if we don't remove words that are not in the vocabulary, then the predicted probability will be skewed towards the class with the fewest words.

— avocado

1

I also came across the same problem while studying Naive Bayes.

As I understand it, whenever we encounter a test example containing something we never came across during training, our posterior probability becomes 0.

So by adding 1, even if we never trained on a particular feature/class combination, the posterior probability will never be 0.

— Sarthak Khanna

1

Matt, you are correct, and you raise a very good point: yes, Laplace smoothing is quite frankly nonsense! Simply throwing away those features can be a valid approach, particularly when the denominator is also a small number, since there is simply not enough evidence to support the probability estimate.

I have a strong aversion to solving any problem by means of some arbitrary adjustment. The problem here is zeros, and the "solution" is to just "add some small value to zero so it's not zero anymore: MAGIC, the problem is no more". Of course that's totally arbitrary.

Your suggestion of better feature selection to begin with is a less arbitrary approach and, in my experience, increases performance. Furthermore, Laplace smoothing in conjunction with Naive Bayes as the model has, in my experience, worsened the granularity problem, i.e. the problem where the output scores tend to be close to 1.0 or 0.0 (if the number of features is infinite then every score will be 1.0 or 0.0, which is a consequence of the independence assumption).

Alternative techniques for probability estimation do exist (other than maximum likelihood + Laplace smoothing), but they are massively under-documented. In fact there is a whole field called Inductive Logic and Inference Processes that uses a lot of tools from Information Theory.

What we use in practice is Minimum Cross Entropy Updating, which is an extension of Jeffrey's Updating, where we define the convex region of probability space consistent with the evidence to be the region such that, for any point in it, the Maximum Likelihood estimate is within the Expected Absolute Deviation from that point.

This has the nice property that as the number of data points decreases, the estimates piecewise smoothly approach the prior, and therefore their effect in the Bayesian calculation is null. Laplace smoothing, on the other hand, makes each estimate approach the point of Maximum Entropy, which may not be the prior, and therefore the effect in the calculation is not null and will just add noise.

— samthebest