Manually calculated $R^2$ does not match the $R^2$ reported by randomForest on test data

I know this is a fairly specific R question, but I may be thinking about the proportion of variance explained, $R^2$, incorrectly. Here goes.

I'm trying to use the R package randomForest. I have some training data and some testing data. When I fit a random forest model, the randomForest function lets you pass in new testing data to evaluate the model on. It then reports the percentage of variance explained in this new data. When I look at this, I get one number.

When I use the predict() function to predict the outcome for the testing data from the model fit on the training data, and then take the squared correlation coefficient between these predictions and the actual outcome values for the testing data, I get a different number. These values don't match up.

Here's some R code to demonstrate the problem.

# use the built in iris data
data(iris)

#load the randomForest library
library(randomForest)

# split the data into training and testing sets
index <- 1:nrow(iris)
trainindex <- sample(index, trunc(length(index)/2))
trainset <- iris[trainindex, ]
testset <- iris[-trainindex, ]

# fit a model to the training set (column 1, Sepal.Length, will be the outcome)
set.seed(42)
model <- randomForest(x=trainset[ ,-1],y=trainset[ ,1])

# predict values for the testing set (the first column is the outcome, leave it out)
predicted <- predict(model, testset[ ,-1])

# what's the squared correlation coefficient between predicted and actual values?
cor(predicted, testset[, 1])^2

# now, refit the model using built-in x.test and y.test
set.seed(42)
randomForest(x=trainset[ ,-1], y=trainset[ ,1], xtest=testset[ ,-1], ytest=testset[ ,1])

— Stephen Turner

The reason the $R^2$ values do not match is that randomForest reports variation explained, as opposed to variance explained. I think this is a common misunderstanding about $R^2$ that is perpetuated in textbooks. I even mentioned this on another thread the other day. If you want an example, see the (otherwise quite good) textbook Seber and Lee, Linear Regression Analysis, 2nd ed.

A general definition for $R^2$ is

$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} .$

That is, we compute the mean squared error, divide it by the variance of the original observations, and then subtract the result from one. (Note that if your predictions are really bad, this value can go negative.)
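As a minimal sketch of how far the two quantities can drift apart, consider predictions that track the outcome perfectly but are shifted by a constant. The vectors y and pred below are made-up toy values for illustration, not taken from the iris example in the question.

```r
# Toy data, hypothetical and chosen only for illustration
y    <- c(1, 2, 3, 4, 5)
pred <- y + 3   # perfectly correlated with y, but biased upward by 3

# General definition: one minus (sum of squared errors / total sum of squares)
r2_general <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

# Squared correlation between observed and predicted values
r2_corr <- cor(y, pred)^2

r2_general  # -3.5: the constant bias is penalised
r2_corr     # 1, up to floating-point rounding: correlation ignores the bias
```

The squared correlation is unchanged by shifting or rescaling the predictions, while the general definition measures how close the predictions actually are to the observations.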

In linear regression with an intercept term, the mean of the fitted values $\hat{y}_i$ matches $\bar{y}$. Furthermore, the residual vector $y - \hat{y}$ is orthogonal to the vector of fitted values $\hat{y}$. When you put these two things together, the definition reduces to the one that is more commonly encountered, i.e.,

$R^2_{\mathrm{LR}} = \mathrm{Corr}(y,\hat{y})^2 .$

(I've used the subscript $\mathrm{LR}$ in $R^2_{\mathrm{LR}}$ to indicate linear regression.)
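For readers who want the intermediate steps, here is a sketch of the reduction. It assumes ordinary least squares with an intercept and fitted values that are not all equal. The residual notation $e_i = y_i - \hat{y}_i$ is introduced here for convenience. The two facts used are $\sum_i e_i = 0$ (so the fitted values have mean $\bar{y}$) and $\sum_i e_i \hat{y}_i = 0$ (orthogonality).

```latex
\begin{align*}
\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{y})
  &= \sum_i (\hat{y}_i - \bar{y} + e_i)(\hat{y}_i - \bar{y})
   = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i e_i \hat{y}_i - \bar{y} \sum_i e_i
   = \sum_i (\hat{y}_i - \bar{y})^2 , \\
\sum_i (y_i - \bar{y})^2
  &= \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i e_i^2 , \\
\mathrm{Corr}(y, \hat{y})^2
  &= \frac{\left[ \sum_i (y_i - \bar{y})(\hat{y}_i - \bar{y}) \right]^2}
          {\sum_i (y_i - \bar{y})^2 \, \sum_i (\hat{y}_i - \bar{y})^2}
   = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}
   = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} .
\end{align*}
```

Neither fact is guaranteed for predictions from a random forest on held-out data, which is why the two numbers can differ there.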

The randomForest call is using the first definition, so if you do

   > y <- testset[,1]
   > 1 - sum((y-predicted)^2)/sum((y-mean(y))^2)

you'll see that the answers match.

— cardinal

+1, great answer. I always wondered why the original formula is used for $R^2$ instead of the square of the correlation. For linear regression it is the same, but when applied to other contexts it is always confusing.

— mpiktas

(+1) Very elegant response, indeed.

— chl

@mpiktas, @chl, I'll try to expand on this a little more later today. Basically, there's a close (but, perhaps, slightly hidden) connection to hypothesis testing in the background. Even in a linear regression setting, if the constant vector is not in the column space of the design matrix, then the "correlation" definition will fail.

— cardinal

If you have a reference other than the Seber/Lee textbook (not accessible to me) I would love to see a good explanation of how variation explained (i.e. 1-SSerr/SStot) differs from the squared correlation coefficient, or variance explained. Thanks again for the tip.

— Stephen Turner

If the R-squared value from an instrumental variable regression is negative, is there a way to suppress this negative value and translate it into a positive one for the sake of reporting? Please refer to this link: stata.com/support/faqs/statistics/two-stage-least-squares

— Eric