Logistic Regression on Imbalanced Data
caojilin
5/11/2019
# Load required packages: MASS for mvrnorm, PRROC for pr.curve,
# and caret for confusionMatrix
library(MASS)
library(PRROC)
library(caret)

# Generate data from two multivariate normal distributions.
# First experiment: a highly imbalanced two-class dataset.
set.seed(2)
num_class1 = 1000
num_class2 = 20
X = matrix(rep(0, (num_class1 + num_class2)*2), ncol=2)
mu1 = c(-2, -2)
sigma1 = diag(2)
mu2 = c(1, 1)
sigma2 = matrix(c(1, 0, 0, 2), 2, 2, byrow = T)
for (i in 1:num_class1) {
  obs = mvrnorm(1, mu1, sigma1)
  X[i, 1] = obs[1]
  X[i, 2] = obs[2]
}
for (i in (num_class1+1):(num_class1+num_class2)) {
  obs = mvrnorm(1, mu2, sigma2)
  X[i, 1] = obs[1]
  X[i, 2] = obs[2]
}
label = c(rep(1,num_class1), rep(0,num_class2))
plot(X, col=label+2, main = "Original")
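As an aside, mvrnorm can draw all the rows for a class in one call, so the two loops above can be replaced by a vectorized sketch (equivalent in distribution, though the same seed will consume random numbers in a different order):
# Vectorized alternative to the loops above
X = rbind(mvrnorm(num_class1, mu1, sigma1),
          mvrnorm(num_class2, mu2, sigma2))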
logistic.model = glm(label ~ X , family = "binomial")
summary(logistic.model)
##
## Call:
## glm(formula = label ~ X, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.13907 0.00776 0.02143 0.05458 1.91459
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7480 0.4652 1.608 0.108
## X1 -1.6758 0.3665 -4.572 4.82e-06 ***
## X2 -2.1960 0.4685 -4.687 2.77e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 196.878 on 1019 degrees of freedom
## Residual deviance: 52.963 on 1017 degrees of freedom
## AIC: 58.963
##
## Number of Fisher Scoring iterations: 9
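A quick way to read these coefficients is on the odds scale: exponentiating turns log-odds into odds ratios (a small addition, using only the fitted model above):
# Odds ratios: exp(-1.6758) is about 0.19, so each unit increase in X1
# multiplies the odds of label = 1 by roughly 0.19, holding X2 fixed
exp(coef(logistic.model))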
y_hat_prob = predict(logistic.model, data.frame(X), type = "response")
scores.class0 = y_hat_prob[label == 1]
scores.class1 = y_hat_prob[label == 0]
pr1 = pr.curve(scores.class0, scores.class1, curve = T)
plot(pr1)
y_hat = y_hat_prob > 0.5
mean(y_hat == label)
## [1] 0.9911765
Training accuracy, as you can see, is pretty high. However, it doesn't mean much on its own, because even a trivial classifier can achieve high accuracy on imbalanced data. Why do we care? Suppose one of our classes represents a rare event, such as a disease. In the population, the percentage of people who have the disease is very low, say 1%. A classifier that labels everyone healthy is 99% accurate yet detects nobody, so you can't just say "my classifier has 99% accuracy for detecting disease." Instead of accuracy (which is only one of many metrics), we care more about recall, and we accept a sacrifice in precision. Why? We have to make sure we identify (almost) all of the people who might have the disease: more false positives do less harm than more false negatives. The way we do this is to adjust the classification threshold to control the recall.
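To make that concrete, here is a minimal sketch that computes accuracy, recall, and precision by hand for any cutoff, treating label 0 (the sick class) as positive to match confusionMatrix below. metrics_at is our own helper, not from any package:
metrics_at = function(cutoff) {
  pred = ifelse(y_hat_prob > cutoff, 1, 0)   # predicted label (1 = healthy)
  tp = sum(pred == 0 & label == 0)           # sick, correctly flagged
  fp = sum(pred == 0 & label == 1)           # healthy, wrongly flagged
  fn = sum(pred == 1 & label == 0)           # sick, missed
  c(accuracy  = mean(pred == label),
    recall    = tp / (tp + fn),
    precision = tp / (tp + fp))
}
metrics_at(0.5)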
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 13 2
## 1 7 998
##
## Accuracy : 0.9912
## 95% CI : (0.9833, 0.996)
## No Information Rate : 0.9804
## P-Value [Acc > NIR] : 0.004687
##
## Kappa : 0.7385
##
## Mcnemar's Test P-Value : 0.182422
##
## Sensitivity : 0.65000
## Specificity : 0.99800
## Pos Pred Value : 0.86667
## Neg Pred Value : 0.99303
## Prevalence : 0.01961
## Detection Rate : 0.01275
## Detection Prevalence : 0.01471
## Balanced Accuracy : 0.82400
##
## 'Positive' Class : 0
##
In this case, 0 means sick people (positive) and 1 means healthy people (negative).
We have 20 diseased people in total.
recall = 0.65: among all sick people, we find only 65% of them, which means
we missed 35% of the sick people. This is really bad.
precision = 13/15 = 0.8666667: among the people we flag as sick, about 87% of them
are truly sick.
True positive: Sick people correctly identified as sick
False positive: Healthy people incorrectly identified as sick
True negative: Healthy people correctly identified as healthy
False negative: Sick people incorrectly identified as healthy
Hypothesis testing analogy: is this person healthy? The null hypothesis is that the person is healthy.
The alternative is that the person is sick.
Type I error: the null hypothesis is true, but we reject it.
The probability of a Type I error is called the significance level, usually denoted by \(\alpha\). This is a threshold we pick and control; common choices are 0.05 and 0.01. The p-value is then the probability, assuming the null is true, of a test statistic greater than or equal to the one observed. (We don't have a test statistic here; the comparison is only an analogy.)
Type II error: the null hypothesis is false, but we accept it.
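In code, the analogy is direct: with null = healthy (label 1) and "reject" = flagging someone as sick, the empirical Type I and Type II error rates at a cutoff are just the false positive and false negative rates. A sketch (alpha_hat and beta_hat are our own names):
pred = ifelse(y_hat_prob > 0.5, 1, 0)
alpha_hat = sum(pred == 0 & label == 1) / sum(label == 1)  # healthy called sick
beta_hat  = sum(pred == 1 & label == 0) / sum(label == 0)  # sick called healthy
c(alpha_hat = alpha_hat, beta_hat = beta_hat)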
Now, let's try a different threshold.
# Now make the threshold stricter
y_hat = y_hat_prob > 0.95
mean(y_hat == label)
## [1] 0.9794118
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 19 20
## 1 1 980
##
## Accuracy : 0.9794
## 95% CI : (0.9687, 0.9872)
## No Information Rate : 0.9804
## P-Value [Acc > NIR] : 0.6445
##
## Kappa : 0.6346
##
## Mcnemar's Test P-Value : 8.568e-05
##
## Sensitivity : 0.95000
## Specificity : 0.98000
## Pos Pred Value : 0.48718
## Neg Pred Value : 0.99898
## Prevalence : 0.01961
## Detection Rate : 0.01863
## Detection Prevalence : 0.03824
## Balanced Accuracy : 0.96500
##
## 'Positive' Class : 0
##
Now our accuracy has decreased from 0.9911765 to 0.9794118.
recall = 0.95: among all sick people, we identified 95% of them.
precision = 0.48718: among the people we flag as sick, only about half of them are really sick.
Healthy people who are flagged as sick can be given further medical tests. However, if we miss potentially sick people, it is hard to get them back, and their disease will get worse as the days go by.
We sacrificed precision in exchange for recall.
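The trade-off holds across the whole range of cutoffs. A small sketch that sweeps the threshold (class 0 treated as positive, grid chosen arbitrarily) makes the exchange visible:
cutoffs = seq(0.05, 0.95, by = 0.05)
tradeoff = t(sapply(cutoffs, function(cut) {
  pred = ifelse(y_hat_prob > cut, 1, 0)
  tp = sum(pred == 0 & label == 0)
  fp = sum(pred == 0 & label == 1)
  fn = sum(pred == 1 & label == 0)
  c(cutoff = cut, recall = tp / (tp + fn), precision = tp / (tp + fp))
}))
# Recall rises and precision falls as the cutoff increases
matplot(tradeoff[, "cutoff"], tradeoff[, c("recall", "precision")],
        type = "l", lty = 1, xlab = "cutoff", ylab = "metric")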
Now, let’s see the effect of sample size
set.seed(2)
num_class1 = 100000
num_class2 = 2000
X = matrix(rep(0, (num_class1 + num_class2)*2), ncol=2)
mu1 = c(-2, -2)
sigma1 = diag(2)
mu2 = c(1, 1)
sigma2 = matrix(c(1, 0, 0, 2), 2, 2, byrow = T)
for (i in 1:num_class1) {
  obs = mvrnorm(1, mu1, sigma1)
  X[i, 1] = obs[1]
  X[i, 2] = obs[2]
}
for (i in (num_class1+1):(num_class1+num_class2)) {
  obs = mvrnorm(1, mu2, sigma2)
  X[i, 1] = obs[1]
  X[i, 2] = obs[2]
}
label = c(rep(1,num_class1), rep(0,num_class2))
plot(X, col=label+2, main = "Original")
logistic.model = glm(label ~ X , family = "binomial")
summary(logistic.model)
##
## Call:
## glm(formula = label ~ X, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.6695 0.0014 0.0051 0.0180 2.9227
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.83972 0.05698 14.74 <2e-16 ***
## X1 -2.90964 0.07275 -40.00 <2e-16 ***
## X2 -2.24953 0.06125 -36.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 19687.8 on 101999 degrees of freedom
## Residual deviance: 2724.2 on 101997 degrees of freedom
## AIC: 2730.2
##
## Number of Fisher Scoring iterations: 11
y_hat_prob = predict(logistic.model, data.frame(X), type = "response")
scores.class0 = y_hat_prob[label == 1]
scores.class1 = y_hat_prob[label == 0]
pr1 = pr.curve(scores.class0, scores.class1, curve = T)
plot(pr1)
y_hat = y_hat_prob > 0.5
mean(y_hat == label)
## [1] 0.9956765
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1688 129
## 1 312 99871
##
## Accuracy : 0.9957
## 95% CI : (0.9953, 0.9961)
## No Information Rate : 0.9804
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8823
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.84400
## Specificity : 0.99871
## Pos Pred Value : 0.92900
## Neg Pred Value : 0.99689
## Prevalence : 0.01961
## Detection Rate : 0.01655
## Detection Prevalence : 0.01781
## Balanced Accuracy : 0.92135
##
## 'Positive' Class : 0
##
# Stricter threshold
y_hat = y_hat_prob > 0.95
mean(y_hat == label)
## [1] 0.9867549
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1903 1254
## 1 97 98746
##
## Accuracy : 0.9868
## 95% CI : (0.986, 0.9874)
## No Information Rate : 0.9804
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7316
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.95150
## Specificity : 0.98746
## Pos Pred Value : 0.60279
## Neg Pred Value : 0.99902
## Prevalence : 0.01961
## Detection Rate : 0.01866
## Detection Prevalence : 0.03095
## Balanced Accuracy : 0.96948
##
## 'Positive' Class : 0
##
Now let's split the data into two parts and evaluate the prediction power.
total_size = num_class1 + num_class2
train.index = sample(1:total_size, 0.8*total_size)
train.feature = X[train.index,]
train.label = label[train.index]
test.feature = X[-train.index,]
test.label = label[-train.index]
# check the ratio of two classes in train and test datasets
table(train.label)
## train.label
## 0 1
## 1616 79984
table(test.label)
## test.label
## 0 1
## 384 20016
# They look fine: both are around 2% positive (sick) class
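As a side note, the split above is simple random sampling, so the class ratio in each half is only approximately preserved. caret's createDataPartition does stratified sampling, keeping the ratio essentially exact (an alternative sketch, not what the analysis below uses):
# Stratified 80/20 split: sampling within each class preserves the ratio
strat.index = createDataPartition(as.factor(label), p = 0.8, list = FALSE)
table(label[strat.index])
table(label[-strat.index])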
logistic.model = glm(train.label ~ X1 + X2,
data=data.frame(train.feature),
family = "binomial")
summary(logistic.model)
##
## Call:
## glm(formula = train.label ~ X1 + X2, family = "binomial", data = data.frame(train.feature))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.6665 0.0014 0.0050 0.0180 2.9453
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.81650 0.06427 12.70 <2e-16 ***
## X1 -2.93452 0.08168 -35.93 <2e-16 ***
## X2 -2.24043 0.06838 -32.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 15875.3 on 81599 degrees of freedom
## Residual deviance: 2179.2 on 81597 degrees of freedom
## AIC: 2185.2
##
## Number of Fisher Scoring iterations: 11
y_hat_prob = predict(logistic.model, data.frame(train.feature), type = "response")
scores.class0 = y_hat_prob[train.label == 1]
scores.class1 = y_hat_prob[train.label == 0]
pr1 = pr.curve(scores.class0, scores.class1, curve = T)
plot(pr1)
y_hat = y_hat_prob > 0.5
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(train.label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1368 103
## 1 248 79881
##
## Accuracy : 0.9957
## 95% CI : (0.9952, 0.9961)
## No Information Rate : 0.9802
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8841
##
## Mcnemar's Test P-Value : 1.516e-14
##
## Sensitivity : 0.84653
## Specificity : 0.99871
## Pos Pred Value : 0.92998
## Neg Pred Value : 0.99690
## Prevalence : 0.01980
## Detection Rate : 0.01676
## Detection Prevalence : 0.01803
## Balanced Accuracy : 0.92262
##
## 'Positive' Class : 0
##
A 0.5 threshold gives us recall = 0.84653 and precision = 0.92998 (the Sensitivity and Pos Pred Value above).
y_hat = y_hat_prob > 0.95
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(train.label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1537 994
## 1 79 78990
##
## Accuracy : 0.9869
## 95% CI : (0.986, 0.9876)
## No Information Rate : 0.9802
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7348
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.95111
## Specificity : 0.98757
## Pos Pred Value : 0.60727
## Neg Pred Value : 0.99900
## Prevalence : 0.01980
## Detection Rate : 0.01884
## Detection Prevalence : 0.03102
## Balanced Accuracy : 0.96934
##
## 'Positive' Class : 0
##
With cutoff = 0.95, the training recall = 0.95111 and precision = 0.60727.
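Rather than guessing 0.95, one could pick the smallest cutoff whose training recall clears a target. Since a person is flagged sick exactly when their predicted P(healthy) is at or below the cutoff, that cutoff is the target quantile of the sick group's scores (a sketch; cutoff_for_recall is our own helper, not a library function):
cutoff_for_recall = function(scores, labels, target = 0.95) {
  # recall = fraction of sick (label 0) with score <= cutoff, so the
  # smallest cutoff reaching the target is the target quantile (type 1
  # = inverse ECDF, an actual observed score) of the sick scores
  quantile(scores[labels == 0], probs = target, type = 1)
}
cutoff_for_recall(y_hat_prob, train.label)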
Remember, our task is to correctly identify as many sick people as possible,
so we want high recall.
Now let's test this cutoff on the held-out test set.
y_hat_prob = predict(logistic.model, newdata=data.frame(test.feature), type = "response")
y_hat = y_hat_prob > 0.95
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(test.label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 368 272
## 1 16 19744
##
## Accuracy : 0.9859
## 95% CI : (0.9842, 0.9875)
## No Information Rate : 0.9812
## P-Value [Acc > NIR] : 1.406e-07
##
## Kappa : 0.712
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.95833
## Specificity : 0.98641
## Pos Pred Value : 0.57500
## Neg Pred Value : 0.99919
## Prevalence : 0.01882
## Detection Rate : 0.01804
## Detection Prevalence : 0.03137
## Balanced Accuracy : 0.97237
##
## 'Positive' Class : 0
##
So on the test dataset we still get 0.95833 recall, which is good.
reference: https://statisticalhorizons.com/logistic-regression-for-rare-events. Sample size matters:
"you have a sample size of 1000 but only 20 events, you have a problem. If you have a sample size of 10,000 with 200 events, you may be OK. If your sample has 100,000 cases with 2000 events, you're golden."
"The problem is not that the classes are imbalanced per se, it is that there may not be sufficient patterns belonging to the minority class to adequately represent its distribution. This means that the problem can arise for any classifier (even if you have a synthetic problem and you know you have the true model), not just logistic regression. The good thing is that as more data become available, the 'class imbalance' problem usually goes away."
Let's try a smaller sample size.
set.seed(2)
num_class1 = 10000
num_class2 = 200
X = matrix(rep(0, (num_class1 + num_class2)*2), ncol=2)
mu1 = c(-2, -2)
sigma1 = diag(2)
mu2 = c(1, 1)
sigma2 = matrix(c(1, 0, 0, 2), 2, 2, byrow = T)
for (i in 1:num_class1) {
  obs = mvrnorm(1, mu1, sigma1)
  X[i, 1] = obs[1]
  X[i, 2] = obs[2]
}
for (i in (num_class1+1):(num_class1+num_class2)) {
  obs = mvrnorm(1, mu2, sigma2)
  X[i, 1] = obs[1]
  X[i, 2] = obs[2]
}
label = c(rep(1,num_class1), rep(0,num_class2))
plot(X, col=label+2, main = "Original")
Now let's split the data into two parts and evaluate the prediction power.
total_size = num_class1 + num_class2
train.index = sample(1:total_size, 0.8*total_size)
train.feature = X[train.index,]
train.label = label[train.index]
test.feature = X[-train.index,]
test.label = label[-train.index]
# check the ratio of two classes in train and test datasets
table(train.label)
## train.label
## 0 1
## 161 7999
table(test.label)
## test.label
## 0 1
## 39 2001
# They look fine: both are around 2% positive (sick) class
logistic.model = glm(train.label ~ X1 + X2,
data=data.frame(train.feature),
family = "binomial")
summary(logistic.model)
##
## Call:
## glm(formula = train.label ~ X1 + X2, family = "binomial", data = data.frame(train.feature))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6963 0.0016 0.0055 0.0188 2.4608
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8350 0.1913 4.365 1.27e-05 ***
## X1 -2.6488 0.2391 -11.077 < 2e-16 ***
## X2 -2.4457 0.2219 -11.020 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1582.84 on 8159 degrees of freedom
## Residual deviance: 225.57 on 8157 degrees of freedom
## AIC: 231.57
##
## Number of Fisher Scoring iterations: 11
y_hat_prob = predict(logistic.model, data.frame(train.feature), type = "response")
scores.class0 = y_hat_prob[train.label == 1]
scores.class1 = y_hat_prob[train.label == 0]
pr1 = pr.curve(scores.class0, scores.class1, curve = T)
plot(pr1)
y_hat = y_hat_prob > 0.5
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(train.label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 136 11
## 1 25 7988
##
## Accuracy : 0.9956
## 95% CI : (0.9939, 0.9969)
## No Information Rate : 0.9803
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8809
##
## Mcnemar's Test P-Value : 0.03026
##
## Sensitivity : 0.84472
## Specificity : 0.99862
## Pos Pred Value : 0.92517
## Neg Pred Value : 0.99688
## Prevalence : 0.01973
## Detection Rate : 0.01667
## Detection Prevalence : 0.01801
## Balanced Accuracy : 0.92167
##
## 'Positive' Class : 0
##
y_hat = y_hat_prob > 0.95
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(train.label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 153 102
## 1 8 7897
##
## Accuracy : 0.9865
## 95% CI : (0.9838, 0.9889)
## No Information Rate : 0.9803
## P-Value [Acc > NIR] : 1.093e-05
##
## Kappa : 0.729
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.95031
## Specificity : 0.98725
## Pos Pred Value : 0.60000
## Neg Pred Value : 0.99899
## Prevalence : 0.01973
## Detection Rate : 0.01875
## Detection Prevalence : 0.03125
## Balanced Accuracy : 0.96878
##
## 'Positive' Class : 0
##
y_hat_prob = predict(logistic.model, newdata=data.frame(test.feature), type = "response")
y_hat = y_hat_prob > 0.95
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(test.label))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 26
## 1 2 1975
##
## Accuracy : 0.9863
## 95% CI : (0.9802, 0.9909)
## No Information Rate : 0.9809
## P-Value [Acc > NIR] : 0.03964
##
## Kappa : 0.7189
##
## Mcnemar's Test P-Value : 1.383e-05
##
## Sensitivity : 0.94872
## Specificity : 0.98701
## Pos Pred Value : 0.58730
## Neg Pred Value : 0.99899
## Prevalence : 0.01912
## Detection Rate : 0.01814
## Detection Prevalence : 0.03088
## Balanced Accuracy : 0.96786
##
## 'Positive' Class : 0
##