Logistic regression with imbalanced classes
# packages: MASS for mvrnorm(), PRROC for pr.curve(), caret for confusionMatrix()
library(MASS)
library(PRROC)
library(caret)

# generate data from two multivariate normal distributions
# first experiment: two very imbalanced classes
set.seed(2)

num_class1 = 1000
num_class2 = 20
X = matrix(0, nrow = num_class1 + num_class2, ncol = 2)
mu1 = c(-2, -2)
sigma1 = diag(2)
mu2 = c(1, 1)
sigma2 = matrix(c(1, 0, 0, 2), 2, 2, byrow = T)

for (i in 1:num_class1) {
  obs = mvrnorm(1, mu1, sigma1)  # one draw from N(mu1, sigma1)
  X[i, ] = obs
}
for (i in (num_class1 + 1):(num_class1 + num_class2)) {
  obs = mvrnorm(1, mu2, sigma2)  # one draw from N(mu2, sigma2)
  X[i, ] = obs
}
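# note: the loops above could be replaced by two vectorized calls, e.g.
# X = rbind(mvrnorm(num_class1, mu1, sigma1), mvrnorm(num_class2, mu2, sigma2))
# (not done here: it consumes the RNG stream differently, so the numbers
# shown below would change)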

label = c(rep(1,num_class1), rep(0,num_class2))

plot(X, col=label+2, main = "Original")

logistic.model = glm(label ~ X, family = "binomial")
summary(logistic.model)
## 
## Call:
## glm(formula = label ~ X, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.13907   0.00776   0.02143   0.05458   1.91459  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.7480     0.4652   1.608    0.108    
## X1           -1.6758     0.3665  -4.572 4.82e-06 ***
## X2           -2.1960     0.4685  -4.687 2.77e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 196.878  on 1019  degrees of freedom
## Residual deviance:  52.963  on 1017  degrees of freedom
## AIC: 58.963
## 
## Number of Fisher Scoring iterations: 9
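From the coefficient table, the fitted model is \(\hat{P}(\text{label}=1 \mid x) = \frac{1}{1 + e^{-(0.748 - 1.676 x_1 - 2.196 x_2)}}\), so the default 0.5 cutoff classifies a point as class 1 exactly when \(0.748 - 1.676 x_1 - 2.196 x_2 > 0\).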
# fitted probabilities on the training data; with the formula label ~ X the
# matrix X itself is used, so no newdata argument is needed
y_hat_prob = predict(logistic.model, type = "response")
# PRROC convention: scores.class0 holds scores of the positive (foreground)
# class, scores.class1 those of the negative (background) class; class 1 is
# the foreground here
scores.class0 = y_hat_prob[label == 1]
scores.class1 = y_hat_prob[label == 0]
pr1 = pr.curve(scores.class0, scores.class1, curve = T)
plot(pr1)

y_hat = y_hat_prob > 0.5
mean(y_hat == label)
## [1] 0.9911765

The training accuracy is, as you can see, pretty high. However, it doesn't mean much on its own, because even a trivial classifier can reach high accuracy on imbalanced data. Why do we care? Suppose one of our classes represents a rare event, such as a disease. In the population, the percentage of people who have the disease is very low, say 1%. You can't just say 'well, my classifier has 99% accuracy for detecting the disease': a classifier that declares everyone healthy scores the same. Accuracy is only one of many metrics; here we care more about recall, and we are willing to sacrifice precision for it. Why? We have to make sure we identify (almost) all potentially sick people. More false positives do less harm than more false negatives. The way to do this is to adjust the classification threshold to control the recall.
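A quick sanity check (a sketch reusing the label vector from above): predicting "healthy" for everyone gives almost the same accuracy while finding no sick people at all.

trivial_hat = rep(1, length(label))  # predict class 1 (healthy) for everyone
mean(trivial_hat == label)           # accuracy = 1000/1020 = 0.9804
sum(trivial_hat == 0 & label == 0)   # sick people identified: 0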

y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  13   2
##          1   7 998
##                                          
##                Accuracy : 0.9912         
##                  95% CI : (0.9833, 0.996)
##     No Information Rate : 0.9804         
##     P-Value [Acc > NIR] : 0.004687       
##                                          
##                   Kappa : 0.7385         
##                                          
##  Mcnemar's Test P-Value : 0.182422       
##                                          
##             Sensitivity : 0.65000        
##             Specificity : 0.99800        
##          Pos Pred Value : 0.86667        
##          Neg Pred Value : 0.99303        
##              Prevalence : 0.01961        
##          Detection Rate : 0.01275        
##    Detection Prevalence : 0.01471        
##       Balanced Accuracy : 0.82400        
##                                          
##        'Positive' Class : 0              
## 

In this case, 0 means sick people (positive) and 1 means healthy people (negative).

We have 20 diseased people in total.
recall = 0.65: among all sick people, we found only 65% of them, which means
we missed 35% of the sick. That is really bad.
precision = 13/15 = 0.8667: among the people we flag as sick, about 87% of them
are truly sick.

True positive: Sick people correctly identified as sick
False positive: Healthy people incorrectly identified as sick
True negative: Healthy people correctly identified as healthy
False negative: Sick people incorrectly identified as healthy
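In formulas, recall (sensitivity) = TP / (TP + FN) and precision = TP / (TP + FP); with the matrix above, recall = 13 / (13 + 7) = 0.65 and precision = 13 / (13 + 2) = 0.867.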

Hypothesis-testing analogy: is this person healthy? The null hypothesis: this person is healthy.
The alternative: this person is sick.

Type-I error: rejecting the null when the null is true.
The probability of a Type-I error is called the significance level, usually denoted by \(\alpha\). This is a threshold we pick and control; common choices are 0.05 or 0.01. The p-value is the probability, assuming the null is true, of observing a test statistic at least as extreme as the one actually observed. (We have no test statistic in this classification setting; the analogy is only to the two error types.)
Type-II error: failing to reject the null when it is false.
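Mapping this onto the 0.5-cutoff confusion matrix above: the 2 healthy people flagged as sick are Type-I errors (empirical rate 2/1000 = 0.002), and the 7 sick people cleared as healthy are Type-II errors (empirical rate 7/20 = 0.35, exactly 1 - recall).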

Now, let's try a different threshold.

# stricter threshold: require prob > 0.95 to call someone healthy
y_hat = y_hat_prob > 0.95
mean(y_hat == label)
## [1] 0.9794118
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  19  20
##          1   1 980
##                                           
##                Accuracy : 0.9794          
##                  95% CI : (0.9687, 0.9872)
##     No Information Rate : 0.9804          
##     P-Value [Acc > NIR] : 0.6445          
##                                           
##                   Kappa : 0.6346          
##                                           
##  Mcnemar's Test P-Value : 8.568e-05       
##                                           
##             Sensitivity : 0.95000         
##             Specificity : 0.98000         
##          Pos Pred Value : 0.48718         
##          Neg Pred Value : 0.99898         
##              Prevalence : 0.01961         
##          Detection Rate : 0.01863         
##    Detection Prevalence : 0.03824         
##       Balanced Accuracy : 0.96500         
##                                           
##        'Positive' Class : 0               
## 

Now our accuracy has decreased from 0.9911765 to 0.9794118.
recall = 0.95: among all sick people, we identified 95% of them.
precision = 0.48718: among the people we flag as sick, only about half are really sick.
For healthy people who are flagged as sick, we can run further medical tests. But if we miss potentially sick people, it is hard to get them back, and their disease worsens as the days go by.
We sacrificed precision in exchange for recall.
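To see the whole trade-off, we can sweep the cutoff (a minimal sketch reusing y_hat_prob and label from above; the printed numbers depend on the seed):

# recall and precision for the sick class (label 0) at several cutoffs
for (t in c(0.5, 0.8, 0.9, 0.95, 0.99)) {
  pred = as.integer(y_hat_prob > t)  # 1 = predicted healthy, 0 = predicted sick
  tp = sum(pred == 0 & label == 0)   # sick people correctly flagged
  recall = tp / sum(label == 0)
  precision = tp / sum(pred == 0)
  cat(sprintf("cutoff %.2f: recall %.3f, precision %.3f\n", t, recall, precision))
}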

Now, let’s see the effect of sample size

set.seed(2)

num_class1 = 100000
num_class2 = 2000
X = matrix(0, nrow = num_class1 + num_class2, ncol = 2)
mu1 = c(-2, -2)
sigma1 = diag(2)
mu2 = c(1, 1)
sigma2 = matrix(c(1, 0, 0, 2), 2, 2, byrow = T)

for (i in 1:num_class1) {
  obs = mvrnorm(1, mu1, sigma1)  # one draw from N(mu1, sigma1)
  X[i, ] = obs
}
for (i in (num_class1 + 1):(num_class1 + num_class2)) {
  obs = mvrnorm(1, mu2, sigma2)  # one draw from N(mu2, sigma2)
  X[i, ] = obs
}

label = c(rep(1,num_class1), rep(0,num_class2))

plot(X, col=label+2, main = "Original")

logistic.model = glm(label ~ X, family = "binomial")
summary(logistic.model)
## 
## Call:
## glm(formula = label ~ X, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6695   0.0014   0.0051   0.0180   2.9227  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.83972    0.05698   14.74   <2e-16 ***
## X1          -2.90964    0.07275  -40.00   <2e-16 ***
## X2          -2.24953    0.06125  -36.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 19687.8  on 101999  degrees of freedom
## Residual deviance:  2724.2  on 101997  degrees of freedom
## AIC: 2730.2
## 
## Number of Fisher Scoring iterations: 11
y_hat_prob = predict(logistic.model, type = "response")  # fitted training probabilities
scores.class0 = y_hat_prob[label == 1]
scores.class1 = y_hat_prob[label == 0]
pr1 = pr.curve(scores.class0, scores.class1, curve = T)
plot(pr1)

y_hat = y_hat_prob > 0.5
mean(y_hat == label)
## [1] 0.9956765
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0  1688   129
##          1   312 99871
##                                           
##                Accuracy : 0.9957          
##                  95% CI : (0.9953, 0.9961)
##     No Information Rate : 0.9804          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8823          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.84400         
##             Specificity : 0.99871         
##          Pos Pred Value : 0.92900         
##          Neg Pred Value : 0.99689         
##              Prevalence : 0.01961         
##          Detection Rate : 0.01655         
##    Detection Prevalence : 0.01781         
##       Balanced Accuracy : 0.92135         
##                                           
##        'Positive' Class : 0               
## 
# stricter threshold
y_hat = y_hat_prob > 0.95
mean(y_hat == label)
## [1] 0.9867549
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0  1903  1254
##          1    97 98746
##                                          
##                Accuracy : 0.9868         
##                  95% CI : (0.986, 0.9874)
##     No Information Rate : 0.9804         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7316         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.95150        
##             Specificity : 0.98746        
##          Pos Pred Value : 0.60279        
##          Neg Pred Value : 0.99902        
##              Prevalence : 0.01961        
##          Detection Rate : 0.01866        
##    Detection Prevalence : 0.03095        
##       Balanced Accuracy : 0.96948        
##                                          
##        'Positive' Class : 0              
## 

At the 0.5 cutoff, recall improved from 0.65 with 20 events to 0.844 with 2000 events, in line with the rare-events advice quoted at the end. Now let's split the data into training and test sets and look at the prediction power.

total_size = num_class1 + num_class2
train.index = sample(1:total_size, 0.8*total_size)
train.feature = X[train.index,]
train.label = label[train.index]

test.feature = X[-train.index,]
test.label = label[-train.index]
# check the ratio of two classes in train and test datasets
table(train.label)
## train.label
##     0     1 
##  1616 79984
table(test.label)
## test.label
##     0     1 
##   384 20016
# looks fine: both are around 2% positive (sick) class
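# note: sample() gives an unstratified split, so with rare classes the class
# ratio can drift between train and test. caret offers a stratified split
# (a sketch, not used below, to keep the numbers above reproducible):
# strat.index = createDataPartition(factor(label), p = 0.8, list = FALSE)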
logistic.model = glm(train.label ~ X1 + X2,
                     data=data.frame(train.feature),
                     family = "binomial")
summary(logistic.model)
## 
## Call:
## glm(formula = train.label ~ X1 + X2, family = "binomial", data = data.frame(train.feature))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6665   0.0014   0.0050   0.0180   2.9453  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.81650    0.06427   12.70   <2e-16 ***
## X1          -2.93452    0.08168  -35.93   <2e-16 ***
## X2          -2.24043    0.06838  -32.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15875.3  on 81599  degrees of freedom
## Residual deviance:  2179.2  on 81597  degrees of freedom
## AIC: 2185.2
## 
## Number of Fisher Scoring iterations: 11
y_hat_prob = predict(logistic.model, data.frame(train.feature), type = "response")
scores.class0 = y_hat_prob[train.label == 1]
scores.class1 = y_hat_prob[train.label == 0]
pr1 = pr.curve(scores.class0, scores.class1, curve = T)
plot(pr1)

y_hat = y_hat_prob > 0.5
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(train.label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0  1368   103
##          1   248 79881
##                                           
##                Accuracy : 0.9957          
##                  95% CI : (0.9952, 0.9961)
##     No Information Rate : 0.9802          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8841          
##                                           
##  Mcnemar's Test P-Value : 1.516e-14       
##                                           
##             Sensitivity : 0.84653         
##             Specificity : 0.99871         
##          Pos Pred Value : 0.92998         
##          Neg Pred Value : 0.99690         
##              Prevalence : 0.01980         
##          Detection Rate : 0.01676         
##    Detection Prevalence : 0.01803         
##       Balanced Accuracy : 0.92262         
##                                           
##        'Positive' Class : 0               
## 

The 0.5 threshold gives us recall = 0.84653 and precision = 0.92998.

y_hat = y_hat_prob > 0.95

y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(train.label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0  1537   994
##          1    79 78990
##                                          
##                Accuracy : 0.9869         
##                  95% CI : (0.986, 0.9876)
##     No Information Rate : 0.9802         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7348         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.95111        
##             Specificity : 0.98757        
##          Pos Pred Value : 0.60727        
##          Neg Pred Value : 0.99900        
##              Prevalence : 0.01980        
##          Detection Rate : 0.01884        
##    Detection Prevalence : 0.03102        
##       Balanced Accuracy : 0.96934        
##                                          
##        'Positive' Class : 0              
## 

With cutoff = 0.95, recall = 0.95111 and precision = 0.60727.
Remember, our task is to correctly identify as many sick people as possible, so we want high recall.
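If we have a target recall in mind, we can choose the smallest cutoff that reaches it on the training data (a minimal sketch reusing y_hat_prob and train.label from above):

# a person is predicted sick when the fitted P(healthy) is at or below the
# cutoff, so recall at cutoff t = fraction of truly sick with score <= t
target = 0.95
cutoffs = seq(0.50, 0.99, by = 0.01)
recalls = sapply(cutoffs, function(t) mean(y_hat_prob[train.label == 0] <= t))
cutoffs[which(recalls >= target)[1]]  # smallest cutoff reaching the target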

To test this cutoff on the held-out data:

y_hat_prob = predict(logistic.model, newdata=data.frame(test.feature), type = "response")

y_hat = y_hat_prob > 0.95

y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(test.label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   368   272
##          1    16 19744
##                                           
##                Accuracy : 0.9859          
##                  95% CI : (0.9842, 0.9875)
##     No Information Rate : 0.9812          
##     P-Value [Acc > NIR] : 1.406e-07       
##                                           
##                   Kappa : 0.712           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.95833         
##             Specificity : 0.98641         
##          Pos Pred Value : 0.57500         
##          Neg Pred Value : 0.99919         
##              Prevalence : 0.01882         
##          Detection Rate : 0.01804         
##    Detection Prevalence : 0.03137         
##       Balanced Accuracy : 0.97237         
##                                           
##        'Positive' Class : 0               
## 

So on the test dataset we still get 0.95833 recall, which is good.
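We can recover recall and precision by hand from the matrix above:

368 / (368 + 16)    # recall    = sick found / all sick in test  = 0.95833
368 / (368 + 272)   # precision = sick found / everyone flagged  = 0.575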

Reference: https://statisticalhorizons.com/logistic-regression-for-rare-events. Sample size matters:

"If you have a sample size of 1000 but only 20 events, you have a problem. If you have a sample size of 10,000 with 200 events, you may be OK. If your sample has 100,000 cases with 2000 events, you're golden."

https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression

"The problem is not that the classes are imbalanced per se, it is that there may not be sufficient patterns belonging to the minority class to adequately represent its distribution. This means that the problem can arise for any classifier (even if you have a synthetic problem and you know you have the true model), not just logistic regression. The good thing is that as more data become available, the 'class imbalance' problem usually goes away."

Now let's try a smaller sample size.

set.seed(2)

num_class1 = 10000
num_class2 = 200
X = matrix(0, nrow = num_class1 + num_class2, ncol = 2)
mu1 = c(-2, -2)
sigma1 = diag(2)
mu2 = c(1, 1)
sigma2 = matrix(c(1, 0, 0, 2), 2, 2, byrow = T)

for (i in 1:num_class1) {
  obs = mvrnorm(1, mu1, sigma1)  # one draw from N(mu1, sigma1)
  X[i, ] = obs
}
for (i in (num_class1 + 1):(num_class1 + num_class2)) {
  obs = mvrnorm(1, mu2, sigma2)  # one draw from N(mu2, sigma2)
  X[i, ] = obs
}

label = c(rep(1,num_class1), rep(0,num_class2))

plot(X, col=label+2, main = "Original")

Now let's split the data into training and test sets and check the prediction power.

total_size = num_class1 + num_class2
train.index = sample(1:total_size, 0.8*total_size)
train.feature = X[train.index,]
train.label = label[train.index]

test.feature = X[-train.index,]
test.label = label[-train.index]
# check the ratio of two classes in train and test datasets
table(train.label)
## train.label
##    0    1 
##  161 7999
table(test.label)
## test.label
##    0    1 
##   39 2001
# looks fine: both are around 2% positive (sick) class
logistic.model = glm(train.label ~ X1 + X2,
                     data=data.frame(train.feature),
                     family = "binomial")
summary(logistic.model)
## 
## Call:
## glm(formula = train.label ~ X1 + X2, family = "binomial", data = data.frame(train.feature))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6963   0.0016   0.0055   0.0188   2.4608  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.8350     0.1913   4.365 1.27e-05 ***
## X1           -2.6488     0.2391 -11.077  < 2e-16 ***
## X2           -2.4457     0.2219 -11.020  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1582.84  on 8159  degrees of freedom
## Residual deviance:  225.57  on 8157  degrees of freedom
## AIC: 231.57
## 
## Number of Fisher Scoring iterations: 11
y_hat_prob = predict(logistic.model, data.frame(train.feature), type = "response")
scores.class0 = y_hat_prob[train.label == 1]
scores.class1 = y_hat_prob[train.label == 0]
pr1 = pr.curve(scores.class0, scores.class1, curve = T)
plot(pr1)

y_hat = y_hat_prob > 0.5
y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(train.label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  136   11
##          1   25 7988
##                                           
##                Accuracy : 0.9956          
##                  95% CI : (0.9939, 0.9969)
##     No Information Rate : 0.9803          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.8809          
##                                           
##  Mcnemar's Test P-Value : 0.03026         
##                                           
##             Sensitivity : 0.84472         
##             Specificity : 0.99862         
##          Pos Pred Value : 0.92517         
##          Neg Pred Value : 0.99688         
##              Prevalence : 0.01973         
##          Detection Rate : 0.01667         
##    Detection Prevalence : 0.01801         
##       Balanced Accuracy : 0.92167         
##                                           
##        'Positive' Class : 0               
## 
y_hat = y_hat_prob > 0.95

y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(train.label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  153  102
##          1    8 7897
##                                           
##                Accuracy : 0.9865          
##                  95% CI : (0.9838, 0.9889)
##     No Information Rate : 0.9803          
##     P-Value [Acc > NIR] : 1.093e-05       
##                                           
##                   Kappa : 0.729           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.95031         
##             Specificity : 0.98725         
##          Pos Pred Value : 0.60000         
##          Neg Pred Value : 0.99899         
##              Prevalence : 0.01973         
##          Detection Rate : 0.01875         
##    Detection Prevalence : 0.03125         
##       Balanced Accuracy : 0.96878         
##                                           
##        'Positive' Class : 0               
## 
y_hat_prob = predict(logistic.model, newdata=data.frame(test.feature), type = "response")

y_hat = y_hat_prob > 0.95

y_hat = as.integer(y_hat)  # convert logical TRUE/FALSE to 1/0
confusionMatrix(as.factor(y_hat), as.factor(test.label))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0   37   26
##          1    2 1975
##                                           
##                Accuracy : 0.9863          
##                  95% CI : (0.9802, 0.9909)
##     No Information Rate : 0.9809          
##     P-Value [Acc > NIR] : 0.03964         
##                                           
##                   Kappa : 0.7189          
##                                           
##  Mcnemar's Test P-Value : 1.383e-05       
##                                           
##             Sensitivity : 0.94872         
##             Specificity : 0.98701         
##          Pos Pred Value : 0.58730         
##          Neg Pred Value : 0.99899         
##              Prevalence : 0.01912         
##          Detection Rate : 0.01814         
##    Detection Prevalence : 0.03088         
##       Balanced Accuracy : 0.96786         
##                                           
##        'Positive' Class : 0               
##