I was recently trying to solve some problems on Kaggle, the statistical analysis and predictive modeling website. The problem description can be found here:
Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
So I tried a few different techniques, such as a naive Bayes classifier, artificial neural networks, and support vector machines, but the best result I got was with random forests. From the papers I've read, random forests and boosted decision trees are the state of the art in machine learning these days.
Following is the R script I used to generate the solution.
library("randomForest")

# Load the training data and log-transform the heavy-tailed columns.
# The +2 offset skips the row-id and label columns.
training <- read.csv("cs-training.csv")
cols <- c(1,3,4,5,7,8,9)+2
training[,cols] <- log(training[,cols]+1)

# Drop the row id (1), the label (2), and the two columns containing
# missing values (7 and 12), since randomForest does not accept NAs.
RF <- randomForest(training[,-c(1,2,7,12)], factor(training$SeriousDlqin2yrs),
                   sampsize=c(20000), do.trace=TRUE, importance=TRUE,
                   ntree=5000, classwt=c(.06,.94), keep.forest=TRUE)

# Apply the same transform to the test set and predict the probability
# of class "1" (the competition is scored on AUC, so the submission
# should contain probabilities rather than class labels).
test <- read.csv("cs-test.csv")
test[,cols] <- log(test[,cols]+1)
pred <- data.frame(predict(RF, test[,-c(1,2,7,12)], type="prob")[,"1"])
names(pred) <- "SeriousDlqin2yrs"
write.csv(pred, file="sampleEntry8.csv")
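Since the fit sets importance=TRUE, it is also worth glancing at which predictors the forest actually relies on. A quick way to do that with the fitted model above (a sketch using the randomForest package's own accessors):

```r
# Variable importance from the fitted forest: mean decrease in
# accuracy and in Gini impurity, per predictor.
importance(RF)
# The same measures as a dot chart.
varImpPlot(RF)
```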
Although random forests do not need normalized inputs, I had already applied the log transform for the naive Bayes and ANN classifiers, so I left it in. This gave me a result quite close to the top of the leaderboard: an AUC difference of only about 0.004 from the first-place entry.
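Before submitting, the AUC can be estimated locally from the forest's out-of-bag votes, with no separate validation split. This is a sketch that assumes the script above has been run and that the label is coded 0/1 as in the competition data; it uses the rank-sum formulation of AUC to avoid extra package dependencies.

```r
# Out-of-bag probability of class "1" for each training row.
p <- RF$votes[, "1"]
y <- as.numeric(as.character(training$SeriousDlqin2yrs))

# AUC via the Mann-Whitney rank-sum statistic: the probability that a
# randomly chosen positive is ranked above a randomly chosen negative.
r <- rank(p)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```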