Wednesday, February 8, 2012

Credit default probability in R

I was recently trying to solve some problems on Kaggle, the statistical analysis and predictive modeling website. The problem description can be found here:


Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

So I tried a few different techniques, including a naive Bayes classifier, artificial neural networks, and a support vector machine, but the best result I got was with random forests. From the papers I've read, random forests and boosted decision trees are the state of the art in machine learning these days.

Following is the R script I used to generate the solution.


library("randomForest")

training <- read.csv("cs-training.csv")

# Log-transform the heavily skewed numeric columns; the +2 offset skips
# the row-id and target columns at the start of the file
cols <- c(1, 3, 4, 5, 7, 8, 9) + 2
training[, cols] <- log(training[, cols] + 1)

# Predictors exclude the row id (1), the target (2), and the two columns
# with missing values (7 and 12). Each tree is grown on a 20,000-row
# sample, with class weights to counter the heavy class imbalance.
RF <- randomForest(training[, -c(1, 2, 7, 12)],
                   factor(training$SeriousDlqin2yrs),
                   sampsize = c(20000), do.trace = TRUE, importance = TRUE,
                   ntree = 5000, classwt = c(0.06, 0.94), forest = TRUE)

# Apply the same log transform to the test set before predicting
test <- read.csv("cs-test.csv")
test[, cols] <- log(test[, cols] + 1)

pred <- data.frame(predict(RF, test[, -c(1, 2, 7, 12)]))
names(pred) <- "SeriousDlqin2yrs"

write.csv(pred, file = "sampleEntry8.csv")
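The log(x + 1) step in the script compresses the long right tails of the monetary and count features. Base R's log1p() computes the same quantity and is more accurate for values very close to zero; a quick check:

```r
x <- c(0, 9, 99, 1e-12)

# log1p(x) is mathematically log(x + 1) but avoids the precision loss
# of adding 1 to a tiny x before taking the log
stopifnot(isTRUE(all.equal(log1p(x[1:3]), log(x[1:3] + 1))))
print(log1p(c(0, 9, 99)))  # 0, log(10), log(100)
```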

Although random forests do not need normalized data, I had already applied the log transform for the Bayes and ANN classifiers before trying random forests, so I left it in. This gave me a result quite close to the top of the leaderboard: an AUC difference of only about 0.004 from the top position.
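Since the leaderboard is scored by AUC, submitting the positive-class probability usually beats submitting hard 0/1 labels, and randomForest can return per-class probabilities via the type = "prob" argument of predict(). A minimal, self-contained sketch on a toy dataset (assuming the randomForest package is installed; iris stands in for the competition data here):

```r
library(randomForest)

set.seed(1)
# Any factor response works; iris is just a stand-in for the real data
fit <- randomForest(Species ~ ., data = iris, ntree = 50)

# type = "prob" yields an n x k matrix of class probabilities;
# each row sums to 1, and a single class's column can be submitted
p <- predict(fit, iris, type = "prob")
stopifnot(all(abs(rowSums(p) - 1) < 1e-6))
head(p)
```

For the competition script, the analogous call would take the probability column for class "1" instead of the default class label.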
