I was recently trying to solve some problems on Kaggle, the statistical analysis and predictive modeling website. The problem description can be found here:
Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.
So I tried a few different techniques, such as a naive Bayes classifier, artificial neural networks, and support vector machines, but the best result I got was with random forests. From the papers I've read, random forests and boosted decision trees are the state of the art in machine learning these days.
Following is the R script I used to generate the solution.
library("randomForest")

# Load the training data and log-transform the heavy-tailed columns.
# The +2 offset skips the row-id and label columns.
training <- read.csv("cs-training.csv")
cols <- c(1,3,4,5,7,8,9)+2
training[,cols] <- log(training[,cols]+1)

# Drop the row id (1), the label (2), and the two columns containing
# missing values (7 and 12), since randomForest does not accept NAs.
RF <- randomForest(training[,-c(1,2,7,12)], factor(training$SeriousDlqin2yrs),
                   sampsize=c(20000), do.trace=TRUE, importance=TRUE,
                   ntree=5000, classwt=c(.06,.94), keep.forest=TRUE)

# Apply the same transform to the test set and predict the probability
# of class "1" (the competition is scored on AUC, so the submission
# should contain probabilities rather than class labels).
test <- read.csv("cs-test.csv")
test[,cols] <- log(test[,cols]+1)
pred <- data.frame(predict(RF, test[,-c(1,2,7,12)], type="prob")[,"1"])
names(pred) <- "SeriousDlqin2yrs"
write.csv(pred, file="sampleEntry8.csv")
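Since the fit sets importance=TRUE, it is also worth glancing at which predictors the forest actually relies on. A quick way to do that with the fitted model above (a sketch using the randomForest package's own accessors):

```r
# Variable importance from the fitted forest: mean decrease in
# accuracy and in Gini impurity, per predictor.
importance(RF)
# The same measures as a dot chart.
varImpPlot(RF)
```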
Although random forests do not need normalized inputs, I had already applied the log transform for the naive Bayes and ANN classifiers, so I left it in. This gave me a result quite close to the top of the leaderboard: an AUC difference of only about 0.004 from the first-place entry.
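Before submitting, the AUC can be estimated locally from the forest's out-of-bag votes, with no separate validation split. This is a sketch that assumes the script above has been run and that the label is coded 0/1 as in the competition data; it uses the rank-sum formulation of AUC to avoid extra package dependencies.

```r
# Out-of-bag probability of class "1" for each training row.
p <- RF$votes[, "1"]
y <- as.numeric(as.character(training$SeriousDlqin2yrs))

# AUC via the Mann-Whitney rank-sum statistic: the probability that a
# randomly chosen positive is ranked above a randomly chosen negative.
r <- rank(p)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```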