Friday, February 24, 2012

Prime numbers and Prime factorization in C++

I was working on one of the math problems from Project Euler and wrote some useful little snippets of code that might be useful to someone else as well. One of them is a simple class that generates prime numbers in sequence (you often need them in sequence when looking for prime factors). The following piece of C++ code returns the next prime number each time operator() is called.


class PrimeGenerator {
private:
    int current;

    // Trial division: n is prime if no integer i with i*i <= n divides it.
    bool prime(int n) {
        for (int i = 2; i * i <= n; i++) {
            if (n % i == 0) return false;
        }
        return true;
    }

public:
    PrimeGenerator() : current(1) {}

    // Each call advances past the current value and returns the next prime: 2, 3, 5, ...
    int operator()() {
        while (!prime(++current));
        return current;
    }
};

The way to use it in code is given below.

PrimeGenerator generator;
for (int i = 0; i < 20; i++) {
    cout << generator() << ", ";  // prints 2, 3, 5, ..., 71,
}


I used the class in a function to find prime factors as given below:


vector<int> primeFactors(int n) {
    vector<int> result;
    if (n < 2) return result;  // guard: 0, 1, and negatives have no prime factors here

    PrimeGenerator generator;
    int primeFactor;
    int remainder = n;
    do {
        primeFactor = generator();
        // Divide out each prime completely before moving on to the next one.
        while (remainder % primeFactor == 0) {
            result.push_back(primeFactor);
            remainder /= primeFactor;
        }
    } while (remainder != 1);
    return result;
}


Prime factorization is a common task in many number theory problems. There are no watertight guarantees for this code, but feel free to use and modify it as you like. It is just a naive implementation, and I have not checked all the bounds and edge cases. Do post a comment if you have suggestions to improve it though :)

Wednesday, February 8, 2012

Credit default probability in R

I was recently trying to solve some problems on Kaggle, the statistical analysis and predictive modeling website. The problem description can be found here:


Improve on the state of the art in credit scoring by predicting the probability that somebody will experience financial distress in the next two years.

So I tried a few different techniques, like a simple naive Bayes classifier, artificial neural networks, and support vector machines, but the best result I got was with random forests. From the papers I've read, random forests and boosted decision trees are among the state of the art in machine learning these days.

The following is the R script I used to generate the solution.


library("randomForest")

training <- read.csv("cs-training.csv")

# Log-transform the skewed numeric columns (the +2 offsets past the
# row-id and target columns at the front of the data frame)
cols <- c(1,3,4,5,7,8,9) + 2
training[,cols] <- log(training[,cols] + 1)

# Fit a random forest, dropping columns 1, 2, 7, and 12 (the row id, the
# target, and two predictors) from the feature set; classwt counters the
# heavily imbalanced target
RF <- randomForest(training[,-c(1,2,7,12)], factor(training$SeriousDlqin2yrs),
                   sampsize=c(20000), do.trace=TRUE, importance=TRUE,
                   ntree=5000, classwt=c(.06,.94), forest=TRUE)

# Apply the same transform to the test set and write the submission file
test <- read.csv("cs-test.csv")
test[,cols] <- log(test[,cols] + 1)

pred <- data.frame(predict(RF, test[,-c(1,2,7,12)]))
names(pred) <- "SeriousDlqin2yrs"

write.csv(pred, file="sampleEntry8.csv")

Although random forests do not need normalized data, I had already applied the log transform for the Bayes and ANN classifiers before trying random forests, so I left it in. This gave me a result quite close to the top of the leaderboard: an AUC difference of only about 0.004 from the top position.