Tuesday, June 10, 2014

Comparing k-NN in Factor

Recently a pair of blog posts compared implementations of a k-nearest neighbour (k-NN) classifier in F# and OCaml. Subsequently an implementation showing performance in Rust got my attention and I thought it might be nice to demonstrate a version in Factor.

The first OCaml version is 30 lines of code and takes 21 seconds on my laptop:

$ sloccount classifyDigits.ml
ml:              30 (100.00%)

$ time ./classifyDigits
Percentage correct:94.400000

real 0m21.292s
user 0m21.152s
sys 0m0.120s

The second OCaml version is 47 lines of code and takes 12 seconds:

$ sloccount classifyDigitsArray.ml
ml:              47 (100.00%)

$ time ./classifyDigitsArray 
Percentage correct:94.400000

real 0m12.563s
user 0m12.434s
sys 0m0.120s

Note: I couldn't get the parallel version to run, but would assume it to have the same 2x speedup that the author saw.

Simple

It is often useful to start with the simplest possible code before trying to optimize for performance. I decided to parse the training and validation files (containing comma-separated values, the first of which is the label and the subsequent values are observations) into an array of arrays.

: slurp-file ( path -- {pixels,label} )
    ascii file-lines rest [
        "," split [ string>number ] map unclip 2array
    ] map ;

: classify ( training pixels -- label )
    '[ first _ distance ] infimum-by second ;

: k-nn ( -- )
    "~/trainingsample.csv" slurp-file
    "~/validationsample.csv" slurp-file
    [ [ first2 [ classify ] [ = ] bi* ] with count ]
    [ length ] bi / 100.0 * "Percentage correct: %.1f\n" printf ;

You can see that it produces the desired output of 94.4% correct, and takes about 40 seconds on my laptop.

IN: scratchpad gc [ k-nn ] time
Percentage correct: 94.4
Running time: 40.283777984 seconds

Not too bad for 11 lines of simple code, but slower than it could be. Much of the performance penalty in this version is due to the large amount of generic dispatch, which is something we hope to reduce in future versions of Factor.

Faster

I noticed all the observed values were in the range [0-255], so thought a simple speedup might be to store them in a byte-array, and instead of using the builtin distance word, make my own that specializes on byte-arrays.

: slurp-file ( path -- {pixels,label} )
    ascii file-lines rest [
        "," split [ string>number ] B{ } map-as unclip 2array
    ] map ;

: distance ( x y -- z )
    { byte-array byte-array } declare 0 [ - sq + ] 2reduce ;

: classify ( training pixels -- label )
    '[ first _ distance ] infimum-by second ;

: k-nn ( -- )
    "~/trainingsample.csv" slurp-file
    "~/validationsample.csv" slurp-file
    [ [ first2 [ classify ] [ = ] bi* ] with count ]
    [ length ] bi / 100.0 * "Percentage correct: %f\n" printf ;

With that simple change, we get 7x faster than our previous version and roughly as fast as the fastest parallel OCaml version!

IN: scratchpad gc [ k-nn ] time
Percentage correct: 94.400000
Running time: 5.708627884 seconds

The Rust version requires a nightly build and I haven't had a chance to test it, but I assume it is a bit faster, and discussions on r/rust, r/programming, and Hacker News show some fast versions in C++ and D as well.

The code for this is in my GitHub.

No comments: