Convert CUDA-specific code to kernel utilities (kutil) abstractions
It shouldn't be too hard to convert the CUDA-specific code to kutil functors. The code doesn't use any advanced CUDA features; it's all very simple kernels.
The only exception is that I use Thrust for a few things (like sorting), which may be annoying to port.
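
To make the idea concrete, here is a rough sketch of what the conversion might look like for one simple element-wise kernel. This is not the kutil API: `ScaleFunctor`, `for_each_index`, and `sort_values` are hypothetical names made up for illustration, and the launcher is just a serial CPU loop standing in for whatever kutil actually provides. The sort wrapper uses `std::sort` as a placeholder for where a `thrust::sort` call would sit on the GPU path.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical functor: the body of a simple element-wise CUDA kernel,
// with the element index passed in as an argument instead of being
// computed from blockIdx/threadIdx inside the kernel.
struct ScaleFunctor {
    float* data;   // array to scale in place
    float factor;  // multiplier applied to every element

    void operator()(std::size_t i) const { data[i] *= factor; }
};

// Hypothetical stand-in for a backend launcher (NOT the real kutil API).
// This version is a serial CPU loop; a GPU backend would instead run
// the functor once per index from a kernel.
template <typename Functor>
void for_each_index(std::size_t n, Functor f) {
    for (std::size_t i = 0; i < n; ++i) {
        f(i);
    }
}

// Hypothetical wrapper that keeps the sort behind a single function,
// so only this one place has to change per backend.
// std::sort here is a CPU placeholder for the existing thrust::sort call.
inline void sort_values(std::vector<float>& values) {
    std::sort(values.begin(), values.end());
}
```

With something like this, a call site would look roughly like `for_each_index(v.size(), ScaleFunctor{v.data(), 2.0f});` followed by `sort_values(v);`. Keeping the Thrust usage behind small wrappers like `sort_values` is one possible way to limit how much of the code depends on it.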