HARAM on RCV1 training, Dense and Sparse implementation comparison in C++

April 12, 2016

I was comparing liblinear to the HARAM approach, and the Python implementation (numpy for the dense case, scipy for the sparse case) was relatively slow; even on large datasets it was at best only comparable.
The algorithm is pretty simple and can be explained with hyperboxes that grow. Growing is done by taking the minimum of the hyperbox's lower-left corner and the sample, and the maximum of the sample and the upper-right corner. The latter can also be translated into a minimum (using 1-x), so it is mainly about taking minimums. Yet, for a sparse implementation there is a better alternative, which was explained in the paper.
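To make this concrete, here is a minimal dense sketch of the growing step (the function and variable names are mine, not from the actual HARAM code, and the choice/match computations are left out):

#include <vector>
#include <algorithm>

// Grow a hyperbox so that it encloses a new sample:
// the lower corner moves towards the sample (element-wise minimum),
// the upper corner moves with it (element-wise maximum).
// With complement coding, i.e. storing 1 - upper, the maximum can also be
// written as a minimum, since 1 - max(u, x) = min(1 - u, 1 - x).
void grow_hyperbox(std::vector<double>& lower,
                   std::vector<double>& upper,
                   const std::vector<double>& sample) {
    for (size_t i = 0; i < sample.size(); ++i) {
        lower[i] = std::min(lower[i], sample[i]);
        upper[i] = std::max(upper[i], sample[i]);
    }
}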
The numpy/scipy part was mainly the calculation of the minimum, which is pretty easy to implement. I went even further and used Cython to implement that part for the sparse mode, since there is a good optimization when traversing the index arrays for the minimum and maximum.
I decided to implement the algorithm in c++ and see what could be optimized.
Suffice it to say that the dense C++ implementation was about 10 times faster on sparse data than the scipy/Cython one.
I also planned a sparse version, which means not using simple double (float) arrays but two vectors, one for the indices and one for the values.
Interestingly, the minimum and maximum over sparse vectors can be optimized if some assumptions are made: the lower bound is most likely zero, so only the intersection of the nonzero values has to be considered, and the upper bound of the neuron is denser than the sample it is compared with, so caching the sum of the upper bound saves a considerable number of calculations.


//a is the neuron lower bound, b is the sample
double minimum_AB_sum(vector<double> aData, vector<int> aInd,
                      vector<double> bData, vector<int> bInd){
    double sumv = 0;
    int ia = 0;
    int na = aInd.size();
    int ib = 0;
    int nb = bInd.size();

    //walk both index vectors; only indices present in both contribute,
    //since the minimum with an implicit zero is zero for non-overlapping entries
    while (ia < na && ib < nb){
        if (aInd.at(ia) > bInd.at(ib)){
            ib++;
        }
        else if (aInd.at(ia) < bInd.at(ib)){
            ia++;
        }
        else{
            double aValue = aData.at(ia);
            double bValue = bData.at(ib);
            ib++;
            ia++;
            if (aValue < bValue){
                sumv += aValue;
            }
            else{
                sumv += bValue;
            }
        }
    }

    return sumv;
}

This simple while loop sums the minimum over the intersection of the index vectors aInd and bInd.

double maximum_AB_sum_fast(vector<double> aData, vector<int> aInd, double sumA,
                           vector<double> bData, vector<int> bInd){
    //A is the neuron upper bound with many indices, B is the sample with few indices;
    //sumA is the precomputed sum of A's values
    int ia = 0;
    int na = aInd.size();
    int ib = 0;
    int nb = bInd.size();
    double sumB = 0;

    //intersection: where both are nonzero, replace A's value with B's if B is larger;
    //indices only in A are already counted in sumA, indices only in B are collected in sumB
    while (ia < na && ib < nb){
        if (aInd.at(ia) > bInd.at(ib)){
            //index only in B
            sumB += bData.at(ib);
            ib++;
        }
        else if (aInd.at(ia) < bInd.at(ib)){
            ia++;
        }
        else{
            double aValue = aData.at(ia);
            double bValue = bData.at(ib);
            if (bValue > aValue){
                sumA += bValue;
                sumA -= aValue;
            }
            ib++;
            ia++;
        }
    }
    //values only in B, after A is exhausted
    while (ib < nb){
        sumB += bData.at(ib);
        ib++;
    }

    return sumA + sumB;
}
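As a quick sanity check (my own snippet, assuming the two routines above are compiled in the same file), both can be called on small hand-made sparse vectors:

#include <iostream>
#include <vector>
using namespace std;

int main(){
    //neuron A: nonzeros at indices 0, 2, 3 with values 0.5, 0.4, 0.9
    vector<double> aData = {0.5, 0.4, 0.9};
    vector<int> aInd = {0, 2, 3};
    //sample B: nonzeros at indices 2, 3, 5 with values 0.7, 0.2, 0.3
    vector<double> bData = {0.7, 0.2, 0.3};
    vector<int> bInd = {2, 3, 5};
    double sumA = 0.5 + 0.4 + 0.9; //precomputed sum of the neuron values

    //minimum over the intersection {2, 3}: 0.4 + 0.2 = 0.6
    cout << minimum_AB_sum(aData, aInd, bData, bInd) << endl;
    //maximum over the union {0, 2, 3, 5}: 0.5 + 0.7 + 0.9 + 0.3 = 2.4
    cout << maximum_AB_sum_fast(aData, aInd, sumA, bData, bInd) << endl;
    return 0;
}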

This is a pretty fast solution, but there is still a big mistake, which comes from the fact that I am more comfortable with Java.
The vectors need to be passed by reference, i.e. double maximum_AB_sum_fast(vector<double> &aData, vector<int> &aInd, …), because otherwise C++ copies the whole vector instead of just passing a reference.
This means that for RCV1 training (23149 samples) with 20835 features and 103 labels, it took 1.2 seconds to train and 6.8 s to test (2314 samples).
Using references it took 0.8 s to train and 4.1 s to test.
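For reference, the corrected signatures look roughly like this (const references, since neither routine modifies its input vectors; a sketch of the fix rather than the exact code I used):

//passing by (const) reference avoids copying the index and data vectors on every call
double minimum_AB_sum(const vector<double> &aData, const vector<int> &aInd,
                      const vector<double> &bData, const vector<int> &bInd);

double maximum_AB_sum_fast(const vector<double> &aData, const vector<int> &aInd,
                           double sumA,
                           const vector<double> &bData, const vector<int> &bInd);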
This problem was massive when, instead of the two vectors (one for the indices, one for the data), I tried hash mappings.
Passed by value they were massively slow; with references they were much faster, though still not as fast as the routine above.
The unordered map was the faster of the two, with a training time of 1.2 s and a test time of 9.3 s; the ordered map took 2.6 s to train and 21.4 s to test.
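For illustration, a hash-map variant of the minimum routine could look like the following (my own reconstruction, assuming std::unordered_map<int, double> keyed by feature index, not the exact code I benchmarked):

#include <unordered_map>
#include <algorithm>
using namespace std;

//a is the neuron lower bound, b is the sample; both map feature index -> value
double minimum_AB_sum_map(const unordered_map<int, double> &a,
                          const unordered_map<int, double> &b){
    double sumv = 0;
    //iterate over the smaller map and look each index up in the larger one
    const unordered_map<int, double> &smaller = (a.size() < b.size()) ? a : b;
    const unordered_map<int, double> &larger = (a.size() < b.size()) ? b : a;
    for (const auto &kv : smaller){
        auto it = larger.find(kv.first);
        if (it != larger.end()){
            sumv += min(kv.second, it->second);
        }
    }
    return sumv;
}

Even with references, the per-element hashed lookups are slower than the sorted merge over the two index vectors.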
This also supports the findings of this comparison.
On the whole dataset, with 780k training samples and 23149 test samples, it took about x seconds to train and 217 s to test;
liblinear needed only 6.7 s. Still, comparing the number of computations, the code is close to the efficiency of liblinear:
ML-HARAM needs the minimum operation for the lower and the upper bound, which on average requires about 2884 operations per sample and non-zero feature ($2 \cdot (354\ \text{clusters} + 17 \cdot \frac{22684\ \text{neurons}}{354\ \text{clusters}}) \approx 2884$). Since the SVM is used in binary relevance (BR), it does its pass 103 times, once per label, so the difference amounts to a factor of about 28 (2884/103). 6.3 s * 28 is 176.4 s, which is close to the 217.8 s that ML-HARAM needed.
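Spelled out as one chain, the estimate reads:

$2 \cdot \left(354 + 17 \cdot \frac{22684}{354}\right) \approx 2884 \ \text{operations per sample and non-zero feature}, \qquad \frac{2884}{103} \approx 28, \qquad 6.3\,\text{s} \cdot 28 \approx 176.4\,\text{s}$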
