EDIT: There was a bug in the final version of the code presented here. It is fixed now; for its backstory, check out my blog post on it.

When we last saw our hero, he was fighting with the dreaded implementation of least-angle regression, knowing full well that it was his destiny to be faster.

We had come up with a more robust implementation, one that catches malformed cases that would have broken the naive implementation, while also being orders of magnitude faster than it. However, our benchmark [1] showed that it was still a couple of times slower than least-angle regression.

By poking around the `scikits.learn` codebase, I noticed that there is a triangular system solver in `scikits.learn.utils.arrayfuncs`. Unlike the `scipy.linalg` one, this one only works with lower triangular arrays, and it forcefully overwrites `b`. Even if it weren't faster, it should still be used: `scikits.learn` aims to be as backwards-compatible with SciPy as possible, and `linalg.solve_triangular` was only added in SciPy 0.9.0. Anyway, let's just see whether it's faster:

```
In [1]: import numpy as np

In [2]: from scipy import linalg

In [3]: from scikits.learn.datasets import make_spd_matrix

In [4]: from scikits.learn.utils.arrayfuncs import solve_triangular

In [5]: G = make_spd_matrix(1000)

In [6]: L = linalg.cholesky(G, lower=True)

In [7]: x = np.random.randn(1000)

In [8]: y = x.copy()

In [9]: timeit solve_triangular(L, x)
100 loops, best of 3: 3.45 ms per loop

In [10]: timeit linalg.solve_triangular(L, y, lower=True, overwrite_b=True)
10 loops, best of 3: 134 ms per loop
```

Wow! That's 40x faster. We're catching two rabbits with one stone here, so let's make the change! Notice that we can just copy into the appropriate place in `L` and then solve in place.
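
Concretely, the pattern looks something like this (a minimal sketch with made-up shapes; `L` and `n_active` mirror the names used in the OMP code further down):

```python
import numpy as np
from scikits.learn.utils.arrayfuncs import solve_triangular

max_features = 8
L = np.empty((max_features, max_features))
L[0, 0] = 1.
n_active = 1

# stand-in for the inner products of the newly selected atom with the
# already-active ones, i.e. np.dot(X[:, idx].T, X[:, lam]) in the code below
new_row = np.random.randn(n_active)

# copy into the appropriate place in L ...
L[n_active, :n_active] = new_row
# ... and solve the lower-triangular system in place: the solution
# overwrites the row we just filled in, with no temporary arrays
solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
```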

But whoops! When solving the system, we take advantage of the `trans` argument of `linalg.solve_triangular`, which the `scikits.learn` version does not expose. We could think of a workaround, but here's a better idea: shouldn't there be some way to directly solve the entire system in one go?

Well, there is. It is a LAPACK function by the name of `potrs`. If you are not aware, LAPACK is a Fortran library with solvers for various types of linear systems and eigenproblems. Along with BLAS (on which it is built), LAPACK powers pretty much all of the scientific computation that happens; BLAS is an API with multiple implementations dating from 1979, while LAPACK dates from 1992. If you have ever used Matlab, this is what was being called behind the scenes. SciPy, again, provides a high-level wrapper around this: the `linalg.cho_solve` function.

But SciPy also gives us the possibility to import functions directly from LAPACK, through `linalg.lapack.get_lapack_funcs`. Let's see how the low-level LAPACK function compares to the SciPy wrapper for our use case:

```
In [11]: x = np.random.randn(1000)

In [12]: y = x.copy()

In [13]: timeit linalg.cho_solve((L, True), x)
1 loops, best of 3: 95.4 ms per loop

In [14]: potrs, = linalg.lapack.get_lapack_funcs(('potrs',), (G,))

In [15]: potrs
Out[15]: <fortran object>

In [16]: timeit potrs(L, y)
100 loops, best of 3: 9.49 ms per loop
```

That's 10 times faster! So now we have an obvious way to optimize the code:

```python
def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
    min_float = np.finfo(X.dtype).eps
    potrs, = get_lapack_funcs(('potrs',), (X,))
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0
    idx = []

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while 1:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam < n_active or alpha[lam] ** 2 < min_float:
            # atom already selected or inner product too small
            warn("Stopping early")
            break
        if n_active > 0:
            # Updates the Cholesky decomposition of X' X
            L[n_active, :n_active] = np.dot(X[:, idx].T, X[:, lam])
            solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
            d = np.dot(L[n_active, :n_active].T, L[n_active, :n_active])
            if 1 - d <= min_float:
                # selected atoms are dependent
                warn("Stopping early")
                break
            L[n_active, n_active] = np.sqrt(1 - d)
        idx.append(lam)
        n_active += 1
        # solve LL'x = y in two steps:
        gamma, _ = potrs(L[:n_active, :n_active], alpha[idx], lower=True,
                         overwrite_b=False)
        residual = y - np.dot(X[:, idx], gamma)
        if eps is not None and np.dot(residual.T, residual) <= eps:
            break
        elif n_active == max_features:
            break
    return gamma, idx
```

Woohoo! But we still lag behind. Now that we have delegated the trickiest parts of the code to fast and reliable solvers, it's time to use a profiler and see where the bottleneck is now. Python has excellent tools for this purpose; what solved the problem in this case was `line_profiler`. There is a great article by Olivier Grisel [2] on how to use these profilers, so I'm just going to say that `line_profiler`'s output is very helpful: it basically prints the time taken by each line of code next to that line.
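
If you have never used it, the workflow is roughly the following sketch (the file name is made up, and depending on your version the launcher script is called `kernprof.py` or just `kernprof`):

```python
# profile_omp.py -- hypothetical driver script.
# The @profile decorator is injected by kernprof at run time,
# so there is nothing to import for it.

@profile
def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
    ...  # body as above

# Run with:
#   $ kernprof.py -l -v profile_omp.py
# line_profiler then prints, for each line of the decorated function,
# the number of hits, the time spent there and the percentage of total time.
```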

Running the profiler on this code, we found that 58% of the time is spent on line 14 (the correlation `np.dot(X.T, residual)`), 20.5% on line 21 (the Cholesky update), and 20.5% on line 32 (recomputing the residual), with the rest being insignificant (`potrs` takes 0.1%!). The code is clearly dominated by the matrix multiplications. By running some more timings in IPython, I found that multiplying column-wise views of the data such as `X[:, idx]` is considerably slower than multiplying a contiguous array. The least-angle regression code in `scikits.learn` avoids this by swapping columns towards the front of the array as they are chosen, so we can replace `X[:, idx]` with `X[:, :n_active]`. The nice part is that if the array is stored in Fortran-contiguous order (i.e. column-contiguous order, as opposed to the row-contiguous order of C), swapping two columns is a very fast operation! Let's see some more benchmarks:

```
In [17]: X = np.random.randn(5000, 5000)

In [18]: Y = X.copy('F')  # fortran-ordered

In [19]: a, b = 1000, 2500

In [20]: swap, = linalg.get_blas_funcs(('swap',), (X,))

In [21]: timeit X[:, a], X[:, b] = swap(X[:, a], X[:, b])
100 loops, best of 3: 6.29 ms per loop

In [22]: timeit Y[:, a], Y[:, b] = swap(Y[:, a], Y[:, b])
10000 loops, best of 3: 111 us per loop
```

We can see that using Fortran order takes us from the order of milliseconds to the order of microseconds!

Side note: I almost fell into the trap of swapping columns the pythonic way. That doesn’t work:

```
In [23]: X[:, a], X[:, b] = X[:, b], X[:, a]

In [24]: np.testing.assert_array_equal(X[:, a], X[:, b])

In [25]:
```

However, this trick works great for swapping elements of one-dimensional arrays.
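
For the record, here is a quick check of both behaviours on throwaway arrays:

```python
import numpy as np

# On a 2-D array, tuple assignment on column views silently fails: the
# right-hand side holds views, not copies, so once the first column is
# overwritten the second view already sees the new data.
M = np.arange(6.).reshape(2, 3)
M[:, 0], M[:, 1] = M[:, 1], M[:, 0]
print(M[:, 0], M[:, 1])   # both now hold the old second column

# On a 1-D array the same idiom works, because indexing single elements
# returns scalars, which are copied out before the assignment happens.
v = np.array([1., 2., 3.])
v[0], v[2] = v[2], v[0]
print(v)                  # [ 3.  2.  1.]
```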

Another small optimization we can do: I found that, on my system, it's slightly faster to compute the norm using the BLAS function `nrm2`.
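
For reference, this is how you get hold of it; note that `nrm2` returns the norm itself, so it needs squaring wherever the code used a `np.dot(x, x)`-style computation:

```python
import numpy as np
from scipy import linalg

x = np.random.randn(1000)

# pick the BLAS routine matching the dtype of x (dnrm2 for float64)
nrm2, = linalg.get_blas_funcs(('nrm2',), (x,))

# nrm2(x) is the Euclidean norm; square it to get the squared norm
print(np.allclose(nrm2(x) ** 2, np.dot(x, x)))
```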

So by putting all of these together, we end up with the final version of our code:

```python
import numpy as np
from warnings import warn
from scipy import linalg
from scipy.linalg.lapack import get_lapack_funcs
from scikits.learn.utils.arrayfuncs import solve_triangular


def cholesky_omp(X, y, n_nonzero_coefs, eps=None, overwrite_X=False):
    if not overwrite_X:
        X = X.copy('F')
    else:  # even if we are allowed to overwrite, still copy it if badly ordered
        X = np.asfortranarray(X)

    min_float = np.finfo(X.dtype).eps
    nrm2, swap = linalg.get_blas_funcs(('nrm2', 'swap'), (X,))
    potrs, = get_lapack_funcs(('potrs',), (X,))

    indices = np.arange(X.shape[1])  # keeping track of the swapping
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while True:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam < n_active or alpha[lam] ** 2 < min_float:
            # atom already selected or inner product too small
            warn("Stopping early")
            break
        if n_active > 0:
            # Updates the Cholesky decomposition of X' X
            L[n_active, :n_active] = np.dot(X[:, :n_active].T, X[:, lam])
            solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
            v = nrm2(L[n_active, :n_active]) ** 2
            if 1 - v <= min_float:
                # selected atoms are dependent
                warn("Stopping early")
                break
            L[n_active, n_active] = np.sqrt(1 - v)
        X.T[n_active], X.T[lam] = swap(X.T[n_active], X.T[lam])
        alpha[n_active], alpha[lam] = alpha[lam], alpha[n_active]
        indices[n_active], indices[lam] = indices[lam], indices[n_active]
        n_active += 1
        # solves LL'x = y as a composition of two triangular systems
        gamma, _ = potrs(L[:n_active, :n_active], alpha[:n_active],
                         lower=True, overwrite_b=False)
        residual = y - np.dot(X[:, :n_active], gamma)
        if eps is not None and nrm2(residual) ** 2 <= eps:
            break
        elif n_active == max_features:
            break
    return gamma, indices[:n_active]
```
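
If you want to play with it, something along these lines should work (random data with three hypothetical nonzero coefficients, assuming the function above is defined):

```python
import numpy as np

np.random.seed(0)

# a random dictionary with unit-norm atoms (the algorithm assumes
# normalized columns, since the Gram diagonal is taken to be 1)
X = np.random.randn(100, 200)
X /= np.sqrt(np.sum(X ** 2, axis=0))

# a signal built from three atoms
true_idx = [5, 42, 137]
y = np.dot(X[:, true_idx], np.array([1., -2., 0.5]))

gamma, idx = cholesky_omp(X, y, n_nonzero_coefs=3)
print(sorted(idx))   # should print [5, 42, 137]
print(gamma)         # the corresponding coefficients, in selection order
```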

Now, the benchmark at [1] indicates victory over least-angle regression! I hope you have enjoyed this short tour. See you next time!

[1] OMP vs. Lars benchmark

[2] Profiling Python code

*dictionary learning, python, scikits.learn*

Mathieu

August 11, 2011

It should be possible to use potrs for the Ridge regression module as well.

Gael Varoquaux

August 12, 2011

You are getting pretty good at writing blog posts :).

But you are leaving all the cheesy part out: how much faster than Lars is OMP, at the end? I cannot run the benchmarks on my mobile phone 🙂