Luckily we were able to find it and fix it very quickly. I have updated the old entries I wrote on the OMP optimizations, so they no longer include the bug. But I take this opportunity to explain what exactly went wrong.

A key part of the optimization relies on the fact that slicing arbitrary columns out of an array is slow when the result is passed to BLAS functions such as matrix multiplication. To get the most out of BLAS, the data should have a contiguous layout. We achieved this by swapping active dictionary atoms (columns) to the beginning of the array.

Something that can happen, though rarely, is that after an atom is selected as active, the atom that takes its place after swapping needs to be selected next. This is rare because dictionaries have many columns, out of which only very few will be active. But when it happens, because the code didn't keep track of swapped indices, the corresponding coefficient of the solution would get updated twice, leading to more zero entries than there should be. A keen eye could have noticed that the first `n_nonzero_coefs` entries in OMP solution vectors were never non-zero. But alas, my eye was not a keen one at all.
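To illustrate the kind of bookkeeping the fix introduced, here is a toy sketch (my own illustration, not the scikit code itself) of tracking original column indices across swaps:

```python
import numpy as np

X = np.arange(8.0).reshape(2, 4)   # four toy "atoms" (columns)
indices = list(range(X.shape[1]))  # maps position -> original column index

# select atom 3 as the first active atom: swap it to position 0
n_active, lam = 0, 3
X[:, [n_active, lam]] = X[:, [lam, n_active]]
indices[n_active], indices[lam] = indices[lam], indices[n_active]

# if the original atom 0 is selected next, it now lives at position 3;
# without the index map, position 0 (atom 3) would be updated a second time
print(indices)           # [3, 1, 2, 0]
print(indices.index(0))  # 3
```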

In other words, the following test (that was written after the bug was found, unfortunately) was failing:

def test_swapped_regressors():
    gamma = np.zeros(n_features)
    # X[:, 21] should be selected first, then X[:, 0] selected second,
    # which will take X[:, 21]'s place in case the algorithm does
    # column swapping for optimization (which is the case at the moment)
    gamma[21] = 1.0
    gamma[0] = 0.5
    new_y = np.dot(X, gamma)
    new_Xy = np.dot(X.T, new_y)
    gamma_hat = orthogonal_mp(X, new_y, 2)
    gamma_hat_gram = orthogonal_mp_gram(G, new_Xy, 2)
    # active indices should be [0, 21], but prior to the bugfix
    # the algorithm would update only [21] but twice
    assert_equal(np.flatnonzero(gamma_hat), [0, 21])
    assert_equal(np.flatnonzero(gamma_hat_gram), [0, 21])

Note that this bug was fixed a while ago, but I didn't find the free time to write this post until now. The good news is: we fixed it, and did so very quickly after the report. So you can still trust me, I guess!

]]>The ratio-of-uniforms is a method that can be applied to many density functions. Essentially, we are given a density function $f(x) = \eta h(x)$ over $\mathbb{R}$, where $\eta$ is a normalization constant (i.e. $\eta = 1 / \int h(x)\,dx$, so that $f$ integrates to one), along with a parameter $c \geq 0$ and a transformation from $(u, v)$ to $x$ expressed as $x = v / u^c$.

Define the set $A_c$ where

$A_c = \{(u, v) : 0 < u \leq h(v / u^c)^{1/(c+1)}\}$.

If $A_c$ is bounded and we sample a vector $(u, v)$ uniformly over $A_c$, then $x = v / u^c$ has density $f$. Also note that the measure (volume) of the set $A_c$ is $1 / ((c + 1)\eta)$. I do not have any references for the proof, except for a book in Romanian, but if you are interested, just leave me a comment and I'll do a follow-up post with the proofs.

For the standard case $c = 1$, all of the above simplifies to $A_1 = \{(u, v) : 0 < u \leq \sqrt{h(v / u)}\}$. We generate $(u, v)$ uniformly over $A_1$ and take $x = v / u$.
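As a quick sanity check of the $c = 1$ recipe, here is a small sketch (my own toy example) sampling the standard exponential, $h(x) = e^{-x}$, for which the bounding box is given by $a = \sup_x h(x)^{1/2} = 1$ and $b = \sup_x x\,h(x)^{1/2} = 2/e$:

```python
import numpy as np

rng = np.random.RandomState(0)
a, b = 1.0, 2.0 / np.e

# rejection sample uniformly from the bounding box, keep points inside A_1
u = rng.uniform(1e-12, a, 50000)   # avoid u == 0 for the logarithm
v = rng.uniform(0, b, 50000)
accept = 2 * np.log(u) <= -v / u   # u <= h(v/u)^(1/2), in log form
x = (v / u)[accept]

# the accepted ratios follow Exp(1): mean and variance should both be near 1
print(x.mean(), x.var())
```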

Since we are looking at the (univariate) Gamma distribution, it is described by the density $f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k)\,\theta^k}$, where $k$ is the shape parameter and $\theta$ is the scale parameter.

But because of the property that if $X \sim \mathrm{Gamma}(k, 1)$, then for any $\theta > 0$, $\theta X \sim \mathrm{Gamma}(k, \theta)$, we conclude that we can fix $\theta$ to 1 without loss of generality. Replacing in the style of the definition in the previous section, we have $h(x) = x^{k-1} e^{-x}$ and $\eta = 1 / \Gamma(k)$.
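The scaling property is easy to check numerically; a quick sketch using NumPy's own standard gamma generator:

```python
import numpy as np

rng = np.random.RandomState(0)
k, theta = 2.5, 3.0

# scale a Gamma(k, 1) sample by theta to obtain Gamma(k, theta);
# the mean and variance should be k * theta and k * theta ** 2
x = theta * rng.standard_gamma(k, size=200000)
print(x.mean(), x.var())
```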

This allows us to compute the equation of the boundary of the set $A_c$, which ends up being described by $u^{c+1} = (v/u^c)^{k-1} e^{-v/u^c}$. For visualisation purposes, here is how it looks (plotted using Wolfram Alpha):

In order to uniformly sample from this set, we can apply basic rejection sampling: just uniformly sample from a rectangular region surrounding the set, and reject the points that do not satisfy the condition. In order to do this as efficiently as possible, we need to compute the minimal bounding box, which can be done by solving a couple of optimization problems using Lagrange multipliers and the KKT conditions. Also by looking closely at the image, you can see that the lower left corner is exactly the origin: this turns out not to be a coincidence. I won’t go into detail here, but here are the bounds I derived:

$0 \leq u \leq a = \left(\frac{k-1}{e}\right)^{\frac{k-1}{c+1}}$ and $0 \leq v \leq b = \left(\frac{ck+1}{ce}\right)^{\frac{ck+1}{c+1}}$

The probability of acceptance (which can be seen as the efficiency) of the rejection sampling method is given by the ratio of the areas of the set and the bounding box. The larger this probability, the fewer points we throw away and the more efficient the algorithm is. Using the values derived above, this probability is

$P = \frac{|A_c|}{a\,b} = \frac{\Gamma(k)}{(c+1)\,a\,b}$
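The formula can be sanity-checked against a Monte Carlo estimate; here is a small sketch of my own, with $k = 3$, $c = 1$ chosen arbitrarily:

```python
import numpy as np
from scipy.special import gamma

k, c = 3.0, 1.0
a = ((k - 1) / np.e) ** ((k - 1) / (c + 1))
b = ((c * k + 1) / (c * np.e)) ** ((c * k + 1) / (c + 1))
p_analytic = gamma(k) / ((c + 1) * a * b)

# Monte Carlo: fraction of the bounding box inside the acceptance region
rng = np.random.RandomState(0)
u = rng.uniform(1e-12, a, 200000)
v = rng.uniform(0, b, 200000)
x = v / u ** c
p_mc = np.mean((c + 1) * np.log(u) <= (k - 1) * np.log(x) - x)

print(p_analytic, p_mc)  # the two estimates agree closely
```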

Personally, I got stumped here. The idea would be to determine the ideal $c$ for a given $k$ in order to maximize the probability, but I didn't manage to do it (I leave it as an exercise for the reader ;)). Anyway, this is enough to proceed with an implementation, so I'm going to give the Python code for it. Note that the code uses the name k for the shape parameter. Also note that the case $k < 1$ needed to be treated separately, which I did using the following property: let $U \sim \mathrm{Uniform}(0, 1)$. If $X \sim \mathrm{Gamma}(1 + k, 1)$, then $X U^{1/k} \sim \mathrm{Gamma}(k, 1)$. For a proof of this fact, see [1], which is a great article on generating Gamma variates.

import numpy as np


def _cond(u, v, k, c):
    """Indicator function describing the acceptance region"""
    x = v / u ** c
    return (c + 1) * np.log(u) <= (k - 1) * np.log(x) - x


def vn_standard_gamma(k, c=1.0, rng=np.random):
    """Generates a single standard gamma random variate"""
    if k <= 0:
        raise ValueError("Gamma shape should be positive")
    elif k < 1:
        return vn_standard_gamma(1 + k, c, rng) * rng.uniform() ** (1 / k)
    elif k == 1:
        return rng.standard_exponential()
    else:
        a, b = get_bounds(k, c)
        while True:
            u, v = rng.uniform(0, a), rng.uniform(0, b)
            if _cond(u, v, k, c):
                break
        return v / u ** c


def vn_gamma(k, t, shape=1, c=1.0, rng=np.random):
    """Vectorized function to generate multiple gamma variates"""
    generator = lambda x: t * vn_standard_gamma(k, c, rng)
    generator = np.vectorize(generator)
    return generator(np.empty(shape))


def get_bounds(k, c=1.0):
    """Computes the minimal upper bounds surrounding the acceptance region"""
    a = ((k - 1) / np.e) ** ((k - 1) / (c + 1))
    b = ((c * k + 1) / (c * np.e)) ** ((c * k + 1) / (c + 1))
    return a, b


def prob_acc(k, c=1.0):
    """Calculates the probability of acceptance for the given parameters"""
    from scipy.special import gamma
    a, b = get_bounds(k, c)
    return gamma(k) / ((c + 1) * a * b)
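As a quick check of the $k < 1$ boosting property used in `vn_standard_gamma` (a sketch using NumPy's generators rather than the code above):

```python
import numpy as np

rng = np.random.RandomState(0)
k = 0.5

# X ~ Gamma(1 + k, 1) and U ~ Uniform(0, 1) give X * U**(1/k) ~ Gamma(k, 1)
x = rng.standard_gamma(1 + k, size=200000)
u = rng.uniform(size=200000)
g = x * u ** (1 / k)

# Gamma(k, 1) has mean k and variance k
print(g.mean(), g.var())
```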

And of course I should show you that it works. Here are some histograms for various values of $k$, with the theoretical density plotted in dotted red, after drawing a large number of samples. The y-axis is the frequency (sorry for labeling in Romanian), and for the red dotted line it can be interpreted as the theoretical probability. You can clearly see that the goodness of fit is excellent.

[1]: George Marsaglia and Wai Wan Tsang. 1998. The Monty Python method for generating random variables. ACM Trans. Math. Softw. 24, 3 (September 1998), 341-350. DOI=10.1145/292395.292453

]]>I must begin by thanking them: the organization was impeccable! I’m not sure, but I think that at one point Ivelina was even running around buying routers to improve wifi coverage (which is already spectacular in Bulgaria — I’ve received reports from Miranda that you can get wifi in the mountains!)

The schedule was busy, with three tracks running in parallel in order to cover a wide range of topics in computational linguistics. The student workshop also deserves a mention for the excellent quality of the work presented there.

Of course it would be infeasible to write about all the great people I met and their high quality work. And if I were to write about all the fun we had, it would probably make this post look unprofessional :). This doesn’t mean I forgot about any of you, and as soon as I get the chance to work on something related, I will most certainly write about it, and you.

So, if I had to summarize the trends and ideas stated during the conference, and especially during the keynotes, I would say:

- When talking about word sense disambiguation, we shouldn't speak of the different meanings of a word, but rather of the potential a word has for bringing a certain meaning to a certain context. See Patrick Hanks‘ Corpus Pattern Analysis. Without something like this, good WSD requires heavily adjusting the overlapping meanings from a Wordnet-style ontology.
- Certain relations, such as temporal and spatial ones, can naturally be modeled by complex domain-specific logics (see Inderjeet Mani‘s new book, Interpreting Motion: Grounded Representations for Spatial Language, which is due to be published). But these only appear in a small subset of human communication. The attempt to map human language to a complete logic in which to do general-purpose inference seems futile: Ido Dagan suggests textual entailment: reasoning directly in natural language, and only abstracting away to a formal logic system when the need arises.
- If you have a large enough sample of n-gram frequency data, you can eventually beat the performance you can get with a limited amount of labeled data, and most importantly: it generalizes much better when going out of the domain you trained on. Apparently the best tool for this at the moment is the Google n-gram data, which has some limitations. In time, we can easily extend this data by huge amounts by mining n-grams from Wikipedia (which allegedly has a higher count of distinct n-grams than the Google dataset), and more importantly, by aligning multi-language data, making use of transliterations and cognate identification.

Please note that I may be (or most probably, am) ignorant of older instances of similar ideas, and I may have misunderstood certain claims. Please feel free to discuss in the comments whether you think I forgot about something important, or whether I am plain wrong about something. In particular, I seem to have been completely ignorant of the existence of the Google n-gram data, which has been around for quite a while, so I must have missed other important things as well.

Take care, kind readers, and express your opinion!

V

The stars of this new release are: the manifold learning module by Jake Vanderplas and Fabian Pedregosa, the Dirichlet process gaussian mixture model by Alexandre Passos, and many others, as you can see from the development changelog (as soon as the release is made, I will update this post with permanent links).

The release is due tomorrow. I will also be in charge of building the Windows installers for this release; let's hope I do a good job, so you can think of me and smile when installing!

]]>For the last two weeks, I’ve been traveling. I attended the EuroScipy conference thanks to Fabian, who offered me a place to sleep during the week. We sprinted hard, we discussed tricky APIs, we drank a lot of coffee and beer, and we ate well in lovely Paris. It was great to meet all of the celebrities, the people who keep the scientific Python globe turning.

Many thanks to Gael and Emmanuelle, who worked very, very hard on organizing everything; they were so busy that they weren’t around when I ran to catch my plane last Sunday, so I didn’t get to say goodbye.

I was in a hurry, heading to Tarragona, a beautiful city on the Catalan coast, where the public university organized the 2011 summer school in linguistics and speech technologies (SSLST). This was a great opportunity to meet many fellow young researchers working in computational linguistics. I will not go into details now, because I plan to expand on this, but I would like to state a couple of things. Firstly, even though NLP seems to be mostly a Java-dominated affair (note for example Stanford’s NLP toolkit and Sheffield’s GATE), the computational linguistics and psycholinguistics research center (CLiPS) at the University of Antwerp actually briefly manifested its devotion to Python and NLTK via its research director, Walter Daelemans.

It was good to see a little love for Python in this field. NLTK is very underrepresented in the SciPy community: I couldn’t find anybody at the EuroScipy conference who knew much about it or about the people behind it.

Another lab that has done a lot of cool work is Cental at the Catholic University of Louvain, and they also use Python for natural language processing. Maybe in the coming years, we will see a Python for Computational Linguistics satellite, along with Physics and Neuroscience. Doesn’t it sound more fun?

Secondly, I wish SSLST were organized by someone like Gael! As the discussion at dinner regarding who will organize next year’s EuroScipy went, it is imperative that the organizers be actively involved in the community, and generally passionate about it. Even though I’m comparing apples and oranges, Carlos Martin-Vide behaved in this context like an old, tired, emotionless academic, not taking into account even lunch breaks for the whole group, not to mention any sort of getting together or even a group photo (which, alas, we were not able to take, apart from small groups.) They said it couldn’t be done. Of course it could, they just didn’t want it hard enough.

Finally, before signing off, I would like to announce that because the Romanian Ministry of Education failed to specify the allocated number of public positions for masters’ programmes, the admission exam at the University of Bucharest will be delayed by a couple of weeks. Luckily, this will allow me to attend RANLP 2011 in Hissar, Bulgaria a week from now, where I will present my poster entitled:

“Can alternations be learned? A machine learning approach to Romanian verb conjugation” by Liviu P. Dinu, Emil Ionescu, Vlad Niculae and Octavia-Maria Sulea. See you in Hissar!

When we last saw our hero, he was fighting with the dreaded implementation of least-angle regression, knowing full well that it was his destiny to be faster.

We had come up with a more robust implementation, catching malformed cases that would have broken the naive implementation, while also being orders of magnitude faster than it. However, our benchmark [1] showed that it was still a couple of times slower than least-angle regression.

By poking around the `scikits.learn` codebase, I noticed that there is a triangular system solver in `scikits.learn.utils.arrayfuncs`. Unlike the `scipy.linalg` one, this one only works with lower triangular arrays, and it forcefully overwrites `b`. Even if it weren’t faster, it should still be used: `scikits.learn` aims to be as backwards-compatible with SciPy as possible, and `linalg.solve_triangular` was only added in SciPy 0.9.0. Anyway, let’s just see whether it’s faster:

In [1]: import numpy as np

In [2]: from scipy import linalg

In [3]: from scikits.learn.datasets import make_spd_matrix

In [4]: from scikits.learn.utils.arrayfuncs import solve_triangular

In [5]: G = make_spd_matrix(1000)

In [6]: L = linalg.cholesky(G, lower=True)

In [7]: x = np.random.randn(1000)

In [8]: y = x.copy()

In [9]: timeit solve_triangular(L, x)
100 loops, best of 3: 3.45 ms per loop

In [10]: timeit linalg.solve_triangular(L, y, lower=True, overwrite_b=True)
10 loops, best of 3: 134 ms per loop

Wow! That’s 40x faster. We’re catching two rabbits with one stone here, so let’s make the change! Notice that we can just copy the new row into the appropriate place in `L` and then solve the triangular system in place.

But whoops! When solving the system, we take advantage of the `trans` argument of `linalg.solve_triangular`, which the `scikits.learn` version does not expose. We could think of a workaround, but here’s a better idea: shouldn’t there be some way to directly solve the entire system in one go?

Well, there is. It is a LAPACK function by the name of `potrs`. If you are not aware, LAPACK is a Fortran library with solvers for various types of linear systems and eigenproblems. LAPACK, along with BLAS (on which it is built), pretty much powers all the scientific computation that happens. BLAS is an API with multiple implementations dating from 1979, while LAPACK dates from 1992. If you have ever used Matlab, this is what was being called behind the scenes. SciPy, again, provides a high-level wrapper around this: the `linalg.cho_solve` function.

But SciPy also gives us the possibility to import functions directly from LAPACK, through `linalg.lapack.get_lapack_funcs`. Let’s see how the low-level LAPACK function compares to the SciPy wrapper, for our use case:

In [11]: x = np.random.randn(1000)

In [12]: y = x.copy()

In [13]: timeit linalg.cho_solve((L, True), x)
1 loops, best of 3: 95.4 ms per loop

In [14]: potrs, = linalg.lapack.get_lapack_funcs(('potrs',), (G,))

In [15]: potrs
Out[15]: <fortran object>

In [16]: timeit potrs(L, y)
100 loops, best of 3: 9.49 ms per loop

That’s 10 times faster! So now we found an obvious way to optimize the code:

def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
    min_float = np.finfo(X.dtype).eps
    potrs, = get_lapack_funcs(('potrs',), (X,))
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0
    idx = []

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while 1:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam < n_active or alpha[lam] ** 2 < min_float:
            # atom already selected or inner product too small
            warn("Stopping early")
            break
        if n_active > 0:
            # Updates the Cholesky decomposition of X' X
            L[n_active, :n_active] = np.dot(X[:, idx].T, X[:, lam])
            solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
            d = np.dot(L[n_active, :n_active].T, L[n_active, :n_active])
            if 1 - d <= min_float:
                # selected atoms are dependent
                warn("Stopping early")
                break
            L[n_active, n_active] = np.sqrt(1 - d)
        idx.append(lam)
        n_active += 1
        # solve LL'x = y in two steps:
        gamma, _ = potrs(L[:n_active, :n_active], alpha[idx], lower=True,
                         overwrite_b=False)
        residual = y - np.dot(X[:, idx], gamma)
        if eps is not None and np.dot(residual.T, residual) <= eps:
            break
        elif n_active == max_features:
            break

    return gamma, idx
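Before trusting the switch, it is worth a quick check that `potrs` really solves the full system $L L^T x = b$ in one call; here is a standalone sketch (with a random SPD matrix built by hand rather than `make_spd_matrix`):

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
A = rng.randn(50, 50)
G = np.dot(A.T, A) + 50 * np.eye(50)  # a symmetric positive definite matrix
b = rng.randn(50)

L = linalg.cholesky(G, lower=True)
potrs, = linalg.lapack.get_lapack_funcs(('potrs',), (G,))

# one call to potrs...
x_potrs, info = potrs(L, b, lower=True)

# ...matches two chained triangular solves of L u = b and L' x = u
u = linalg.solve_triangular(L, b, lower=True)
x_two = linalg.solve_triangular(L, u, trans=1, lower=True)

print(np.allclose(x_potrs, x_two), np.allclose(np.dot(G, x_potrs), b))
```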

Woohoo! But we still lag behind. Now that we have delegated the trickiest parts of the code to fast and reliable solvers, it’s time to use a profiler and see where the bottleneck is now. Python has excellent tools for this purpose. What solved the problem in this case was `line_profiler`. There is a great article by Olivier Grisel [2] regarding how to use these profilers. I’m just going to say that `line_profiler`’s output is very helpful, basically printing the time taken by each line of code next to that line.

Running the profiler on this code, we found that 58% of the time is spent on line 14, 20.5% on line 21, and 20.5% on line 32, with the rest being insignificant (`potrs` takes 0.1%!). The code is clearly dominated by the matrix multiplications. By running some more timings with IPython, I found that multiplying such column-wise views of the data as `X[:, idx]` is considerably slower than multiplying a contiguous array. The least-angle regression code in `scikits.learn` avoids this by swapping columns towards the front of the array as they are chosen, so we can replace `X[:, idx]` with `X[:, :n_active]`. The nice part is that if the array is stored in Fortran-contiguous order (i.e. column-contiguous order, as opposed to the row-contiguous order used in C), swapping two columns is a very fast operation! Let’s see some more benchmarks:

In [17]: X = np.random.randn(5000, 5000)

In [18]: Y = X.copy('F')  # fortran-ordered

In [19]: a, b = 1000, 2500

In [20]: swap, = linalg.get_blas_funcs(('swap',), (X,))

In [21]: timeit X[:, a], X[:, b] = swap(X[:, a], X[:, b])
100 loops, best of 3: 6.29 ms per loop

In [22]: timeit Y[:, a], Y[:, b] = swap(Y[:, a], Y[:, b])
10000 loops, best of 3: 111 us per loop

We can see that using Fortran order takes us from the order of milliseconds to the order of microseconds!
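Here is a tiny self-contained sketch of the BLAS swap pattern on a small Fortran-ordered array, to show that it really does swap the values in place:

```python
import numpy as np
from scipy import linalg

X = np.asfortranarray(np.arange(12.0).reshape(3, 4))
swap, = linalg.get_blas_funcs(('swap',), (X,))

# each column of a Fortran-ordered array is contiguous, so the BLAS swap
# operates directly on the column views
X[:, 0], X[:, 2] = swap(X[:, 0], X[:, 2])
print(X[0])  # [2. 1. 0. 3.]
```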

Side note: I almost fell into the trap of swapping columns the pythonic way. That doesn’t work:

In [23]: X[:, a], X[:, b] = X[:, b], X[:, a]

In [24]: np.testing.assert_array_equal(X[:, a], X[:, b])

In [25]:

However, this trick does work great for swapping elements of one-dimensional arrays.

Another small optimization that we can do: I found that on my system, it’s slightly faster to compute the norm using the BLAS function `nrm2`. So, putting all of these together, we end up with the final version of our code:

def cholesky_omp(X, y, n_nonzero_coefs, eps=None, overwrite_X=False):
    if not overwrite_X:
        X = X.copy('F')
    else:  # even if we are allowed to overwrite, still copy it if bad order
        X = np.asfortranarray(X)

    min_float = np.finfo(X.dtype).eps
    nrm2, swap = linalg.get_blas_funcs(('nrm2', 'swap'), (X,))
    potrs, = get_lapack_funcs(('potrs',), (X,))

    indices = range(X.shape[1])  # keeping track of swapping
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while True:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam < n_active or alpha[lam] ** 2 < min_float:
            # atom already selected or inner product too small
            warn("Stopping early")
            break
        if n_active > 0:
            # Updates the Cholesky decomposition of X' X
            L[n_active, :n_active] = np.dot(X[:, :n_active].T, X[:, lam])
            solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
            v = nrm2(L[n_active, :n_active]) ** 2
            if 1 - v <= min_float:
                # selected atoms are dependent
                warn("Stopping early")
                break
            L[n_active, n_active] = np.sqrt(1 - v)
        X.T[n_active], X.T[lam] = swap(X.T[n_active], X.T[lam])
        alpha[n_active], alpha[lam] = alpha[lam], alpha[n_active]
        indices[n_active], indices[lam] = indices[lam], indices[n_active]
        n_active += 1
        # solves LL'x = y as a composition of two triangular systems
        gamma, _ = potrs(L[:n_active, :n_active], alpha[:n_active],
                         lower=True, overwrite_b=False)
        residual = y - np.dot(X[:, :n_active], gamma)
        if eps is not None and nrm2(residual) ** 2 <= eps:
            break
        elif n_active == max_features:
            break

    return gamma, indices[:n_active]

Now, the benchmark at [1] indicates victory over least-angle regression! I hope you have enjoyed this short tour. See you next time!

]]>I will go through the process of developing this particular piece of code as an example of code refining and iterative improvement, as well as for the useful notes it will provide on optimizing numerical Python code. In the first part, we will see how the code got from its pseudocode state to reasonably efficient code with smart memory allocation. In the next part, we will see how to make it blazing fast by leveraging [1] lower-level BLAS and LAPACK routines, and how to use profiling to find hot spots.

As stated before, orthogonal matching pursuit is a greedy algorithm for finding a sparse solution to a linear regression problem $y = X\gamma$. Mathematically, it approximates the solution of the optimization problem:

$\arg\min_\gamma \|y - X\gamma\|_2^2$ subject to $\|\gamma\|_0 \leq n_{\mathrm{nonzero\_coefs}}$

or (under a different parametrization):

$\arg\min_\gamma \|\gamma\|_0$ subject to $\|y - X\gamma\|_2^2 \leq \epsilon$

In the code samples in this post I will omit the docstrings, but I will follow the notation in the formulas above.

**Important note:** The regressors/dictionary atoms (the columns of $X$) are assumed to be normalized throughout this post (as is usual in any discussion of OMP). We also assume the following imports:

import numpy as np
from scipy import linalg

Orthogonal matching pursuit is a very simple algorithm in pseudocode, and as I stated before, it almost writes itself in NumPy. For this reason, instead of stating the pseudocode here, I will start with what a naive OMP implementation looks like in Python:

def orthogonal_mp(X, y, n_nonzero_coefs, eps=None):
    residual = y
    idx = []
    if eps == None:
        stopping_condition = lambda: len(idx) == n_nonzero_coefs
    else:
        stopping_condition = lambda: np.inner(residual, residual) <= eps

    while not stopping_condition():
        lam = np.abs(np.dot(residual, X)).argmax()
        idx.append(lam)
        gamma, _, _, _ = linalg.lstsq(X[:, idx], y)
        residual = y - np.dot(X[:, idx], gamma)

    return gamma, idx

Using lambda expressions as stopping conditions never looked like a brilliant idea, but it seems to me like the most elegant way to specify such a variable stopping condition. However, the biggest slowdown here is the need to solve a least squares problem at each iteration, while least-angle regression is known to produce the entire regularization path for the cost of a single least squares problem. We will also see that this implementation is more vulnerable to numerical stability issues.

In [2], Rubinstein et al. described the Cholesky-OMP algorithm, an implementation of OMP that avoids solving a new least squares problem at each iteration by keeping a Cholesky decomposition $L_k L_k^T$ of the Gram matrix $X_k^T X_k$ of the active atoms. Because $X_k$ grows by exactly one column at each iteration, $L_k$ can be updated according to the following rule: given the new atom $x_{k+1}$, and knowing the decomposition $X_k^T X_k = L_k L_k^T$, the Cholesky decomposition of $X_{k+1}^T X_{k+1}$ is given by

$L_{k+1} = \begin{pmatrix} L_k & 0 \\ w^T & \sqrt{1 - w^T w} \end{pmatrix}$, where $w$ solves $L_k w = X_k^T x_{k+1}$.

Even if you are unfamiliar with the mathematical properties of the Cholesky decomposition, you can see from the construction detailed above that $L$ is always going to be a lower triangular matrix (it will only have null elements above the main diagonal). Actually, the letter L stands for lower. We have therefore replaced the step where we needed to solve a least squares problem with two much simpler computations: solving $L u = X_k^T y$ and then solving $L^T \gamma = u$. Due to $L$’s triangular structure, these are much quicker operations than a least squares projection.
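The update rule is easy to verify numerically. Here is a quick sketch of my own (assuming normalized atoms, as stated above) that compares the incremental factor against a from-scratch Cholesky decomposition:

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(30, 5)
X /= np.sqrt((X ** 2).sum(axis=0))   # normalized dictionary atoms

# Cholesky factor of the Gram matrix of the first k atoms
k = 3
L = linalg.cholesky(np.dot(X[:, :k].T, X[:, :k]), lower=True)

# add atom k: solve L w = X_k' x_new, then append the row [w', sqrt(1 - w'w)]
w = linalg.solve_triangular(L, np.dot(X[:, :k].T, X[:, k]), lower=True)
L_new = np.zeros((k + 1, k + 1))
L_new[:k, :k] = L
L_new[k, :k] = w
L_new[k, k] = np.sqrt(1 - np.dot(w, w))

# the incrementally updated factor matches the one computed from scratch
L_full = linalg.cholesky(np.dot(X[:, :k + 1].T, X[:, :k + 1]), lower=True)
print(np.allclose(L_new, L_full))  # True
```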

Here is the initial way I implemented this:

def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
    if eps == None:
        stopping_condition = lambda: len(idx) == n_nonzero_coefs
    else:
        stopping_condition = lambda: np.inner(residual, residual) <= eps

    alpha = np.dot(X.T, y)
    residual = y
    idx = []
    L = np.ones((1, 1))

    while not stopping_condition():
        lam = np.abs(np.dot(residual, X)).argmax()
        if len(idx) > 0:
            w = linalg.solve_triangular(L, np.dot(X[:, idx].T, X[:, lam]),
                                        lower=True)
            L = np.r_[np.c_[L, np.zeros(len(L))],
                      np.atleast_2d(np.append(w, np.sqrt(1 - np.dot(w.T, w))))]
        idx.append(lam)
        # solve LL'x = y in two steps:
        Ltx = linalg.solve_triangular(L, alpha[idx], lower=True)
        gamma = linalg.solve_triangular(L, Ltx, trans=1, lower=True)
        residual = y - np.dot(X[:, idx], gamma)

    return gamma, idx

Note that a lot of the code remained unchanged: this is the same algorithm as before, only with the Cholesky trick used to improve performance. From the plot in [3], we can see that the naive implementation shows oscillations of the reconstruction error due to numerical instability, while this Cholesky implementation is well-behaved.

Along with this, I also implemented the Gram-based version of this algorithm, which only needs $G = X^T X$ and $X^T y$ (and $\|y\|_2^2$, in case the epsilon parametrization is desired). This is called **Batch OMP** in [2], because it offers speed gains when many signals need to be sparse-coded against the same dictionary $X$. A lot of speed is gained because two large matrix multiplications are avoided at each iteration, but for many datasets, the cost of the precomputations dominates the procedure. I will not insist on Gram OMP in this post; it can be found in the `scikits.learn` repository [4].

Now, the problems with this are a bit more subtle. At this point, I moved on to code other things, since OMP was passing tests and the signal recovery example was working. The following issues popped up during review:

1. The lambda stopping condition does not pickle.

2. For well-constructed signals and data matrices, assuming normalized atoms, the vector `w` computed in the Cholesky update will never have norm greater than or equal to one, unless the chosen feature happens to be linearly dependent on the already chosen set. In theory, this cannot happen, since we do an orthogonal projection at each step. However, if the matrix is not well-behaved (for example, if it has two identical columns, and $y$ is built using non-zero coefficients for those columns), then we end up taking the square root of a negative value when computing the new diagonal element of $L$.

3. It was orders of magnitude slower than least-angle regression, given the same number of nonzero coefficients.

1 was an easy fix. 2 was a bit tricky, since it was a little hidden: the first time I encountered such an error, I wrongfully assumed that, given that the diagonal of the Gram matrix is unit, $L$ should also have a unit diagonal, so I passed the parameter `unit_diagonal=True` to `linalg.solve_triangular`, and the plethora of NaNs along the diagonal was simply ignored. Let this show what happens when you don’t pay attention when coding.

When I realized my mistake, I first did something I saw in `lars_path` from the scikit: take the absolute value of the argument of `sqrt`, and also ensure it is practically larger than zero. However, tests started failing randomly. Confusion ensued until the nature of the issue, discussed above, was discovered. It’s just not right to take the `abs`: if that argument ends up less than zero, OMP simply cannot proceed and must stop due to malformed data. The reference implementation from the website of the authors of [2] includes explicit *early stopping* conditions for this, along with some other cases.
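To see issue 2 concretely, here is a toy sketch (hypothetical data of my own, not the scikit tests) with a duplicated atom, showing the `sqrt` argument collapsing to zero or below:

```python
import numpy as np
from scipy import linalg

x = np.random.RandomState(0).randn(20)
x /= np.sqrt(np.dot(x, x))      # a normalized atom...
X = np.c_[x, x]                 # ...duplicated in the dictionary

L = np.ones((1, 1))             # Cholesky factor after selecting atom 0
w = linalg.solve_triangular(L, np.dot(X[:, :1].T, X[:, 1]), lower=True)
d = np.dot(w.T, w)

# 1 - d is zero up to machine precision, so np.sqrt(1 - d) is meaningless
# here: the implementation must detect this case and stop early
print(1 - d)
```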

At the same time, I started trying a couple of optimizations. The most obvious problem was that the way I was building the $L$ matrix was clearly suboptimal, reallocating it at each iteration.

This leads to the following code:

def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
    min_float = np.finfo(X.dtype).eps
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0
    idx = []

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while 1:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam < n_active or alpha[lam] ** 2 < min_float:
            # atom already selected or inner product too small
            warn("Stopping early")
            break
        if n_active > 0:
            # Updates the Cholesky decomposition of X' X
            w = linalg.solve_triangular(L[:n_active, :n_active],
                                        np.dot(X[:, idx].T, X[:, lam]),
                                        lower=True)
            L[n_active, :n_active] = w
            d = np.dot(w.T, w)
            if 1 - d <= min_float:
                # selected atoms are dependent
                warn("Stopping early")
                break
            L[n_active, n_active] = np.sqrt(1 - d)
        idx.append(lam)
        n_active += 1
        # solve LL'x = y in two steps:
        Ltx = linalg.solve_triangular(L[:n_active, :n_active], alpha[idx],
                                      lower=True)
        gamma = linalg.solve_triangular(L[:n_active, :n_active], Ltx,
                                        trans=1, lower=True)
        residual = y - np.dot(X[:, idx], gamma)
        if eps is not None and np.dot(residual.T, residual) <= eps:
            break
        elif n_active == max_features:
            break

    return gamma, idx

What should be noted here, apart from the obvious fix for issue 1, are the early stopping conditions. It is natural to stop if the same feature gets picked twice: the residual is always orthogonalized with respect to the chosen basis, so the only way this could happen is if there were no more unused independent regressors. This would either trigger this check, or the dependent-atoms check (`1 - d <= min_float`), depending on which equally insignificant vector gets picked. The other criterion for early stopping is when the chosen atom is orthogonal to $y$, which would make it uninformative and would again mean that there are no better ones left, so we might as well quit looking.

Also, we now make sure that $L$ is preallocated. Note that `np.empty` is marginally faster than `np.zeros`, because it does not initialize the array to zero after allocating, so the untouched parts of the array will contain whatever happened to be in memory before. In our case, this means only the values above the main diagonal: everything on and beneath it is initialized before access. Luckily, the `linalg.solve_triangular` function ignores what it doesn’t need.

This is a robust implementation, but still a couple of times slower than least-angle regression. In the next part of the article we will see how we can make it beat LARS.

[1] I always wanted to use this word in a serious context

[2] Rubinstein, R., Zibulevsky, M. and Elad, M., Efficient Implementation of the K-SVD Algorithm using Batch Orthogonal Matching Pursuit Technical Report – CS Technion, April 2008.

[3] First thoughts on Orthogonal Matching Pursuit on this blog.

[4] omp.py on github.

This has helped find a couple of bottlenecks. Time has been gained by preallocating the array that stores the Cholesky decomposition. Also, using the LAPACK `potrs` function to solve a system of the shape $L L^T x = b$ is faster than calling `solve_triangular` twice.

I am still trying to optimize the code. We are working hard to make sure that scikits.learn contributions are up to standards before merging.

]]>`scikits.learn` repository. You can use it if you install the bleeding-edge `scikits.learn` git version, by first downloading the source code as explained in the user’s guide, and then running `python setup.py install`.

To see what code is needed to produce an image such as the one above using `scikits.learn`, check out this cool decomposition example that compares the results of most matrix decomposition models implemented at the moment. There are other cool things that have recently been merged by other contributors, such as support for sparse matrices in minibatch K-means, and the variational infinite gaussian mixture model, so I invite you to take a look!

]]>One of the simplest, and yet most heavily constrained, forms of matrix factorization is vector quantization (VQ). Heavily used in image/video compression, the VQ problem is a factorization $X \approx U D$, where $D$ (our dictionary) is called the codebook and is designed to cover the cloud of data points effectively, and each row of $U$ is a unit vector.

This means that each data point $x_i$ is approximated as $x_i \approx d_{j}$. In other words, the closest row vector (codeword/dictionary atom) $d_j$ of $D$ is chosen as an approximation, and this is encoded as a unit vector $u_i = e_j$. The data representation is composed of such vectors.

There is a variation called gain-shape VQ where, instead of approximating each point as one of the codewords, we allow a scalar multiplicative invariance: $x_i \approx a_i d_j$. This model requires considerably more storage (each data point needs a floating point number in addition to an unsigned index, as opposed to just the index), but it leads to a much better approximation.

Gain-shape VQ can equivalently be accomplished by normalizing each data vector prior to fitting the codebook.
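A tiny sketch of gain-shape encoding (toy data and symbols of my own, not the scikits.learn API): with normalized codewords, the best codeword maximizes the absolute inner product, and the gain is that inner product:

```python
import numpy as np

rng = np.random.RandomState(0)

# a toy codebook of 5 orthonormal codewords (rows), for a clear-cut example
D, _ = np.linalg.qr(rng.randn(5, 5))
D = D.T

x = 3.0 * D[2] + 0.01 * rng.randn(5)   # a scaled, slightly noisy codeword

scores = np.dot(D, x)
j = np.abs(scores).argmax()            # chosen codeword (shape)
gain = scores[j]                       # scalar multiplier (gain)
x_hat = gain * D[j]                    # gain-shape reconstruction

print(j, gain)
```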

In order to fit a codebook for efficient VQ use, the K-Means clustering [1] algorithm is a natural thought. K-means is an iterative algorithm that incrementally improves the dispersion of k cluster centers in the data space until convergence. The cluster centers are initialized in a random or procedural fashion; then, at each iteration, each data point is assigned to the closest cluster center, and each center is subsequently moved to the mean of the points assigned to it.
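The two alternating steps can be sketched in a few lines of NumPy (a minimal toy version with deterministic initialization, not the scikits.learn implementation):

```python
import numpy as np

rng = np.random.RandomState(42)
# toy data: two well-separated blobs of points
X = np.r_[rng.randn(50, 2) + 10, rng.randn(50, 2) - 10]

centers = X[[0, -1]].copy()   # init with one point from each blob

for _ in range(10):
    # assignment step: each point goes to its closest center
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dist.argmin(axis=1)
    # update step: each center moves to the mean of its assigned points
    for j in range(len(centers)):
        centers[j] = X[labels == j].mean(axis=0)

print(np.round(centers))
```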

The `scikits.learn.decomposition.KMeansCoder` object from our work-in-progress dictionary learning toolkit can learn a dictionary from image patches using the K-Means algorithm, with optional local contrast normalization and a PCA whitening transform. Using a trained object to transform data points with orthogonal matching pursuit and the parameter `n_atoms=1` is equivalent to gain-shape VQ. Of course, you are free to use any method of sparse coding, such as LARS. The code used to produce the example images on top of this post can be found in [2].

Using K-Means for learning the dictionary does not optimize over linear combinations of dictionary atoms, like standard dictionary learning methods do. However, it’s considerably faster, and Adam Coates and Andrew Ng suggest in [3] that as long as the dictionary contains a large enough number of atoms and covers the cloud of data (and of future test data) points well enough, then K-Means, or even random sampling of image patches, can perform remarkably well for some tasks.

]]>