Voyageur Technologique

Faster Scientific Python

When you're computation-bound, what are your options?

[epistemic status: hastily written, high-level overview lacking practical grounding]

Bottlenecks

Python isn’t known for its speed. It has a purposefully dumb interpreter which gets blown out of the water (in terms of throughput, memory consumption, and start-up speed) by almost all compiled languages. This is usually fine, because you’re just trying to ship a gosh-darned website or quickly whip together a data-cleaning pipeline. Typically, other bottlenecks, such as network speed, memory access times, algorithmic approach, or the ability to scale across multiple machines, take precedence.

But sometimes the throughput of your code and its memory consumption on a single CPU-only machine are the primary limitation. In my experience, this typically happens with scientific code. Specifically, when you’re operating on large collections of array-structured data, often using the Numpy and Pandas libraries.

For example, Python has been a computational bottleneck for me when I was implementing Dynamic Time Warping and when developing the Nengo CPU backend.

Getting the details right

Before going wild with alternative interpreters, libraries or even a totally new language, it’s important to make sure you’re using Numpy and Pandas correctly.

Proper use of Numpy and Pandas APIs

I assume you’ve already seen for yourself how Numpy and Pandas are faster than raw Python code. The internals of how they achieve this are outside the scope of this post. Basically, Pandas and Numpy store tables and arrays in a compact, contiguous format that allows quick access from memory, takes up little space, and lets operations run in parallel using specialized CPU instructions.

However, Numpy and Pandas are only fast for certain use cases. Specifically, when writing Numpy and Pandas code, it is important to:

  1. Avoid loops; they’re slow and often unnecessary.

  2. If you must loop, use built-in methods, such as groupby() and apply(), instead of iterating over the structures yourself.

  3. Operate on the whole array, Series, or DataFrame without indexing or iteration. This is typically called “vectorisation”.
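
To make the difference concrete, here is a minimal sketch contrasting a row-by-row Python loop with the vectorised equivalents. The column names and data are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data; the column names and sizes are made up for illustration.
df = pd.DataFrame({
    "group": np.random.choice(["a", "b", "c"], size=100_000),
    "value": np.random.rand(100_000),
})

# Slow: an explicit Python loop over rows.
total = 0.0
for _, row in df.iterrows():
    total += row["value"] ** 2

# Fast: a vectorised operation on the whole Series at once.
total = (df["value"] ** 2).sum()

# If you need per-group aggregates, use the built-in groupby()
# instead of looping over the unique keys yourself.
per_group_mean = df.groupby("group")["value"].mean()
```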

This is covered in greater depth in this Pandas blog post and this Numpy blog post. If you’re using the Numpy and Pandas libraries, you gotta use them right.

If you’re having trouble with vectorisation or using features correctly, I recommend posting questions on Code Review StackExchange. It helped me understand Numpy vectorisation techniques and Pandas groupby().

Backends

Pandas and Numpy are essentially high-level interfaces to libraries of highly optimized numerical routines, such as MKL, OpenBLAS, and ATLAS. The performance of these libraries is extremely hardware-dependent. For example, MKL tends to perform poorly on AMD hardware and should be replaced with OpenBLAS there. In my experience, the easiest way to do this is by installing Numpy with conda. However, if you really need every ounce of performance possible, each backend should be profiled and considered.
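
If you’re unsure which backend your installation is linked against, Numpy can report it (the exact output format varies between Numpy versions):

```python
import numpy as np

# Print the BLAS/LAPACK libraries this Numpy build was linked against
# (MKL, OpenBLAS, etc.); useful before deciding whether to switch.
np.show_config()
```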

Sometimes using vectorized Numpy code isn’t enough and you need to get weird. For example, Nengo implemented a computational graph compiler, discussed in this PR and this paper. Basically, many small Numpy computations were combined into fewer Numpy computations operating on larger arrays. This allowed the code to spend less time in Python and more time running Numpy routines.
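
As a rough sketch of the idea (not Nengo’s actual implementation), compare issuing many small matrix-vector products from Python with stacking the operands into a single batched Numpy call:

```python
import numpy as np

# Many small operations: most of the runtime is Python call overhead.
weights = [np.random.rand(8, 8) for _ in range(1000)]
inputs = [np.random.rand(8) for _ in range(1000)]
outputs = [w @ x for w, x in zip(weights, inputs)]

# Fewer, larger operations: stack the operands once and let Numpy do
# all the work in a single batched call.
W = np.stack(weights)              # shape (1000, 8, 8)
X = np.stack(inputs)               # shape (1000, 8)
outputs = np.einsum("nij,nj->ni", W, X)
```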

Although carefully using and configuring Numpy and Pandas gets you quite far, you may eventually need to attack the speed of the Python interpreter itself.

Desiderata for faster Python

Ideally, any method that makes Python faster should have the following features:

  • Easy to integrate with Python
  • Easy to debug, ideally with a graphical debugger
  • Has a good community, either on StackOverflow or a custom forum

I: Using a different interpreter

Python’s default interpreter is CPython. There are alternative interpreters, such as Jython (for integrating with Java applications) and IronPython (for integrating with .NET applications), but the most relevant for performance is PyPy. PyPy makes Python a JIT-compiled language, which greatly improves its performance, both in terms of computational throughput and memory usage. However, PyPy has limited compatibility with the CPython C API, which means certain modules cannot be run. For example, at the time of writing, Matplotlib is not supported, but support for Numpy was recently added.

This was attempted with Nengo a couple of years ago, but certain Numpy functionality was not supported. I would definitely try it for a new project.

II: Staying in the Python ecosystem

mypyc

Python has supported type annotations (also called “type hints”) since version 3.5, and MyPy is the most widely used tool for checking them. Although the CPython interpreter itself cannot be made faster with type hints, mypyc can use them to compile annotated Python code into C extensions.
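
For example, an ordinary annotated function like the following (a hypothetical sketch, not taken from any real project) can be compiled by running mypyc on the module, with no changes to the source:

```python
# fib.py: plain type-annotated Python; `mypyc fib.py` compiles it into
# a C extension that can then be imported like any other module.
def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```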

This is currently being used in Black, the opinionated Python auto-formatter.

Cython

Cython is (roughly) a superset of Python: if your Python code runs, your Cython code will run too. Debugging it, however, is harder. Cython has explicit support for Numpy and Pandas datatypes.
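
As a minimal sketch, using Cython’s “pure Python” mode so the file still runs under plain CPython (the function and its types here are hypothetical), typed locals are what let Cython generate fast C code when the module is compiled:

```python
import cython

def squared_sum(values) -> float:
    # The typed locals below let Cython generate a tight C loop when the
    # module is compiled; under plain CPython they are simply ignored.
    total: cython.double = 0.0
    i: cython.Py_ssize_t
    for i in range(len(values)):
        total += values[i] * values[i]
    return total
```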

There is cygdb, but that isn’t a graphical debugger. I never got comfortable with command-line debuggers, but Cython may force me to finally take the plunge.

Numba

Numba is weird and I don’t understand it. But it helped make Nengo way faster.
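
As far as I understand it, the core idea is just-in-time compilation: a decorated function is compiled to machine code the first time it is called. A minimal sketch (not Nengo’s code):

```python
import numpy as np
from numba import njit

@njit
def squared_sum(values):
    # Compiled to machine code on first call, so this explicit loop runs
    # at roughly C speed rather than interpreter speed.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total

squared_sum(np.random.rand(1_000_000))
```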

For more on using Pandas with Cython and Numba, see the official docs.

Summary: There is no panacea

If you aren’t using incompatible libraries, PyPy is a good idea.

If you have a small standalone function, you should probably start with Cython or Numba.

mypyc is a really exciting tool, but it is still experimental.

If the function causing the bottleneck looks like it will keep expanding in scope, you probably want to move to another language. However, that’s a topic for another blog post, once I’ve acquired more practical experience in the domain.