Spatial and OA Convolution in Python

In UAV club last semester I was researching CNNs to perform computer vision tasks on our drone. My thought at the time was that a CNN trained on generated data, which would represent all of the possible combinations of targets (along with a bunch of other processing to mimic the camera’s resolution limits, lighting stabilization for different times of day and cloud cover, etc.), would be an excellent classifier. Unfortunately, we didn’t have time to put one together before the deadline. It’s still an intriguing concept that I think is worth far more study.

Because images are convolved many hundreds or thousands of times to either train or simply run a CNN, it’s important to use a very fast convolution algorithm to speed up the process. The goal of this post is to talk about algorithms in Python that will convolve a ~1Mp image in less than 60 seconds (which is still quite a bit of time). I’ve also included some discussion and comparisons of using Numba, which speeds up functions that perform heavy mathematical operations in Python. The idea is to take such operations and place them in their own functions, then use the @jit decorator to compile them and speed up the process. It’s also possible to add another decorator to pre-process, extend functionality, or check the input arguments (as I have).

Python

We’ll start with a convolution class to hold the various algorithms. The end goal is to create a plot of array size vs. time for each method. We’ll also utilize a few libraries, namely NumPy’s FFT, trange from tqdm to watch the algorithms’ progress, the @jit decorator from Numba, and several threading functions, per Chris Kiehl’s post on threading (I’ll talk more about this in a bit).

I thought it best to store the shapes as attributes since they’ll be accessed a lot; copious amounts of arithmetic will be performed upon these values in the more-involved algorithms. The array is also padded on each side with half the width of the kernel to allow convolution right to the edges. This preserves the dimensions of the final image.
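The original class isn’t reproduced in this extract, so here is a minimal sketch of what its __init__ might look like; the class and attribute names are my own guesses, not the original interface.

```python
import numpy as np

class Convolve(object):
    """A minimal sketch of the convolution class described above; the
    class and attribute names here are my own, not the original's."""
    def __init__(self, image, kernel):
        self.kernel = kernel
        # Store the shapes as attributes, since they're accessed a lot.
        self.rangeX, self.rangeY = image.shape
        self.rangeKX, self.rangeKY = kernel.shape
        # Pad each side with half the kernel width so the filter can
        # reach the very edges, preserving the final image dimensions.
        padX, padY = self.rangeKX // 2, self.rangeKY // 2
        self.array = np.pad(image, ((padX, padX), (padY, padY)),
                            mode='constant')
```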

spaceConv

I’ll start with a basic implementation of this algorithm; for those who haven’t written it before, I’ll be very explicit and demonstrate how straightforward the concept is.

Suppose we wish to convolve an array

A=\begin{bmatrix}a_{11}&a_{12}&a_{13}&a_{14}\\a_{21}&a_{22}&a_{23}&a_{24}\\a_{31}&a_{32}&a_{33}&a_{34}\\a_{41}&a_{42}&a_{43}&a_{44}\end{bmatrix}

with a kernel (some also refer to it as a filter)

\Phi=\begin{bmatrix}\phi_{11}&\phi_{12}&\phi_{13}\\\phi_{21}&\phi_{22}&\phi_{23}\\\phi_{31}&\phi_{32}&\phi_{33}\end{bmatrix}.

As performed in the class __init__, we first pad this array with \lfloor\dfrac{\text{col}\,(\Phi)}{2}\rfloor=1 zeros on the left and right sides, and \lfloor\dfrac{\text{row}\,(\Phi)}{2}\rfloor=1 zeros on the top and bottom. This method is easiest if the dimensions of \Phi are odd; because we’ve ensured that the dimensions work out nicely, we obtain an array

\tilde{A}=\begin{bmatrix}0&0&0&0&0&0\\0&a_{11}&a_{12}&a_{13}&a_{14}&0\\0&a_{21}&a_{22}&a_{23}&a_{24}&0\\0&a_{31}&a_{32}&a_{33}&a_{34}&0\\0&a_{41}&a_{42}&a_{43}&a_{44}&0\\0&0&0&0&0&0\end{bmatrix}.

As I mentioned at the outset, this new array allows the filter to go to the very edge of the original, and further to preserve the original dimensions of the image.

Now for the convolution step. We compute the following, yielding a 4\times 4 array, which is the same size as A (this is why we padded it with 0’s).

(\tilde{A}\otimes\Phi)_{ij}=\displaystyle\sum_{k=1}^{\text{row}\,(\Phi)}\displaystyle\sum_{t=1}^{\text{col}\,(\Phi)}\tilde{A}_{k+i-1,t+j-1}\Phi_{kt}

=\displaystyle\sum_{k=1}^{3}\displaystyle\sum_{t=1}^{3}\tilde{A}_{k+i-1,t+j-1}\Phi_{kt}.

The entry (\tilde{A}\otimes\Phi)_{11}, for example, would be

\begin{matrix}&0\times\phi_{11}&+&0\times\phi_{12}&+&0\times\phi_{13}\\+&0\times\phi_{21}&+&a_{11}\times\phi_{22}&+&a_{12}\times\phi_{23}\\+&0\times\phi_{31}&+&a_{21}\times\phi_{32}&+&a_{22}\times\phi_{33}\end{matrix}

Many people visualize this method as sliding \Phi along \tilde{A}, performing the matrix dot product, then updating the corresponding entry of A with that value. This dot product allows us to rewrite the two sums (borrowing a little of Python’s slicing syntax in the equation) as

(\tilde{A}\otimes\Phi)_{ij}=\tilde{A}[\,i-1:i+\text{row}\,(\Phi)-1,\;j-1:j+\text{col}\,(\Phi)-1\,]:\Phi,

where : denotes the entry-wise product summed over all entries (the Frobenius inner product),

but we’ll still need to loop over the kernel at some level to produce it.

The only trouble with this algorithm is the poor time complexity of \mathcal{O}(N^2n^2), where N^2 is the number of entries in A and n^2 is the number of entries in \Phi. Programmatically, we will need four nested loops to perform convolution in this manner, which makes it a slow process.

I’ve implemented this algorithm in pure Python as the following method, which is, again, pretty slow.
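The original method isn’t reproduced in this extract, so here is a reconstruction of what such a pure-Python method looks like (written as a standalone function over the pre-padded array; in the post it’s a method on the class):

```python
from tqdm import trange

def spaceConv(array, kernel, rangeX, rangeY):
    """Naive spatial convolution over a pre-padded array: four nested
    loops in pure Python."""
    rangeKX, rangeKY = len(kernel), len(kernel[0])
    result = [[0.0] * rangeY for _ in range(rangeX)]
    for i in trange(rangeX):          # outer loops: every output pixel
        for j in range(rangeY):
            total = 0.0
            for k in range(rangeKX):  # inner loops: the kernel
                for t in range(rangeKY):
                    total += array[i + k][j + t] * kernel[k][t]
            result[i][j] = total
    return result
```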

The following example is the same as the method above, except I’ve written a nested function to emphasize the dot product I mentioned earlier. In the graph further below it’s apparent that this version is slightly slower than the one above (probably due to the extra N^2 function calls).
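Again, this is my reconstruction rather than the original code:

```python
from tqdm import trange

def spaceConvDot(array, kernel, rangeX, rangeY):
    """The same four loops, with the inner two factored into a nested
    dot() to emphasize the matrix dot product."""
    rangeKX, rangeKY = len(kernel), len(kernel[0])

    def dot(subset, kernel):
        # Multiply two equally sized arrays entry-wise, then sum.
        return sum(subset[k][t] * kernel[k][t]
                   for k in range(rangeKX) for t in range(rangeKY))

    result = [[0.0] * rangeY for _ in range(rangeX)]
    for i in trange(rangeX):
        for j in range(rangeY):
            subset = [row[j:j + rangeKY] for row in array[i:i + rangeKX]]
            result[i][j] = dot(subset, kernel)
    return result
```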

As you can see, there are two outer loops that iterate over the entire original image, then two inner loops that perform the matrix dot product. As it turns out, these methods are extremely slow.

One way we can speed up the dot function is by using NumPy.
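For instance, the entry-wise product and sum can be handed off to NumPy entirely; a sketch:

```python
import numpy as np

def dot_numpy(subset, kernel):
    """Entry-wise product followed by a sum, vectorized by NumPy."""
    return np.sum(subset * kernel)
```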

We could also compile this function with Numba as follows.
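Here is a sketch of the Numba-compiled version; the decorator arguments are one reasonable choice, not necessarily the original’s.

```python
from numba import jit

@jit(nopython=True)
def dot_jit(subset, kernel):
    """The same dot product, compiled to machine code by Numba on its
    first call (hence the slow first run noted below)."""
    total = 0.0
    for k in range(subset.shape[0]):
        for t in range(subset.shape[1]):
            total += subset[k, t] * kernel[k, t]
    return total
```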

I found in testing that Numba provides a marked improvement over pure Python and NumPy after the first run, as the graph below demonstrates.

[Figure: numba_comparison — dot-function runtimes for pure Python, NumPy, and Numba]

After the first run, Numba out-paced NumPy by about an order of magnitude, and both out-paced pure Python by several orders of magnitude.

The @checkarray decorator has similar functionality to the far more common @accepts decorator – it makes sure the input arrays have the same dimensions.

Such a decorator is useful while testing the algorithms to ensure that the subset and kernel are the same size prior to attempting the product.
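The original decorator isn’t shown in this extract; below is a minimal sketch of one that performs the described check.

```python
from functools import wraps

def checkarray(func):
    """Ensure the two input arrays have the same dimensions before
    attempting the product; a sketch of the decorator described above."""
    @wraps(func)
    def wrapper(subset, kernel):
        assert subset.shape == kernel.shape, \
            'subset and kernel must have the same dimensions'
        return func(subset, kernel)
    return wrapper
```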

Another method we might use to reduce the time complexity by a constant factor is threading, though in my implementation it costs us the brevity of the former methods.

By dividing the image into an even number of chunks we should be able to spread the computations over multiple cores. For a small number of cores (my laptop only has 2), the following will suffice; unsurprisingly, most of the lines are dedicated to solving for the optimal division. I’ve written divider (below) to solve for the optimal axis over which to partition.

(update Aug 10) It turns out that multiprocessing.dummy is restricted by the GIL, which explains why I wasn’t seeing any performance increase over the regular Numba implementation. I’ve updated the code below (and will continue updating it as I make simplifications and improvements) to use Pathos, which is a great library for parallel computation.
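The updated code isn’t reproduced in this extract; the sketch below shows the general idea, with divider reduced to a bare-bones axis choice and the chunk-boundary handling only noted in a comment, so it’s a simplification of, not a substitute for, the original.

```python
import numpy as np
from pathos.multiprocessing import ProcessingPool as Pool

def divider(shape, workers=2):
    """Choose the axis along which splitting yields the most even
    chunks (a bare-bones stand-in for the optimal-division logic)."""
    return 0 if shape[0] % workers <= shape[1] % workers else 1

def threaded_conv(array, kernel, conv, workers=2):
    """Split the pre-padded image into chunks, convolve each chunk in
    its own process, then stitch the results back together."""
    axis = divider(array.shape, workers)
    chunks = np.array_split(array, workers, axis=axis)
    # NOTE: for a seam-free result each chunk also needs a halo of
    # kernel-width // 2 rows/columns from its neighbors; omitted here.
    results = Pool(nodes=workers).map(lambda c: conv(c, kernel), chunks)
    return np.concatenate(results, axis=axis)
```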

An example convolution with a Gaussian filter (which has a blurring effect) is below. The image array is 1000×596=596,000px, or 0.596Mp (so not too large), and it takes less than 8 seconds to process on my laptop with Numba. Without Numba or NumPy it takes almost 10 minutes!

[Figure: convolved_spider_jit_1]

An example image convolved with Numba and a 17×17 Gaussian kernel.

In the following graph we see the impact of increasing kernel scale on time for each method.

[Figure: Kernel_Size_vs_spaceConv_time1 — kernel size vs. spaceConv runtime for each method]

Numba yields a nearly constant amount of time per iteration, with NumPy an almost constant half order of magnitude slower; threading Numba with Pathos on subsets of the original image makes the process faster still. The pure Python implementations are nearly unusable outside of small kernel sizes.

As suggested by the complexity, the amount of time to complete each iteration increases roughly quadratically as the kernel size increases, which is far easier to see in the pure Python implementations of spaceConv.

OAconv

Because our goal is to convolve every pixel, I don’t think we can get below the N^2 part of this complexity. If you’re aware of a method that does, please feel free to let me know. The situation is similar to that of the RX algorithm: there I was able to cut down on the overhead cost without affecting the resultant covariance matrix too much, but since we apply the Mahalanobis metric to every single pixel, there’s still a lower bound of \Omega(N^2). (There are other very interesting posts online with remarkable visuals and content, e.g. this one on blog.bogatron.net.)

In convolution, the application of the filter to each subset of the array is just such an overhead cost. As you may have supposed from the title, we can reduce the time complexity of this inner loop with the Convolution Theorem (CT). In a nutshell, the CT says that, for image functions I_{1}, I_{2},

I_{1}\otimes I_{2}=\mathscr{F}^{-1}[\mathscr{F}[I_1\otimes I_2]]

=\mathscr{F}^{-1}[\mathscr{F}[I_{1}]\times\mathscr{F}[I_{2}]].

In words, the Fourier transform \mathscr{F} of I_{1} convolved with I_{2} is equal to the product of the Fourier transforms of the two images. There are a lot of extra details that come with discrete I_{1}, I_{2}, but the underlying concept is the same. From NumPy we imported the Fast Fourier Transform (FFT) and inverse Fast Fourier Transform (iFFT) to transform image arrays, and transformed arrays may then be multiplied together entry-wise, much as we would two numbers.
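As a quick sanity check of the theorem (a toy example of my own, using circular convolution, which is what the discrete FFT computes natively):

```python
import numpy as np
from numpy.fft import fft2, ifft2

A, Phi = np.random.rand(8, 8), np.random.rand(8, 8)

# Left-hand side: multiply the transforms entry-wise, then invert.
ct = np.real(ifft2(fft2(A) * fft2(Phi)))

# Right-hand side: direct circular convolution, straight from the sum.
direct = np.zeros((8, 8))
for i in range(8):
    for j in range(8):
        for k in range(8):
            for t in range(8):
                direct[i, j] += A[k, t] * Phi[(i - k) % 8, (j - t) % 8]

print(np.allclose(ct, direct))  # True: both sides agree
```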

The overlap and add convolution algorithm, referred to as OaAconv in [1], uses the FFT to simplify the convolution process to a Hadamard product in the frequency domain, reducing the complexity to \mathcal{O}(N^2\log(n)). An additional overhead cost is incurred by transforming each image subset, then taking the inverse after the entry-wise product. While I’ve found one dimensional examples and figures in a number of places, very few resources, including [1], explain the two dimensional case in detail. After a little research I came across this description, which clearly outlines the algorithm’s steps.

We first take the array A and pad it on each side with enough 0’s so that \text{row}\,(\Phi)\,|\,\text{row}\,(\tilde{A}) and \text{col}\,(\Phi)\,|\,\text{col}\,(\tilde{A}). Total padding along the x- and y-axes is given by

X_{\text{pad}}=\left(\text{row}\,(\Phi)-\text{row}\,(A)+\text{row}\,(\Phi)\left\lfloor\dfrac{\text{row}\,(A)}{\text{row}\,(\Phi)}\right\rfloor\right)\bmod\text{row}\,(\Phi)

and

Y_{\text{pad}}=\left(\text{col}\,(\Phi)-\text{col}\,(A)+\text{col}\,(\Phi)\left\lfloor\dfrac{\text{col}\,(A)}{\text{col}\,(\Phi)}\right\rfloor\right)\bmod\text{col}\,(\Phi),

respectively. Using A above, we see visually that we would have to pad it with a single border of 0‘s all the way around. The equations above yield

X_{\text{pad}}=\left(3-4+3\times\lfloor 4/3\rfloor\right)\bmod 3=2,

Y_{\text{pad}}=\left(3-4+3\times\lfloor 4/3\rfloor\right)\bmod 3=2.

Then splitting both of these numbers in half, we obtain

X_{\text{pad left}}=X_{\text{pad}}//2=1,

X_{\text{pad right}}=X_{\text{pad}}-X_{\text{pad left}}=1,

Y_{\text{pad top}}=Y_{\text{pad}}//2=1,

Y_{\text{pad bottom}}=Y_{\text{pad}}-Y_{\text{pad top}}=1,

which is the same as the single border of 0‘s we were expecting! Thus A becomes, again,

\tilde{A}=\begin{bmatrix}0&0&0&0&0&0\\0&a_{11}&a_{12}&a_{13}&a_{14}&0\\0&a_{21}&a_{22}&a_{23}&a_{24}&0\\0&a_{31}&a_{32}&a_{33}&a_{34}&0\\0&a_{41}&a_{42}&a_{43}&a_{44}&0\\0&0&0&0&0&0\end{bmatrix},

which partitions into four 3\times 3 sub-arrays (the four quadrants), each a block the size of the kernel \Phi. We’ll work with the upper two blocks from here on out.

The next step in this algorithm is to pad the arrays and kernel until both are twice the size of the kernel minus one in either direction. In this case that amounts to an additional border of 0‘s (since 2\,\text{col}\,(\Phi)-1=2\,\text{row}\,(\Phi)-1=5), yielding the arrays

\tilde{A}'_{1}=\begin{bmatrix}0&0&0&0&0\\0&0&0&0&0\\0&0&a_{11}&a_{12}&0\\0&0&a_{21}&a_{22}&0\\0&0&0&0&0\end{bmatrix},\qquad\tilde{A}'_{2}=\begin{bmatrix}0&0&0&0&0\\0&0&0&0&0\\0&a_{13}&a_{14}&0&0\\0&a_{23}&a_{24}&0&0\\0&0&0&0&0\end{bmatrix},

and a kernel

\Phi'=\begin{bmatrix}0&0&0&0&0\\0&\phi_{11}&\phi_{12}&\phi_{13}&0\\0&\phi_{21}&\phi_{22}&\phi_{23}&0\\0&\phi_{31}&\phi_{32}&\phi_{33}&0\\0&0&0&0&0\end{bmatrix}.

Because I’d rather not analyze this portion symbolically (it’d be a rather confusing mess of work), I’ll simply state that we now compute the FFT of all three arrays; luckily, we do not have to transform the padded kernel more than once. Then multiply them together entry-wise and take the iFFT.

t=\Re(\text{iFFT}\,(\text{FFT}\,(\tilde{A}')\odot\text{FFT}\,(\Phi'))).

Because there is always some negligible residual imaginary component left over from the iFFT, I’ve taken the real part. Now to the algorithm in Python 😛
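The original method isn’t reproduced in this extract; below is a minimal sketch of the algorithm just described, assuming an odd, square n×n kernel, with names of my own choosing.

```python
import numpy as np
from numpy.fft import fft2, ifft2

def OAconv(A, phi):
    """Overlap-and-add convolution: a minimal sketch assuming an odd,
    square n x n kernel; names are my own, not the original method's."""
    n = phi.shape[0]
    R, C = A.shape

    # Pad A so both dimensions are divisible by the kernel size.
    xpad = (n - R + n * (R // n)) % n
    ypad = (n - C + n * (C // n)) % n
    Apad = np.pad(A, ((xpad // 2, xpad - xpad // 2),
                      (ypad // 2, ypad - ypad // 2)), mode='constant')

    # Transform the kernel once, zero-padded to 2n - 1 per side.
    m = 2 * n - 1
    phi_f = fft2(phi, s=(m, m))

    # Accumulator big enough for every block's full convolution.
    out = np.zeros((Apad.shape[0] + n - 1, Apad.shape[1] + n - 1))

    # FFT each n x n block, multiply entry-wise, invert, then
    # overlap-add the (2n - 1) x (2n - 1) result at the block's offset.
    for i in range(0, Apad.shape[0], n):
        for j in range(0, Apad.shape[1], n):
            block_f = fft2(Apad[i:i + n, j:j + n], s=(m, m))
            out[i:i + m, j:j + m] += np.real(ifft2(block_f * phi_f))

    # Crop back to the original image size.
    x0, y0 = n // 2 + xpad // 2, n // 2 + ypad // 2
    return out[x0:x0 + R, y0:y0 + C]
```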

I would like to split this process into multiple threads eventually, but due to the unexpected results with Numba above, I think I’ll wait until I’m able to resolve them. By superposition in the OA portion of this algorithm, it should be very straightforward.

This method’s measured times stray a little from the theoretical complexity because NumPy isn’t written completely in Python. Also, fewer iterations are necessary as the kernels get larger, which makes it very fast for larger kernels.

[Figure: Kernel_Size_vs_OAconv_time — kernel size vs. OAconv runtime]

The curve’s shape resembles 1/n^2: as the kernels get larger, far more of the operations are performed internally by NumPy rather than Python, and fewer iterations are necessary to cover the padded image array, as I’ve already mentioned.

[Figure: convolved_spider_oaconv_1]

An example OA convolved image with an 18×18 kernel.

I would also like to thank the user ead on SO for helping me troubleshoot a problem that stalled progress for several weeks at the beginning of the summer.

For a comparison with this method I’ll try convolving with scipy.ndimage.filters.convolve and scipy.signal.fftconvolve.
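A sketch of such a comparison (the array and kernel here are just stand-ins for timing):

```python
import numpy as np
from scipy.ndimage import convolve      # scipy.ndimage.filters.convolve
from scipy.signal import fftconvolve

A   = np.random.rand(1000, 596)         # an image-sized test array
phi = np.ones((17, 17)) / 17 ** 2       # a normalized box kernel

spatial  = convolve(A, phi, mode='constant', cval=0.0)
spectral = fftconvolve(A, phi, mode='same')
print(np.allclose(spatial, spectral))   # True up to float error
```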

[Figure: comparison-2]

A comparison of various SciPy convolution algorithms against my Python implementation of OA convolution. Both fftconvolve and convolve are wicked fast!

Sources

[1] Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add, by T. Highlander and A. Rodriguez