In UAV club last semester I was researching CNNs to perform computer vision tasks on our drone. My thought at the time was that a CNN trained on generated data, which would represent all of the possible combinations of targets (along with a bunch of other processing to mimic the camera’s resolution limits, lighting stabilization for different times of day and cloud cover, etc.), would be an excellent classifier. Unfortunately, we didn’t have time to put one together before the deadline. It’s still an intriguing concept that I think is worth far more study.

Because images are convolved hundreds or thousands of times when training or even just running a CNN, it’s important to use a very fast convolution algorithm to speed up the process. The goal of this post is to look at algorithms in Python that can convolve a ~1 Mp image in less than 60 seconds (which is still quite a bit of time). I’ve also included some discussion and comparisons of Numba, which speeds up functions that perform heavy mathematical operations in Python. The idea is to take such operations and place them in their own functions, then use the @jit decorator to compile the function and speed up the process. It’s also possible to stack on another decorator to pre-process, extend functionality, or check the input arguments (as I have).

### Python

We’ll start with a convolution class to hold the various algorithms. The end goal is to create a plot of array size vs. time for each method. We’ll also utilize a few libraries, namely NumPy’s FFT, trange from tqdm to watch the algorithms’ progress, the @jit decorator from Numba, and several threading functions, per Chris Keihl’s post on threading (I’ll talk more about this in a bit).

```python
import numpy as np

from numba import jit
from tqdm import trange, tqdm
from numpy.fft import fft2 as FFT, ifft2 as iFFT
from numpy.fft import rfft2 as rFFT, irfft2 as irFFT
from numpy.fft import fftn as FFTN, ifftn as iFFTN
#from multiprocessing import Pool as ThreadPool
from pathos.multiprocessing import ProcessPool
from psutil import cpu_count


class convolve(object):
    """ contains methods to convolve two images """
    def __init__(self, image_array, kernel):
        self.array = image_array
        self.kernel = kernel

        self.__rangeX_ , self.__rangeY_  = image_array.shape
        self.__rangeKX_, self.__rangeKY_ = kernel.shape

        # to be returned instead of the original
        self.__arr_ = np.zeros(image_array.shape)

        # pad array for convolution
        self.__offsetX_ = self.__rangeKX_ // 2
        self.__offsetY_ = self.__rangeKY_ // 2

        self.array = np.lib.pad(self.array
            , [(self.__offsetY_, self.__offsetY_),
               (self.__offsetX_, self.__offsetX_)]
            , mode='constant'
            , constant_values=0
        )
```

I thought it best to store the shapes as attributes since they’ll be accessed a lot; copious amounts of arithmetic will be performed upon these values in the more-involved algorithms. The array is also padded on each side with half the width of the kernel to allow convolution right to the edges. This preserves the dimensions of the final image.
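As a quick numerical sketch (the 5×5 image and 3×3 kernel here are made up for illustration): padding by half the kernel width on each side means a kernel-sized window sliding over the padded array produces an output with the original image’s dimensions.

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)  # hypothetical 5x5 image
kernel = np.ones((3, 3)) / 9.0                    # 3x3 mean filter

# Pad with half the kernel size on each side, as in __init__
offset = kernel.shape[0] // 2
padded = np.pad(image, offset, mode='constant', constant_values=0)

# A kernel-sized window sliding over the 7x7 padded array
# yields a 5x5 output, matching the original dimensions
out_shape = (padded.shape[0] - kernel.shape[0] + 1,
             padded.shape[1] - kernel.shape[1] + 1)
```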

### spaceConv

I’ll start with a basic implementation of this algorithm; for those who haven’t written it before, I’ll be very explicit and demonstrate how straightforward the concept is.

Suppose we wish to convolve an $N \times M$ array $A$ with an $n \times m$ kernel $K$ (some also refer to it as a filter), where $n$ and $m$ are odd.

As performed in the class __init__, we first pad this array with $\lfloor m/2 \rfloor$ columns of zeros on the left and right sides, and $\lfloor n/2 \rfloor$ rows of zeros on the top and bottom. This method is easiest if the dimensions of $K$ are odd; because we’ve ensured that the dimensions work out nicely, we obtain a padded array $A_p$ of size $(N + n - 1) \times (M + m - 1)$.

As I mentioned at the outset, this new array allows the filter to go to the very edge of the original, and further to preserve the original dimensions of the image.

Now for the convolution step. We compute the following, yielding an $N \times M$ array, which is the same size as $A$ (why we padded it with 0’s):

$$(A * K)[i, j] = \sum_{k=0}^{n-1} \sum_{t=0}^{m-1} K[k, t] \, A_p[i + k,\; j + t].$$

The entry $(A * K)[0, 0]$, for example, is the sum of the entrywise products of $K$ with the top-left $n \times m$ block of $A_p$.

Many people visualize this method as sliding $K$ along $A_p$, performing the matrix dot product, then updating the corresponding entry of the output with that value. This product allows us to rewrite the two inner loops (sums, borrowing a little of Python’s slicing syntax in the equations) as

$$(A * K)[i, j] = \operatorname{sum}\!\left(K \odot A_p[i:i+n,\; j:j+m]\right),$$

but we’ll still need to loop over the kernel at *some level* to produce it.

The only trouble with this algorithm is its poor time complexity of $O(N^2 n^2)$, where $N^2$ is similar to the size of $A$ and $n^2$ is similar to the size of $K$. Programmatically, we will need four nested loops to perform convolution in this manner, which makes it a slow process.

I’ve implemented this algorithm in pure Python as the following method, which is, again, pretty slow.

```python
def spaceConv2(self):
    """ normal convolution, O(N^2*n^2). This is usually too slow """

    # this is the O(N^2) part of this algorithm
    for i in trange(self.__rangeX_):
        for j in xrange(self.__rangeY_):
            # Now the O(n^2) portion
            total = 0.0
            for k in xrange(self.__rangeKX_):
                for t in xrange(self.__rangeKY_):
                    total += self.kernel[k][t] * self.array[i+k, j+t]

            # Update entry in self.__arr_, which is to be returned
            # http://stackoverflow.com/a/38320467/3928184
            self.__arr_[i, j] = total

    return self.__arr_
```

The following additional example is the same as the method above; I’ve just written a nested function to emphasize the dot product I mentioned earlier. In the graph further down it’s apparent that this method is slightly slower than the one above (probably due to the extra function calls).

```python
def spaceConvDot2(self):
    """ Exactly the same as the former method, just contains a
    nested function so the dot product appears more obvious """

    def dot(ind, jnd):
        """ perform a simple 'dot product' between the 2
        dimensional image subsets. """
        total = 0.0

        # This is the O(n^2) part of the algorithm
        for k in xrange(self.__rangeKX_):
            for t in xrange(self.__rangeKY_):
                total += self.kernel[k][t] * self.array[k+ind, t+jnd]

        return total

    # this is the O(N^2) part of the algorithm
    for i in trange(self.__rangeX_):
        for j in xrange(self.__rangeY_):
            self.__arr_[i, j] = dot(i, j)

    return self.__arr_
```

As you can see, there are two outer loops that iterate over the entire original image size, then two inner loops that perform the matrix dot product. As it turns out, these methods are extremely slow.

One way we can speed up the dot function is by using NumPy.

```python
@checkarrays
def dotNumpy(subarray, kernel):
    """ Perform the matrix dot product with NumPy """
    return np.sum(subarray * kernel)
```

We could also *compile* this function with Numba as follows.

```python
@checkarrays
@jit
def dotJit(subarray, kernel):
    """ perform a simple 'dot product' between the 2
    dimensional image subsets. """
    total = 0.0
    for i in xrange(subarray.shape[0]):
        for j in xrange(subarray.shape[1]):
            total += subarray[i][j] * kernel[i][j]
    return total
```

I found in testing that Numba provides a marked improvement over pure Python and NumPy after the first run, as the graph below demonstrates.
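The warm-up effect can be sketched roughly as follows (timings are machine-dependent, and the try/except fallback is only there so the snippet still runs where Numba isn’t installed):

```python
import time
import numpy as np

try:
    from numba import jit
except ImportError:
    jit = lambda f: f  # no-op fallback so the sketch runs without Numba

@jit
def dot_jit(subarray, kernel):
    total = 0.0
    for i in range(subarray.shape[0]):
        for j in range(subarray.shape[1]):
            total += subarray[i, j] * kernel[i, j]
    return total

a = np.random.rand(15, 15)
k = np.random.rand(15, 15)

start = time.time()
dot_jit(a, k)               # first call: includes compilation time
first = time.time() - start

start = time.time()
dot_jit(a, k)               # second call: compiled code only, typically much faster
second = time.time() - start
```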

The @checkarrays decorator has similar functionality to the far more common @accepts decorator – it makes sure the input arrays have the same dimensions.

```python
def checkarrays(f):
    """ Similar to the @accepts decorator """
    def new_f(*args, **kwd):
        # all positional array arguments must share the same shape
        shapes = map(np.shape, args)
        assert all(shape == shapes[0] for shape in shapes)
        return f(*args, **kwd)
    return new_f
```

Such a decorator is useful while testing the algorithms to ensure that the subset and kernel are the same size prior to attempting the product.
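A small usage sketch of the decorator (written here with an all()-based shape check so it also handles more than two arguments):

```python
import numpy as np

def checkarrays(f):
    """ Assert all positional array arguments share one shape """
    def new_f(*args, **kwd):
        shapes = [np.shape(arg) for arg in args]
        assert all(shape == shapes[0] for shape in shapes)
        return f(*args, **kwd)
    return new_f

@checkarrays
def dot_numpy(subarray, kernel):
    return np.sum(subarray * kernel)

ok = dot_numpy(np.ones((3, 3)), np.ones((3, 3)))  # shapes match: fine

try:
    dot_numpy(np.ones((3, 3)), np.ones((4, 4)))   # mismatch is caught
    raised = False
except AssertionError:
    raised = True
```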

Another method we might use to reduce time complexity by a constant factor is threading, but in my own implementations it’ll cost us the brevity of the former methods.

By dividing the image into an even number of *chunks* we should be able to spread the computations over multiple cores. For a small number of cores (my laptop only has 2), the following will suffice; unsurprisingly, most of the lines are dedicated to solving for the optimal division. I’ve written divider (below) to solve for the optimal axis over which to partition.

*(update Aug 10)* It turns out that multiprocessing.dummy is restricted by the GIL, which explains why I wasn’t seeing any performance increase over the regular Numba implementation. I’ve updated the code below (and will continue updating it as I make simplifications and improvements) to use Pathos, which is a great library for parallel computation.
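The chunk-and-merge idea itself can be sketched with the standard library. This uses a *thread* pool, which per the update above stays GIL-bound, so it only demonstrates that partitioned results reassemble correctly, not a speedup; the 8×12 image and 3×3 kernel are made up:

```python
import numpy as np
from multiprocessing.dummy import Pool as ThreadPool  # threads: GIL-bound, demo only

image = np.random.rand(8, 12)
kernel = np.ones((3, 3)) / 9.0
padded = np.pad(image, 1, mode='constant')  # half the 3x3 kernel per side

def conv_rows(bounds):
    """Convolve rows [r0, r1) of the image against the padded array."""
    r0, r1 = bounds
    out = np.zeros((r1 - r0, image.shape[1]))
    for i in range(r0, r1):
        for j in range(image.shape[1]):
            out[i - r0, j] = np.sum(padded[i:i+3, j:j+3] * kernel)
    return r0, r1, out

# Split the rows into one chunk per (hypothetical) core
half = image.shape[0] // 2
chunks = [(0, half), (half, image.shape[0])]

pool = ThreadPool(2)
result = np.zeros_like(image)
for r0, r1, block in pool.map(conv_rows, chunks):
    result[r0:r1] = block
pool.close(); pool.join()

# A single-pass reference computation to check the merged result against
expected = np.array([[np.sum(padded[i:i+3, j:j+3] * kernel)
                      for j in range(image.shape[1])]
                     for i in range(image.shape[0])])
```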

```python
def spaceConvNumbaThreadedOuter2(self):
    """ `Block` threading example """

    def divider(arr_dims, coreNum=1):
        """ Get a bunch of iterable ranges;
        Example input: [[[0, 24], [15, 25]]] """
        if (coreNum == 1):
            return arr_dims
        elif (coreNum < 1):
            raise ValueError(\
          'partitioner expected a positive number of cores, got %d'\
                    % coreNum
            )
        elif (coreNum % 2):
            raise ValueError(\
          'partitioner expected an even number of cores, got %d'\
                    % coreNum
            )

        total = []

        # Split each coordinate in arr_dims in _half_
        for arr_dim in arr_dims:
            dY = arr_dim[0][1] - arr_dim[0][0]
            dX = arr_dim[1][1] - arr_dim[1][0]

            if ((coreNum,)*2 > (dY, dX)):
                coreNum = max(dY, dX)
                coreNum -= 1 if (coreNum % 2 and coreNum > 1) else 0

            new_c1, new_c2, = [], []

            if (dY >= dX):
                # Subimage height is greater than its width
                half = dY // 2
                new_c1.append([arr_dim[0][0], arr_dim[0][0] + half])
                new_c1.append(arr_dim[1])

                new_c2.append([arr_dim[0][0] + half, arr_dim[0][1]])
                new_c2.append(arr_dim[1])
            else:
                # Subimage width is greater than its height
                half = dX // 2
                new_c1.append(arr_dim[0])
                new_c1.append([arr_dim[1][0], half])

                new_c2.append(arr_dim[0])
                new_c2.append([arr_dim[1][0] + half, arr_dim[1][1]])

            total.append(new_c1), total.append(new_c2)

        # If the number of cores is 1, we get back the total; Else,
        # we split each in total, etc.; it's turtles all the way down
        return divider(total, coreNum // 2)

    @checkarrays
    @jit
    def dotJit(subarray, kernel):
        total = 0.0
        for i in xrange(subarray.shape[0]):
            for j in xrange(subarray.shape[1]):
                total += subarray[i][j] * kernel[i][j]
        return total

    # `sub` is needed by outer() below; missing from the imports above
    from operator import sub

    def outer(subset):
        a, b, = subset
        ai, bi, = map(sub, *reversed(zip(*subset)))
        temp = np.zeros((ai, bi))

        for ind, i in enumerate(xrange(*a)):
            for jnd, j in enumerate(xrange(*b)):
                temp[ind, jnd] = dotJit(\
                    self.array[i:i+self.__rangeKX_, j:j+self.__rangeKY_]
                    , self.kernel
                )

        return temp, a, b

    # ProcessPool auto-detects processors, but my function above
    # only accepts an even number; I'm still working on it.
    cores = cpu_count()
    cores -= 1 if (cores % 2 == 1 and cores > 1) else 0

    # Get partitioning indices and the usable number of cores
    shape = [[[0, self.__rangeX_ - 1], [0, self.__rangeY_ - 1]]]
    partitions = divider(shape, cores)

    # Map partitions to threads and process
    pool = ProcessPool(nodes=cores)
    results = pool.map(outer, partitions)
    pool.close()
    pool.join()

    for ind, res in enumerate(results):
        X, Y, = res[1:]
        self.__arr_[slice(*X), slice(*Y)] += res[0]

    return self.__arr_
```

An example convolution with a Gaussian filter (which has a blurring effect) is below. The image array is 1000×596=596,000px, or 0.596Mp (so not too large), and it takes less than 8 seconds to process on my laptop with Numba. Without Numba or NumPy it takes almost 10 minutes!

In the following graph we see the impact of increasing kernel scale on time for each method.

As suggested by the complexity, the amount of time to complete each iteration increases roughly quadratically as the kernel size increases, which is far easier to see in the pure Python implementations of spaceConv.

### OAconv

Because our goal is to convolve every pixel, I don’t think we can get below the $O(N^2)$ part of this complexity. If you’re aware of a method that *does*, please feel free to let me know. This situation is similar to that of the RX algorithm: there I was able to cut down on the overhead cost without too much sway in the resultant covariance matrix, but since we apply the Mahalanobis metric to *every single pixel*, there’s a lower bound of $\Omega(N^2)$. (There are other very interesting posts online that have remarkable visuals and content, e.g. this one on blog.bogatron.net.)

In convolution, the application of a filter to a subset of the array is an ‘overhead cost’. As you may have supposed from the title, we can reduce the time complexity of this inner loop with the Convolution Theorem (CT). In a nutshell, the CT says that, for image functions $f$, $g$,

$$\mathcal{F}\{f * g\} = \mathcal{F}\{f\} \cdot \mathcal{F}\{g\}.$$

In words, the Fourier transform of $f$ convolved with $g$ is equal to the product of the Fourier transforms of the two images. There are a lot of extra details that come with discrete $f$, $g$, but the underlying concept is the same. From NumPy we imported the Fast Fourier Transform (FFT) and inverse Fast Fourier Transform (iFFT) to transform image arrays, and we may multiply arrays together entrywise as we would two numbers.
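A quick numerical check of the theorem (a sketch with small random arrays; note that np.fft computes the *circular* convolution, which is exactly what the padding discussed below works around):

```python
import numpy as np
from numpy.fft import fft2, ifft2

N = 8
a = np.random.rand(N, N)
b = np.random.rand(N, N)

# Multiply in the frequency domain, invert, and keep the real part
freq = ifft2(fft2(a) * fft2(b)).real

# Direct *circular* convolution for comparison (indices wrap modulo N)
direct = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        for k in range(N):
            for t in range(N):
                direct[i, j] += a[k, t] * b[(i - k) % N, (j - t) % N]
```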

The overlap and add convolution algorithm, referred to as OaAconv in [1], uses the FFT to simplify the convolution process to a Hadamard (entrywise) product in the frequency domain, reducing the overall complexity to roughly $O(N^2 \log n)$. An additional overhead cost is incurred by transforming each image subset, then taking the inverse after the entrywise product. While I’ve found resources that give one dimensional examples and figures in a number of places, very few explain the two dimensional case in detail, including [1]. After a little research I came across this description, which clearly outlines the algorithm’s steps.

We first take the padded array (call its dimensions $P_X \times P_Y$) and pad it on each side with enough 0’s that $n_X \mid P_X$ and $n_Y \mid P_Y$, where $n_X \times n_Y$ are the dimensions of the kernel $K$. Total padding along the $x$- and $y$-axes is given by

$$\text{diff}_X = \left(n_X - (P_X \bmod n_X)\right) \bmod n_X$$

and

$$\text{diff}_Y = \left(n_Y - (P_Y \bmod n_Y)\right) \bmod n_Y,$$

respectively. Using the small example from spaceConv above, we see visually that we would have to pad it with a single border of 0’s all the way around. The equations above yield

$$\text{diff}_X = 2,$$

$$\text{diff}_Y = 2.$$
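The padding amounts can be sketched numerically. The sizes below (a hypothetical 10×7 already-edge-padded image with a 3×3 kernel) are made up for illustration; the arithmetic is the same as the diffX/diffY computation in OAconv2 below:

```python
# Hypothetical sizes: a 10x7 (already edge-padded) image and a 3x3 kernel
P_X, P_Y = 10, 7
n_X, n_Y = 3, 3

# Zero-padding needed so each dimension becomes a multiple of the kernel size
diffX = (n_X - P_X % n_X) % n_X
diffY = (n_Y - P_Y % n_Y) % n_Y

# Split the padding as evenly as possible between the two sides
right, left = diffX // 2, diffX - diffX // 2
bottom, top = diffY // 2, diffY - diffY // 2

# After padding, both dimensions divide evenly by the kernel size
assert (P_X + diffX) % n_X == 0 and (P_Y + diffY) % n_Y == 0
```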

Then splitting both of these numbers in half, we obtain

$$\text{right} = \lfloor \text{diff}_X / 2 \rfloor = 1, \qquad \text{left} = \text{diff}_X - \text{right} = 1,$$

$$\text{bottom} = \lfloor \text{diff}_Y / 2 \rfloor = 1, \qquad \text{top} = \text{diff}_Y - \text{bottom} = 1,$$

which is the same as the single border of 0’s we were expecting! The padded array now partitions exactly into four sub-arrays, each the size of the kernel $K$. *We’ll work with the upper two blocks from here on out.*

The next step in this algorithm is to pad each block and the kernel until both are twice the size of the kernel minus one in either direction, i.e. $(2 n_X - 1) \times (2 n_Y - 1)$. In this case that amounts to an additional border of $\lfloor n_X / 2 \rfloor = 1$ zeros around each block, and the same around the kernel.

Because I’d rather not analyze this portion symbolically (it’d be a rather confusing mess of work), I’ll simply state that we now compute the FFT of all three arrays; luckily, we do not have to transform the padded kernel more than once. Then multiply them together entrywise and take the iFFT.

$$A_s * K = \Re\!\left\{ \mathcal{F}^{-1}\!\left( \mathcal{F}\{A_s\} \cdot \mathcal{F}\{K_p\} \right) \right\},$$

where $A_s$ is a padded block and $K_p$ the padded kernel. Because there is always some negligible residual imaginary component left over from the iFFT, I’ve taken the real part. Now to the algorithm in Python 😛
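To see why padding to $2n - 1$ works, here is a hedged numerical sketch (random contents, a made-up 3×3 block and kernel): the FFT product of the zero-padded block and kernel reproduces the *full linear* convolution of the two, with no circular wrap-around. The class below pads with centered borders and pre-inverts the kernel; this sketch uses simple corner padding to keep the indexing obvious.

```python
import numpy as np
from numpy.fft import fft2, ifft2

n = 3
block = np.random.rand(n, n)       # one kernel-sized partition of the image
kernel = np.random.rand(n, n)

# Pad both to (2n-1) x (2n-1) so the circular convolution computed by the
# FFT equals the full linear convolution (no wrap-around)
size = 2 * n - 1
bp = np.zeros((size, size)); bp[:n, :n] = block
kp = np.zeros((size, size)); kp[:n, :n] = kernel

# Take the real part: only a tiny imaginary residue remains
oa = ifft2(fft2(bp) * fft2(kp)).real

# Direct full linear convolution for comparison
direct = np.zeros((size, size))
for i in range(n):
    for j in range(n):
        direct[i:i+n, j:j+n] += block[i, j] * kernel
```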

```python
@staticmethod
def InvertKernel2(kernel):
    """ Invert a kernel for an example """
    X, Y = kernel.shape

    # thanks to http://stackoverflow.com/a/38384551/3928184!
    new_kernel = np.full_like(kernel, 0)
    for i in xrange(X):
        for j in xrange(Y):
            n_i = (i + X // 2) % X
            n_j = (j + Y // 2) % Y
            new_kernel[n_i, n_j] = kernel[i, j]

    return new_kernel

def OAconv2(self):
    """ overlap and add convolution via the FFT """
    self.__rangePX_, self.__rangePY_ = self.array.shape

    diffX = (self.__rangeKX_ - self.__rangePX_ + \
             self.__rangeKX_ * (self.__rangePX_ //\
             self.__rangeKX_)) % self.__rangeKX_

    diffY = (self.__rangeKY_ - self.__rangePY_ + \
             self.__rangeKY_ * (self.__rangePY_ //\
             self.__rangeKY_)) % self.__rangeKY_

    # padding on each side, i.e. left, right, top and bottom;
    # centered as well as possible
    right  = diffX // 2
    left   = diffX - right
    bottom = diffY // 2
    top    = diffY - bottom

    # pad the array
    self.array = np.lib.pad(self.array
        , ((left, right), (top, bottom))
        , mode='constant'
        , constant_values=0
    )

    divX = int(self.array.shape[0] // self.__rangeKX_)
    divY = int(self.array.shape[1] // self.__rangeKY_)

    # a list of tuples to partition the array by
    subsets = [(i*self.__rangeKX_, (i + 1)*self.__rangeKX_,
                j*self.__rangeKY_, (j + 1)*self.__rangeKY_)
               for i in xrange(divX)
               for j in xrange(divY)]

    # padding for individual blocks in the subsets list
    padX = self.__rangeKX_ // 2
    padY = self.__rangeKY_ // 2

    # Add. padding for __arr_ so it can store the results
    self.__arr_ = np.lib.pad(self.__arr_
        , ((left + padX + self.__offsetX_, right + padX + self.__offsetX_),
           (top + padY + self.__offsetY_, bottom + padY + self.__offsetY_))
        , mode='constant'
        , constant_values=0
    )

    kernel = np.pad(self.kernel
        , [(padX, padX), (padY, padY)]
        , mode='constant'
        , constant_values=0
    )

    # thanks to http://stackoverflow.com/a/38384551/3928184!
    # Invert the kernel
    new_kernel = self.InvertKernel2(kernel)
    transf_kernel = FFT(new_kernel)

    # transform each partition and OA on conv_image
    for tup in tqdm(subsets):
        # slice and pad the array subset
        subset = self.array[tup[0]:tup[1], tup[2]:tup[3]]

        subset = np.lib.pad(subset
            , [(padX, padX), (padY, padY)]
            , mode='constant'
            , constant_values=0
        )

        transf_subset = FFT(subset)

        # multiply the two arrays entrywise
        space = iFFT(transf_subset * transf_kernel).real

        # overlap with indices and add them together
        self.__arr_[tup[0]:tup[1] + 2 * padX,
                    tup[2]:tup[3] + 2 * padY] += space

    # crop image and get it back, convolved
    return self.__arr_[
        padX + left + self.__offsetX_
            :padX + left + self.__offsetX_ + self.__rangeX_,
        padY + bottom + self.__offsetY_
            :padY + bottom + self.__offsetY_ + self.__rangeY_
    ]
```

I would like to split this process into multiple threads eventually, but due to the unexpected results with Numba above, I think I’ll wait until I’m able to resolve them. By superposition in the OA portion of this algorithm, it should be very straightforward.

This method’s time complexity is thrown off a little because NumPy isn’t written completely in Python. Also, fewer iterations are necessary as the kernels get larger, which makes it very fast for larger kernels.

The shape of the timing curve reflects this: far more operations are performed inherently by NumPy than by Python as the kernels get larger, and fewer iterations are necessary to cover the padded image array, as I’ve already mentioned.

I would also like to thank the user ead on SO for helping me troubleshoot a problem that stalled progress for several weeks at the beginning of the summer.

For a comparison with this method I’ll try convolving with scipy.ndimage.filters.convolve and scipy.signal.fftconvolve.
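As a hedged preview of that comparison (assuming SciPy is available): both routines compute a true convolution, and with an odd-sized kernel and zero-filled borders they agree with each other, which makes them a good reference for the methods above. The sizes here are made up.

```python
import numpy as np
from scipy.ndimage import convolve as ndconvolve
from scipy.signal import fftconvolve

image = np.random.rand(64, 64)
kernel = np.random.rand(5, 5)

# Direct convolution; constant zero-padding to match this post's approach
direct = ndconvolve(image, kernel, mode='constant', cval=0.0)

# FFT-based convolution; mode='same' keeps the output the size of the image
freq = fftconvolve(image, kernel, mode='same')
```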

### Sources

[1] *Very Efficient Training of Convolutional Neural Networks using Fast Fourier Transform and Overlap-and-Add*, by T. Highlander and A. Rodriguez