As an end-of-semester project I would like to investigate the mathematics of Gaussian and Moffat distributions, which are used to model stars in astronomical data. My inspiration for writing this post came from reading [1] and [5], where many interesting analytic results are presented.

Specifically, in this post we’ll generate example PSFs and sources therefrom, extract stellar sources from images taken by the SDSS, and fit Gaussian and Moffat functions to each using a linear gradient descent (GD) algorithm.

### The Gaussian Distribution

We will consider the bivariate (two dimensional) case of the multivariate normal distribution, since image arrays are usually two dimensional arrays of pixels:

$$G(\mathbf{x}) = \frac{1}{2\pi\sqrt{|\boldsymbol{\Sigma}|}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),$$

where the covariance matrix will be defined as

$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix},$$

so that

$$\boldsymbol{\Sigma}^{-1} = \begin{pmatrix} 1/\sigma_x^2 & 0 \\ 0 & 1/\sigma_y^2 \end{pmatrix}, \qquad |\boldsymbol{\Sigma}| = \sigma_x^2\sigma_y^2,$$

and thus the exponent becomes

$$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = -\frac{(x-\mu_x)^2}{2\sigma_x^2} - \frac{(y-\mu_y)^2}{2\sigma_y^2}.$$

Therefore, we arrive at a simplified distribution

$$G(x, y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left(-\frac{(x-\mu_x)^2}{2\sigma_x^2} - \frac{(y-\mu_y)^2}{2\sigma_y^2}\right),$$

which is basically an axis-aligned elliptical normal distribution. Furthermore, because the exponential factors over $x$ and $y$, we see that

$$\iint_{\mathbb{R}^2} G\,dx\,dy = \frac{1}{2\pi\sigma_x\sigma_y}\int_{-\infty}^{\infty} e^{-\frac{(x-\mu_x)^2}{2\sigma_x^2}}\,dx\int_{-\infty}^{\infty} e^{-\frac{(y-\mu_y)^2}{2\sigma_y^2}}\,dy,$$

and by parameterizing $u = (x-\mu_x)/\sqrt{2}\sigma_x$ and $v = (y-\mu_y)/\sqrt{2}\sigma_y$, we obtain

$$\iint_{\mathbb{R}^2} G\,dx\,dy = \frac{1}{\pi}\iint_{\mathbb{R}^2} e^{-(u^2+v^2)}\,du\,dv.$$

Thence, converting to polar coordinates, the former result is

$$\frac{1}{\pi}\int_0^{2\pi}\!\!\int_0^{\infty} e^{-r^2}\,r\,dr\,d\theta,$$

and by yet another substitution ($s = r^2$) one may simplify this integral drastically to

$$\frac{1}{\pi}\cdot 2\pi\cdot\frac{1}{2}\int_0^{\infty} e^{-s}\,ds = 1.$$

If $\sigma_x = \sigma_y = 1$ and $\mu_x = \mu_y = 0$ (to obtain the “standard normal distribution”), then the volume of $G$ is, again, unity. This is a widely understood identity that easily generalizes to higher dimensions (e.g. see [4] in my last post). On another mathematical note, in reference to [6], I suspect the multivariate normal distribution may aid in determining the value of $\Gamma_p$, the multivariate gamma function [I will have to investigate this in a future post].

Because we will only be considering the circular Moffat function in the next section, here we will restrict this function even further by setting $\sigma_x = \sigma_y = \sigma$. This also means that

$$\boldsymbol{\Sigma} = \sigma^2 I,$$

where $I$ is the $2\times 2$ identity matrix. If you’ll recall from statistics, when $\boldsymbol{\Sigma} = \sigma^2 I$ we just obtain the (scaled) Euclidean metric from $(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = \lVert\mathbf{x}-\boldsymbol{\mu}\rVert^2/\sigma^2$, so the entire equation simplifies to

$$G(x, y) = \frac{1}{2\pi\sigma^2}\exp\left(-\frac{\lVert\mathbf{x}-\boldsymbol{\mu}\rVert^2}{2\sigma^2}\right),$$

where $\lVert\cdot\rVert$ denotes the Euclidean norm.
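As a quick sanity check (my own, using SciPy; not part of the original derivation), the circular form integrates to unity for any $\sigma$:

```python
import numpy as np
from scipy import integrate

def g2_circular(x, y, sigma=1.0, mu=(0.0, 0.0)):
    """Circular 2-D Gaussian with unit volume."""
    r2 = (x - mu[0])**2 + (y - mu[1])**2
    return np.exp(-r2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)

# volume over a generous square (+/- 15 is ~8.6 sigma for sigma = 1.75)
vol, _ = integrate.dblquad(lambda y, x: g2_circular(x, y, sigma=1.75),
                           -15, 15, lambda x: -15, lambda x: 15)
```

The quadrature returns a volume of 1 to within the integration tolerance.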

Therefore, the only axes we must solve for in the gradient descent algorithm are the translation, i.e. the centroid $(\mu_x, \mu_y)$ of a source, and the variance $\sigma^2$ (which is related to the stability of the atmosphere). Feature scaling of the dataset will also be important, because the ADUs won’t be in the range $[0, 1]$.
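Min-max scaling is all that’s needed for the feature scaling; a tiny sketch (the `adu` values are hypothetical):

```python
import numpy as np

# hypothetical ADU counts from a cropped image
adu = np.array([[120.0, 480.0], [960.0, 1440.0]])

# map the array onto [0, 1]
scaled = (adu - adu.min()) / (adu.max() - adu.min())
```

After this transformation the minimum pixel is exactly 0 and the maximum exactly 1.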

### The Moffat Distribution

The authors of [1] present the Gaussian as a limiting case of the more general Moffat distribution, introduced by none other than A. Moffat in 1969 [2]. The circular case is defined as

$$M(x, y;\, \alpha, \beta) = \frac{\beta - 1}{\pi\alpha^2}\left(1 + \frac{(x-\mu_x)^2 + (y-\mu_y)^2}{\alpha^2}\right)^{-\beta}.$$

The authors, again, give a very interesting derivation in appendix A which fortifies their claim that

$$\lim_{\beta\to\infty} M(x, y;\, \alpha, \beta) = G(x, y).$$

I would like to retrace their steps here. But first, as with the former distribution, we shall determine (or simply verify, as it turns out) the conditions under which the volume of $M$ is unity. We see that

$$\iint_{\mathbb{R}^2} M\,dx\,dy = \frac{\beta - 1}{\pi\alpha^2}\int_0^{2\pi}\!\!\int_0^{\infty}\left(1 + \frac{r^2}{\alpha^2}\right)^{-\beta} r\,dr\,d\theta,$$

and by the same substitution we used last time (here $s = 1 + r^2/\alpha^2$), we may simplify this integral to

$$(\beta - 1)\int_1^{\infty} s^{-\beta}\,ds = (\beta - 1)\left[\frac{s^{1-\beta}}{1-\beta}\right]_1^{\infty} = 1 \qquad (\beta > 1),$$

which works out quite nicely. Continuing on our quest to unravel their work in appendix A, we solve for the full width at half maximum (FWHM) of $M$, which may be visualized as the width of the fitted PSF halfway up its peak. Thus, normalizing the amplitude so that the peak value is unity,

$$\left(1 + \frac{r^2}{\alpha^2}\right)^{-\beta} = \frac{1}{2} \implies r = \alpha\sqrt{2^{1/\beta} - 1},$$

so that

$$\mathrm{FWHM} = 2\alpha\sqrt{2^{1/\beta} - 1}.$$

The extra factor of $2$ came from the fact that we’re trying to find the *width*, whereas the penultimate equation yields the radius (so we multiply it by $2$). Next, we eliminate $\alpha$ in $M$ by writing it in terms of the FWHM, since

$$\alpha = \frac{\mathrm{FWHM}}{2\sqrt{2^{1/\beta} - 1}}.$$

For this I obtain the expression

$$M(r) = \frac{4(\beta - 1)(2^{1/\beta} - 1)}{\pi\,\mathrm{FWHM}^2}\left(1 + \frac{4(2^{1/\beta} - 1)\,r^2}{\mathrm{FWHM}^2}\right)^{-\beta}.$$
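Both the unit volume and the half-maximum radius are easy to spot-check numerically (my own quick sketch with SciPy; the seeing parameters below are hypothetical):

```python
import numpy as np
from scipy import integrate

def moffat(r, alpha, beta):
    """Circular Moffat profile, normalized to unit volume for beta > 1."""
    return (beta - 1.0) / (np.pi * alpha**2) * (1.0 + r**2 / alpha**2)**(-beta)

alpha, beta = 2.5, 4.765  # hypothetical seeing parameters

# volume via polar coordinates: integrate 2*pi*r*M(r) over r >= 0
vol, _ = integrate.quad(lambda r: 2.0 * np.pi * r * moffat(r, alpha, beta),
                        0.0, np.inf)

# half-maximum radius: M(hwhm)/M(0) should be exactly 1/2
hwhm = alpha * np.sqrt(2.0**(1.0 / beta) - 1.0)
ratio = moffat(hwhm, alpha, beta) / moffat(0.0, alpha, beta)
```

The volume comes out to 1 and the profile at the half-width radius is exactly half the peak, since $(1 + (2^{1/\beta} - 1))^{-\beta} = 2^{-1}$.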

Now we can make the case that, since the FWHM of the Gaussian satisfies

$$\mathrm{FWHM} = 2\sigma\sqrt{2\ln 2},$$

it would be perfectly fine to substitute as

$$\mathrm{FWHM}^2 = 8\sigma^2\ln 2.$$

Hence, altogether we have

$$M(r) = \frac{(\beta - 1)(2^{1/\beta} - 1)}{2\pi\sigma^2\ln 2}\left(1 + \frac{(2^{1/\beta} - 1)\,r^2}{2\sigma^2\ln 2}\right)^{-\beta}.$$

At this point it may appear ridiculously complex. But if you’ll recall the classic identity

$$\lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^n = e^x$$

(cf. [3]), we are able to obtain the exponential from the limit, since $\beta\,(2^{1/\beta} - 1) \to \ln 2$ as $\beta\to\infty$ and therefore

$$\lim_{\beta\to\infty}\left(1 + \frac{(2^{1/\beta} - 1)\,r^2}{2\sigma^2\ln 2}\right)^{-\beta} = e^{-r^2/2\sigma^2}.$$

Look familiar yet? Well, we’re off by a few constants; letting $2^{1/\beta} - 1 \approx \ln 2/\beta$ for large $\beta$, as the authors suggest, the leading constant tends to $1/(2\pi\sigma^2)$, and altogether

$$\lim_{\beta\to\infty} M(r) = \frac{1}{2\pi\sigma^2}\,e^{-r^2/2\sigma^2} = G(r).$$
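This limit is easy to eyeball numerically (my own quick sketch, not from [1]): with a very large $\beta$, the $\sigma$-parameterized Moffat should be nearly indistinguishable from the circular Gaussian.

```python
import numpy as np

def moffat_sigma(r, sigma, beta):
    """Moffat profile re-parameterized by the Gaussian sigma via the FWHM."""
    c = (2.0**(1.0 / beta) - 1.0) / (2.0 * sigma**2 * np.log(2.0))
    return (beta - 1.0) * c / np.pi * (1.0 + c * r**2)**(-beta)

def gaussian(r, sigma):
    """Circular Gaussian profile."""
    return np.exp(-r**2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)

# with beta very large, the two profiles agree almost everywhere
r = np.linspace(0.0, 5.0, 101)
err = np.max(np.abs(moffat_sigma(r, 1.0, 1.0e4) - gaussian(r, 1.0)))
```

For $\beta = 10^4$ the maximum pointwise difference is on the order of $10^{-5}$, and it shrinks roughly like $1/\beta$.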

In the next section we will write scripts that ‘shift and shimmy’ the $G$ and $M$ distributions about until a reasonable convergence is obtained by the GD algorithm; moreover, we will compare the fits for both the Gaussian and Moffat distributions. We shall also test the authors’ claim that setting $\beta = 100$ in $M$ should yield results similar to $G$ (I don’t have any doubts).

### Gradient Descent

The gradient descent algorithm is ubiquitous in machine learning. It is a first order iterative method for arriving at a local optimum of a multivariate function, usually attributed to Cauchy. We will utilize this tool to optimize the parameters of $G$ and $M$ for various real-world examples of stellar distributions in astronomical data.

The first order algorithm may be written recursively as

$$\boldsymbol{\theta}_{n+1} = \boldsymbol{\theta}_n - \gamma\,\nabla J(\boldsymbol{\theta}_n),$$

where in this case, the cost function $J$ is defined as

$$J(\boldsymbol{\theta}) = \frac{1}{2m}\sum_{i=1}^{m}\left(f(\mathbf{x}_i;\,\boldsymbol{\theta}) - y_i\right)^2$$

(e.g. see p. 173 of [4]), and $m$ is the total size of the image subset containing a star. We will be substituting $G$ and $M$ for $f$, and their parameter vectors will be $(A, \sigma, \mu_x, \mu_y)$ and $(A, \alpha, \beta, \mu_x, \mu_y)$, respectively. I have added the amplitude $A$ to scale either function, which is necessary since scaling the data to $[0, 1]$ does not mitigate the possibility that the amplitude of the best fit is slightly greater than unity. Moreover,

$$\nabla J(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^{m}\left(f(\mathbf{x}_i;\,\boldsymbol{\theta}) - y_i\right)\nabla_{\boldsymbol{\theta}} f(\mathbf{x}_i;\,\boldsymbol{\theta}),$$

so we may simplify the necessary computations slightly (the factor of $\tfrac{1}{2}$ cancels against the derivative of the square).
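To make the update rule concrete before applying it to images, here is a minimal scalar sketch (my own toy example, not part of the fitting code), minimizing $J(\theta) = (\theta - 3)^2$ with a constant rate:

```python
def descend(theta=0.0, gamma=0.1, iterations=100):
    """Gradient descent on J(theta) = (theta - 3)^2."""
    for _ in range(iterations):
        grad = 2.0 * (theta - 3.0)  # dJ/dtheta
        theta -= gamma * grad       # theta <- theta - gamma * grad J
    return theta

theta_hat = descend()  # converges toward the minimizer, theta = 3
```

Each step shrinks the error by a factor of $(1 - 2\gamma)$, so after 100 iterations the iterate is within $10^{-9}$ of the optimum.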

$\gamma$ is the “learning rate” of the algorithm. In the past I’ve just set this to a constant scalar, but I’ve found that, in practice, a constant vector (one rate per parameter) works a little better when one parameter may depend on others. This requires a little tuning to arrive at a decent (though not quite optimal) rate along each parameter. In the future I hope to implement a simple method that allows the algorithm to tune its own learning rates (I’m still reading about this in the literature).
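One simple self-tuning scheme along these lines (a hedged sketch of a backtracking line search, not something implemented in this post’s class) shrinks the rate whenever a step fails to reduce the cost:

```python
def backtracking_step(theta, cost, grad, gamma=1.0, shrink=0.5, tries=30):
    """Shrink gamma until the step decreases the scalar cost J(theta)."""
    for _ in range(tries):
        candidate = theta - gamma * grad(theta)
        if cost(candidate) < cost(theta):
            return candidate, gamma
        gamma *= shrink  # step was too aggressive; back off
    return theta, gamma

# minimize J(t) = (t - 3)^2 starting from t = 0
theta, gamma = 0.0, 1.0
for _ in range(50):
    theta, gamma = backtracking_step(theta,
                                     lambda t: (t - 3.0)**2,
                                     lambda t: 2.0 * (t - 3.0),
                                     gamma)
```

The rate here is a scalar for clarity; extending it to one rate per parameter (as the class below uses) is straightforward.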

To begin, let’s compute the 7 partial derivatives. I arrive at

$$\frac{\partial G}{\partial A} = \frac{G}{A},$$

$$\frac{\partial G}{\partial \sigma} = G\left(\frac{\lVert\mathbf{x}-\boldsymbol{\mu}\rVert^2}{\sigma^3} - \frac{2}{\sigma}\right),$$

$$\frac{\partial G}{\partial \mu_x} = \frac{G\,(x - \mu_x)}{\sigma^2}$$

for $G$ (with $\partial G/\partial\mu_y$ analogous), and

$$\frac{\partial M}{\partial A} = \frac{M}{A},$$

$$\frac{\partial M}{\partial \alpha} = \frac{2M}{\alpha}\left(\frac{\beta\,\lVert\mathbf{x}-\boldsymbol{\mu}\rVert^2}{\alpha^2 + \lVert\mathbf{x}-\boldsymbol{\mu}\rVert^2} - 1\right),$$

$$\frac{\partial M}{\partial \beta} = M\left(\ln\frac{\alpha^2}{\alpha^2 + \lVert\mathbf{x}-\boldsymbol{\mu}\rVert^2} + \frac{1}{\beta - 1}\right),$$

and

$$\frac{\partial M}{\partial \mu_x} = \frac{2M\beta\,(x - \mu_x)}{\alpha^2 + \lVert\mathbf{x}-\boldsymbol{\mu}\rVert^2}$$

for $M$ (again with $\partial M/\partial\mu_y$ analogous). I’ve written $G$ and $M$ in Python as follows.

def m2(X, theta):
    """ Moffat distribution; theta = [A, alpha, beta, muX, muY] """
    return theta[0]*(theta[2] - 1.0)/(np.pi*theta[1]**2.0) \
        *(1.0 + ((X[0] - theta[3])**2.0
                 + (X[1] - theta[4])**2.0)/(theta[1]**2.0))**(-theta[2])

def g2(X, theta):
    """ Gaussian distribution; theta = [A, sigma, muX, muY] """
    return theta[0]/(2.0*np.pi*theta[1]**2.0) \
        *np.exp(-((X[0] - theta[2])**2.0
                  + (X[1] - theta[3])**2.0)/(2.0*theta[1]**2.0))

As per the authors’ earlier suggestion, we might expect

$$\iint\left(G - M\big|_{\beta=100}\right)dx\,dy \approx 0.$$

Using dimension-independent versions of the previous two functions (which only take an additional line of code each), I have computed the following results.

def mn(X, theta):
    """ n-dimensional Moffat distribution.
        NB: the (beta - 1)/(pi*alpha**2) normalization constant is
        the 2-D one, so the volume is only unity when len(X) == 2. """
    return theta[0]*(theta[2] - 1.0)/(np.pi*theta[1]**2.0) \
        *(1.0 + sum((X[i] - theta[3:][i])**2.0
                    for i in range(len(X)))/(theta[1]**2.0))**(-theta[2])

def gn(X, theta):
    """ n-dimensional Gaussian distribution """
    return theta[0]/(((2.0*np.pi)**(len(X)/2.0))*(theta[1]**len(X))) \
        *np.exp(-(1.0/(2.0*(theta[1]**2.0)))
                *sum((X[i] - theta[2:][i])**2.0 for i in range(len(X))))

"""
I found that we would be adding O(n^d) time for dimension d (viewing
the integrals as nested loops), so I only attempted d <= 4, as that
seemed to be the max my laptop could handle in a reasonable amount of
time (about an hour).
>>> subset = [-3.0, 3.0]
>>> integrate.nquad(lambda x, y, z, t: gn([x, y, z, t],
...     [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]) - mn([x, y, z, t],
...     [1.0, 1.0, 100.0, 0.0, 0.0, 0.0, 0.0]),
...     [subset, subset, subset, subset])[0]
0.9571874015506082
>>> integrate.nquad(lambda x, y, z: gn([x, y, z],
...     [1.0, 1.0, 0.0, 0.0, 0.0]) - mn([x, y, z],
...     [1.0, 1.0, 100.0, 0.0, 0.0, 0.0]),
...     [subset, subset, subset])[0]
0.8131058082956795
>>> integrate.nquad(lambda x, y: gn([x, y],
...     [1.0, 1.0, 0.0, 0.0]) - mn([x, y],
...     [1.0, 1.0, 100.0, 0.0, 0.0]),
...     [subset, subset])[0]
-0.005392303232636786
"""

I suspect the reason the three- and four-dimensional results don’t match the prediction is that `mn` reuses the two-dimensional normalization constant $(\beta - 1)/(\pi\alpha^2)$ in every dimension, so the Moffat only integrates to unity when $n = 2$; the small two-dimensional residual of $\approx -0.005$ is consistent with this (and with the finite $[-3, 3]$ bounds). If you think I have actually made a mistake here, feel free to let me know in the comments. Because it isn’t necessary to write this program to be dimensionally independent (as much as I would enjoy the challenge), we’ll stick to the previous two functions.
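A quick check supporting this suspicion (my own, with SciPy): integrating `mn`’s radial profile over three dimensions via spherical shells shows its volume is nowhere near unity.

```python
import numpy as np
from scipy import integrate

def mn_profile(r, alpha=1.0, beta=100.0):
    """Radial Moffat profile with the 2-D normalization constant."""
    return (beta - 1.0) / (np.pi * alpha**2.0) \
        * (1.0 + r**2.0 / alpha**2.0)**(-beta)

# 3-D volume via spherical shells: shell area 4*pi*r^2 times the profile
vol3, _ = integrate.quad(lambda r: 4.0 * np.pi * r**2.0 * mn_profile(r),
                         0.0, 3.0)
# vol3 comes out far below 1, so mn is not normalized in three dimensions
```

The deficit $1 - \mathrm{vol}_3 \approx 0.8$ accounts for almost all of the 0.813 residual reported in the three-dimensional `nquad` run above.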

The seven partial derivatives I’ve listed above may be written as follows.

from math import log
import numpy as np
from numpy.linalg import norm

### Gaussian
def PartialAg2(X, theta):
    return g2(X, theta)/theta[0]

def PartialSigmaXg2(X, theta):
    return g2(X, theta)*((norm(X - theta[2:])**2.0)/(theta[1]**3.0)
                         - 2.0/theta[1])

def PartialMuXg2(X, theta, NAXIS=0):
    return (g2(X, theta)*(X[NAXIS] - theta[NAXIS + 2]))/(theta[1]**2.0)

### Moffat
def PartialAm2(X, theta):
    return m2(X, theta)/theta[0]

def PartialAlpham2(X, theta):
    return (2.0*m2(X, theta)/theta[1]) \
        *((theta[2]*norm(X - theta[3:])**2.0)
          /(theta[1]**2.0 + norm(X - theta[3:])**2.0) - 1.0)

def PartialBetam2(X, theta):
    return m2(X, theta)*(log((theta[1]**2.0)
        /(theta[1]**2.0 + norm(X - theta[3:])**2.0)) + 1.0/(theta[2] - 1.0))

def PartialMuXm2(X, theta, NAXIS=0):
    return (2.0*m2(X, theta)*theta[2]*(X[NAXIS] - theta[3 + NAXIS])) \
        /(theta[1]**2.0 + norm(X - theta[3:])**2.0)

### also include the Gaussian and Moffat functions (g2, m2) from above

We may now piece together a class with methods for performing linear GD.

from kernel import *
from numpy.linalg import norm
from numpy.random import randn
import partials
import numpy as np
import sys

class GD(object):
    def __init__(self, array, form, derivatives, max_iterations=1000,
                 limit=0.01):
        array = np.asarray(array)  # handle lists as well
        self.maximum = array.max()
        self.minimum = array.min()
        if (self.maximum != 1.0 or self.minimum != 0.0):
            # map the range of the image array onto [0, 1]
            self.array = (array - self.minimum)/(self.maximum - self.minimum)
        else:
            # probably not common in practice...
            self.array = array
        if (self.dim(self.array) != 2):
            raise ValueError('Array must be dimension 2.')
        self.rangeX = self.array.shape[0]
        self.rangeY = self.array.shape[1]
        self.form = form
        self.max_iterations = max_iterations
        self.limit = limit
        self.derivs = derivatives

    @staticmethod
    def dim(arr):
        return len(np.asarray(arr).shape)

    @staticmethod
    def expected(arr):
        return np.average(np.asarray(arr))

    def linear(self):
        """ perform linear GD using all derivatives """
        self.param_list = []
        # randomly generate the initial parameter vector
        theta = np.abs(randn(len(self.derivs) + 1)) \
            + (((self.rangeX + self.rangeY)//2) - 1)//2
        theta[0] = 1.0
        # perform GD using all partial derivatives
        for _ in range(self.max_iterations):
            cost = self.__cost_(theta)
            theta += self.__gamma_(1.0)*cost
            self.param_list.append(theta.tolist())
            # print(theta)
            if (not (0 < theta[-2] < self.rangeX) or
                    not (0 < theta[-1] < self.rangeY)):
                break
            if (norm(cost) < self.limit):
                break
            if (norm(cost) > 1e4):
                print("Values diverged.")
                sys.exit(0)
        return theta, np.array(self.param_list)

    def __gamma_(self, magnitude=1.0):
        """ estimate the value of gamma """
        if (len(self.derivs) + 1 == 4):
            # tuned for a decent Gaussian GD
            return -magnitude*np.array([1e3, 25.0, 2e2, 2e2])
        elif (len(self.derivs) + 1 == 5):
            # tuned for a decent Moffat GD
            return -magnitude*np.array([1e2, 1e1, 1e1, 1e1, 1e1])

    def __cost_(self, theta):
        """ compute the cost; Omega(m*N^2) time """
        total_cost = np.zeros(len(self.derivs) + 1)
        for i in range(self.rangeX):
            for j in range(self.rangeY):
                # difference between the model and the array
                cost = self.form([i, j], theta) - self.array[i][j]
                # set up nabla
                nabla = np.zeros(len(self.derivs) + 1)
                # compute the partials at each pixel
                for k in range(len(self.derivs) - 1):
                    nabla[k] = self.derivs[k]([i, j], theta)
                # compute the last derivative twice, once for each
                # axis; this may be collapsed into one line with a
                # slight modification to the partial functions
                nabla[-1] = self.derivs[-1]([i, j], theta, NAXIS=1)
                nabla[-2] = self.derivs[-1]([i, j], theta, NAXIS=0)
                # scale nabla by the difference
                total_cost += cost*nabla
        return total_cost/(self.rangeX*self.rangeY)

Most of the work in this class is done in the cost method, which builds $\nabla J$ by iterating over the entire image subset and computing the partials at each pixel. Also note that I’ve written the code to work with lists as input too, so the class is a bit more flexible.

In the event that you don’t want to print out `theta` on every iteration (which fills the terminal), the following quick lines (or some variation thereof) may be substituted for the commented-out print statement in `linear` above.

if (len(self.derivs) + 1 == 4):
    sys.stdout.write('%f\t%f\t%f\t%f\r' % (theta[0], theta[1],
                                           theta[2], theta[3]))
    sys.stdout.flush()
elif (len(self.derivs) + 1 == 5):
    sys.stdout.write('%f\t%f\t%f\t%f\t%f\r' % (theta[0], theta[1],
                                               theta[2], theta[3],
                                               theta[4]))
    sys.stdout.flush()

Now before we begin testing this algorithm on real data, let’s test it on a controlled set of examples with and without additive Gaussian noise. To generate these examples I have re-purposed the following class I wrote for UAV club (to convolve images).

from math import log, sqrt
import scipy.integrate as integrate
import numpy as np
import partials

class Kernel(object):
    """ creates kernels """
    def __init__(self, random_noise=False, noise_level=0.001):
        self.random_noise = random_noise
        if (random_noise):
            self.noise_level = noise_level

    def Kg2(self, X, Y, sigma, muX=0.0, muY=0.0):
        """ return a 2 dimensional Gaussian kernel """
        kernel = np.zeros([X, Y])
        theta = [1.0, sigma, muX, muY]
        for i in range(X):
            for j in range(Y):
                # integrate the PSF over each pixel
                kernel[i][j] = integrate.dblquad(lambda x, y:
                    partials.g2([x + float(i) - (X - 1.0)/2.0,
                                 y + float(j) - (Y - 1.0)/2.0], theta),
                    -0.5, 0.5, lambda y: -0.5, lambda y: 0.5)[0]
        if (self.random_noise):
            kernel = kernel \
                + np.abs(np.random.randn(X, Y))*self.noise_level
        # map the kernel onto [0, 1]
        return (kernel - np.min(kernel))/(np.max(kernel) - np.min(kernel))

    @staticmethod
    def FWHMg2(theta):
        """ get the FWHM of the Gaussian distribution """
        return 2.0*theta[1]*sqrt(2.0*log(2.0))

    def Km2(self, X, Y, alpha, beta, muX=0.0, muY=0.0):
        """ return a 2 dimensional Moffat kernel """
        kernel = np.zeros([X, Y])
        theta = [1.0, alpha, beta, muX, muY]
        for i in range(X):
            for j in range(Y):
                kernel[i][j] = integrate.dblquad(lambda x, y:
                    partials.m2([x + float(i) - (X - 1.0)/2.0,
                                 y + float(j) - (Y - 1.0)/2.0], theta),
                    -0.5, 0.5, lambda y: -0.5, lambda y: 0.5)[0]
        if (self.random_noise):
            kernel = kernel \
                + np.abs(np.random.randn(X, Y))*self.noise_level
        return (kernel - np.min(kernel))/(np.max(kernel) - np.min(kernel))

    @staticmethod
    def FWHMm2(theta):
        """ get the FWHM of the Moffat distribution """
        return 2.0*theta[1]*sqrt(2.0**(1.0/theta[2]) - 1.0)

Now that we *know* what the accepted values should be, we can truly test the results of the algorithm. Below is a sample Moffat distribution, generated by the above script.

And another without background noise for visual comparison to that above.

Moffat without additive noise, same parameters as the last.

We may create example PSFs as follows.

>>> from kernel import *
>>> PSF = Kernel(random_noise=False, noise_level=0.01)
>>> example = PSF.Kg2(15, 15, sigma=1.75, muX=0.0, muY = 0.0)

And perform a fit:

>>> import partials
>>> from GD import *
>>> fit = GD(example,
...          form=partials.g2,
...          derivatives=[partials.PartialAg2,
...                       partials.PartialSigmaXg2,
...                       partials.PartialMuXg2],
...          max_iterations=1000,
...          limit=1e-5)
>>> theta, progress = fit.linear()

I could have written the code so the partials are inherent to the class, but I thought it better for the future (in case we wanted to try a different distribution) to just submit each partial, as well as the related function. Also note that I’ve passed `partials.PartialMuXg2` **last** – for this class it’s important that the $\mu_x$, $\mu_y$ partials be in the last position.

I get a convergence for $G$ in ~500 iterations using the above pre-defined rate vector, and ~1200 for $M$. Again, these rates aren’t optimal, so it doesn’t make sense to compare. Below are graphs showing the convergence of either function to a PSF generated with parameters $\sigma = 1.75$, $\mu_x = 0$, $\mu_y = 0$ (and of course it’s rescaled in `kernel.py` to $[0, 1]$).

Now for a bunch of noise to see how well the algorithm copes. The centroid shouldn’t vary much, but I’m expecting $\sigma$, $\alpha$, and $\beta$ to change quite a bit.

I didn’t get a decent convergence from $G$ inside of 5000 iterations. Setting the rates faster caused a bit of chaos (e.g. see the last section). I’m assuming that a larger array may have been necessary, as in the fits with SDSS data (below), which worked fine.

Convergence of $M$ occurred in about the same number of iterations as in the noiseless case, probably indicative that it’s a more robust model.

Now that we have definitive results showing that the algorithm has the ability to converge to the approximate expected values, we’ll try it on real data.

### Working with FITS Images and Testing on Real Data

We’ll try fitting these two models to single stars in the latest SDSS3 data release, DR9. You can try this on any data you like, however.

Specifically, I have downloaded frame 15 in the g-band. If you would like to use these same data, you may download and uncompress the image as follows.

wget http://data.sdss3.org/sas/dr9/boss/photoObj/frames/301/8158/6/frame-g-008158-6-0015.fits.bz2
bzip2 -dk frame-g-008158-6-0015.fits.bz2

We can open these images with PyFITS (whose functionality now lives in `astropy.io.fits`) using the following function.

import pyfits

def get_fits_array(image_name, axis=0, info=False):
    """ Get the data from an image along a specified axis. """
    image = pyfits.open(image_name, mode='readonly')
    try:
        if (info):
            print(image.info())
        return image[axis].data
    finally:
        image.close()

This returns a NumPy array containing the image data. Extracting stars and objects from the frame is a huge statistical problem in itself, but here I’m just going to take the array and crop it nicely about a star that I’ve physically located. By the way, if you’re not already familiar, I’ve found that SExtractor is extremely simple to use for this kind of work too.

>>> image = get_fits_array('frame-g-008158-6-0015.fits')
>>> image = image[1046:1086,530:570]

We can also plot this image and view it as follows.

import numpy as np
import matplotlib.pyplot as plt

# be careful taking the log - there may be negatives or zeros
image[np.where(image < 0.0)] = 0.01
plt.imshow(np.log10(image), interpolation='none', cmap='gray')
plt.show()

A single star in the g-band image. I’ve purposely made it off-center as well.

The cost function required quite a bit of time to compute in either case, because the subset above is now 1600 pixels instead of a mere 225.

The Moffat function didn’t take nearly as long to converge as I was expecting, and its parameters do some interesting things too. The only bottleneck is the extra partial derivative it has to compute at each pixel, which becomes more noticeable as the array becomes larger (not to mention the fact that its derivatives are a bit more complex).

Parameters such as $\mu_x$, $\mu_y$, and the FWHM compare favorably with those of the Gaussian. It also appears that the Moffat function becomes far closer to normal during the descent, before very slowly decreasing again. Altogether I do not think this is a good method for solving for $\beta$, as it’s a “weaker” variable, only having an impact after the distribution is centered and $\alpha$ is closer to its optimal value.

### Improvements

I mentioned above that I would like to implement a method that determines a decent “learning rate” for each parameter. It’s rather easy to drive the parameters too fast, causing unstable results like that below.

GD that’s driven too hard makes for chaotic results.

I’m also interested in pursuing one of the many second order GD methods, but again I am still reading the literature and working out the details (especially the computation of the Hessian).

If you can think of any ways to improve this code feel free to let me know in the comments!

### Sources

[1] *The Effects of Seeing on Sérsic Profiles – II. The Moffat PSF*, by Trujillo et al. (2001)

[2] *A Theoretical Investigation of Focal Stellar Images in the Photographic Emulsion and Application to Photographic Photometry*, by A. Moffat (1969)

[3] *$e$ as the Limit of $(1 + 1/n)^n$*, by D. Joyce

[4] *Stochastic Optimization*, by J. Spall, Handbook of Computational Statistics p. 173

[5] *Common-Resolution Convolution Kernels for Space- and Ground-Based Telescopes*, by G. Aniano et al. (2011)

[6] *Advanced Calculus*, by D. Widder, p. 371