Quick recipe for a bibliography with BibTeX

I like typing math homework in \LaTeX if I have the time, and lately I’ve wanted to find an easier way to cite sources and create a bibliography at the end of a paper, should I use any.

One way I’ve used in the past is to simply type out all of the information exactly as it should appear in the document.

But this is tedious.

A far quicker and easier way is to use BibTeX, which I found hardly straightforward at first (which is why I’m writing this quick little tutorial for my future self and others).

First, look up your text on MathSciNet. We’ll look up the same text I’ve listed above. At the top we have the option to get the BibTeX information from the drop-down menu:

Now copy/paste this information into a .bib file (you may name it whatever you’d like). At the bottom of your \TeX document, before \end{document}, put the following lines:
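
The exact lines depend on the style you want; a minimal setup, assuming you named the file refs.bib (the name here is just a placeholder), is:

    \bibliographystyle{plain} % or alpha, abbrv, amsplain, ...
    \bibliography{refs}       % refs.bib, minus the extension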

There are many different styles you can choose from; I’ve just kept it basic.

Now we recompile using pdflatex and bibtex. I’ve written a simple bash script to automate the process, which assumes you’re in the same directory as the file.

I also saw a suggestion on one of the Stack Exchange forums to use Python, so I wrote the following little script too.
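
I won’t reproduce my exact scripts here, but a minimal Python sketch of the same idea is below; it assumes the document’s base name is passed on the command line and simply runs the usual pdflatex, bibtex, pdflatex, pdflatex cycle.

    #!/usr/bin/env python
    # build.py -- a minimal sketch (not the exact script I uploaded):
    # run the pdflatex/bibtex/pdflatex/pdflatex cycle on one document.
    import subprocess
    import sys

    def build(basename):
        """Compile basename.tex and resolve its BibTeX references."""
        for cmd in (["pdflatex", basename + ".tex"],
                    ["bibtex", basename],
                    ["pdflatex", basename + ".tex"],
                    ["pdflatex", basename + ".tex"]):
            subprocess.check_call(cmd)

    if __name__ == "__main__":
        build(sys.argv[1])  # e.g. python build.py homework, for homework.tex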

Because I’m more comfortable with Python than Bash, I tried to make the script pretty flexible. Running either of these produces the results we were expecting on a sample document:


An example compiled source.


Now all you have to do is include \cite{MR986070} in the spots you wish to reference [1] (and I wish it were this simple in WordPress too :P). I’m sure there are other nifty tricks and details I’ve overlooked or haven’t come across yet, but I think this is the gist of it.

In the code I uploaded to GitHub for future reference I’ve also included a bunch of if-statements to remove files that clutter the directory, but these are of course optional.

Posted in Bash, LaTeX, Programming, Python

Fitting PSF Models to Stellar Distributions in FITS Data

As an end-of-semester project I would like to investigate the mathematics of Gaussian and Moffat distributions, which are used to model stars in astronomical data. My inspiration for writing this post came from reading [1] and [5], where many interesting analytic results are presented.

Specifically, in this post we’ll generate example PSFs and sources therefrom, extract stellar sources from images taken by the SDSS, and fit Gaussian and Moffat functions to each using a linear gradient descent (GD) algorithm.

The Gaussian Distribution

We will consider the bivariate (two-dimensional) case of the normal distribution, since image arrays are usually two-dimensional arrays of pixels.

g_{2}\left(X;\mu,\Sigma\right)=\dfrac{1}{2\pi\,|\Sigma|^{1/2}}\exp\left[-\dfrac{1}{2}\,d_{M}^{2}\left(X,\mu\right)\right],\qquad d_{M}^{2}\left(X,\mu\right)=\left(X-\mu\right)^{\text{T}}\Sigma^{-1}\left(X-\mu\right),

where the covariance matrix \Sigma will be defined as

\Sigma=\begin{bmatrix}\sigma_{1}^{2} & 0\\ 0 & \sigma_{2}^{2}\end{bmatrix},

so that

\Sigma^{-1}=\begin{bmatrix}\sigma_{1}^{-2} & 0\\ 0 & \sigma_{2}^{-2}\end{bmatrix}\qquad\text{and}\qquad|\Sigma|=\sigma_{1}^{2}\,\sigma_{2}^{2},

and thus the argument of the exponential, -\frac{1}{2}\,d_{M}^{2}\left(X,\mu\right), becomes

-\dfrac{1}{2}\left[\dfrac{\left(x_{1}-\mu_{1}\right)^{2}}{\sigma_{1}^{2}}+\dfrac{\left(x_{2}-\mu_{2}\right)^{2}}{\sigma_{2}^{2}}\right].

Therefore, we arrive at a simplified distribution

g_{2}\left(X;\mu,\sigma_{1},\sigma_{2}\right)=\dfrac{1}{2\pi\sigma_{1}\sigma_{2}}\exp\left[-\dfrac{\left(x_{1}-\mu_{1}\right)^{2}}{2\sigma_{1}^{2}}-\dfrac{\left(x_{2}-\mu_{2}\right)^{2}}{2\sigma_{2}^{2}}\right],

which is basically a fixed-axis elliptical normal distribution. Furthermore, because the normalization works out rather elegantly, we see that



and by parameterizing R\begin{bmatrix}\cos\left(\theta\right)\\\sin\left(\theta\right)\end{bmatrix}+\mu\mapsto X, we obtain



Thence the former result is



and by yet another substitution one may simplify this integral drastically to


If \sigma_{1}=\sigma_{2}=1 (to obtain the “standard normal distribution”), then the volume under g_{2,c} is 1. This is a widely understood identity that easily generalizes to higher dimensions (e.g. see [4] in my last post). On another mathematical note, in reference to [6], I suspect the multivariate normal distribution may aid in determining the value of \Gamma_{m}\left(\frac{1}{2}\right), the multivariate gamma function (I will have to investigate this in a future post).

Because we will only be considering the circular Moffat function in the next section, here we will restrict this function even further by setting \sigma_{1}=\sigma_{2}=\sigma>0. This also means that

\Sigma=\sigma^{2}I_{2},

where I_{2} is the 2\times 2 identity matrix. If you’ll recall from statistics, when \Sigma=I we just obtain the Euclidean metric from d_{M}, so the entire equation simplifies to

g_{2,c}\left(X;\mu,\sigma\right)=\dfrac{1}{2\pi\sigma^{2}}\exp\left[-\dfrac{||X-\mu||^{2}}{2\sigma^{2}}\right],

where ||\cdot|| denotes the Euclidean norm.

Therefore, the only parameters we must solve for in the gradient descent algorithm are the translation, i.e. the centroid of a source, and the variance (which is related to the stability of the atmosphere). Feature scaling of the dataset will also be important, because the ADUs won’t lie in the range \big[0,1\big].

The Moffat Distribution

The authors of [1] present the Gaussian as a limiting case of the more general Moffat distribution, introduced by none other than A. Moffat in 1969 [2]. The circular case is defined as

m_{2}\left(X,\mu;\alpha,\beta\right)=\dfrac{\beta-1}{\pi\alpha^{2}}\left[1+\dfrac{||X-\mu||^{2}}{\alpha^{2}}\right]^{-\beta},

The authors, again, give a very interesting derivation in appendix A which fortifies their claim that


I would like to retrace their steps here. But first, as with the former distribution, we shall determine (or simply verify, as it turns out) the conditions for which the volume of m_{2} is unity. We see that



and by the same substitution we used last time, we may simplify this integral to






which works out quite nicely. Continuing our effort to unravel their work in appendix A, we solve for the full width at half maximum (FWHM) of m_{2}, which may be visualized as the width of the fitted PSF at half of its peak value. Thus

\dfrac{1}{2}=\dfrac{\pi\alpha^2\,m_{2}(\,\vec{0},\mu;\alpha,\beta\,)}{\beta-1} (normalized amplitude)

\implies\left[1+\dfrac{||\mu||^{2}}{\alpha^{2}}\right]^{-\beta}=\dfrac{1}{2}

\implies ||\mu||=\alpha\sqrt{2^{1/\beta}-1},

\implies\text{FWHM}_{m}=2\,||\mu||=2\alpha\sqrt{2^{1/\beta}-1}.

The extra factor of 2 came from the fact that we’re trying to find the width, whereas the penultimate equation yields the radius (so we multiply it by 2). Next, we eliminate \alpha in m_{2} by writing it in terms of \text{FWHM}_{m}, since

\alpha=\dfrac{\text{FWHM}_{m}}{2\sqrt{2^{1/\beta}-1}}.

For this I obtain the expression


Now we can make the case that since

\lim_{\beta\rightarrow\infty}\dfrac{2^{1/\beta}-1}{\log\left(2\right)/\beta}=1,

it would be perfectly fine to substitute \log\left(2\right)/\beta\mapsto 2^{1/\beta}-1 as


Hence, altogether we have


At this point it may appear ridiculously complex. But if you’ll recall the classic identity

\displaystyle\lim_{x\rightarrow\infty}\left(1+\dfrac{1}{x}\right)^{x}=e (cf. [3])

we are able to obtain the exponential from the limit


Look familiar yet? Well, we’re off by a few constants; letting \text{FWHM}_{m}^{2}=8\sigma^2\log\left(2\right), as the authors suggest,



In the next section we will write scripts that ‘shift and shimmy’ the m_{2} and g_{2,c} distributions about until a reasonable convergence is obtained by the GD algorithm; moreover, we will compare the fits for both the Gaussian and Moffat distributions. We shall also test the authors’ claim that setting \beta\sim 100 in m_{2} should yield similar results to g_{2,c} (I don’t have any doubts).

Gradient Descent

The gradient descent algorithm is ubiquitous in machine learning. It is a first-order iterative method for arriving at the local optima of a multivariate function, usually credited to Cauchy rather than Newton. We will utilize this tool to optimize the parameters of m_{2} and g_{2,c} for various real-world examples of stellar distributions in astronomical data.

The first order algorithm may be written recursively as

\vec{\theta}_{\kappa+1}=\vec{\theta}_{\kappa}-\eta_{\kappa}\,\vec{\nabla}\,\text{cost}\left(\vec{\theta}_{\kappa};h,z\right),

where in this case, \text{cost} is defined as

\text{cost}(\vec{\theta}_{\kappa}; h,z)=\dfrac{1}{2P}\displaystyle\sum_{i=1}^{P}(h^{i}(\vec{\theta}_{\kappa})-z^{i})^{2}


(e.g. see p. 173 of [4]), and P=\prod_{j=1}^{\dim I_{s}}\text{NAXIS}_{j} is the total size of the image subset containing a star. We will be substituting g_{2,c} and m_{2} for h, and their parameter vectors \vec{\theta} will be \begin{bmatrix}A & \sigma & \mu_{1} & \mu_{2}\end{bmatrix}^{\text{T}} and \begin{bmatrix}A & \alpha & \beta & \mu_{1} & \mu_{2}\end{bmatrix}^{\text{T}}, respectively. I have added A to scale either function, which is necessary since scaling the data to [0,1] does not guarantee that the amplitude of the best fit stays at or below unity. Moreover,


so we may simplify the necessary computations slightly.

\eta_{\kappa} is the “learning rate” of the algorithm. In the past I’ve just set this to a constant scalar, but I’ve found that, in practice, a constant vector (one rate per parameter) works a little better when one parameter may depend on the others. This requires a little tuning to arrive at a decent (though not quite optimal) rate along each parameter. In the future I hope to implement a simple method that allows the algorithm to tune its own learning rates (I’m still reading about this in the literature).

To begin, let’s compute the 7 partial derivatives. I arrive at




for g_{2,c}, and




and \partial_{\mu_{i}}\,m_{2}=\dfrac{2\,m_{2}\,\beta\left(x_{i}-\mu_{i}\right)}{\alpha^{2}+||X-\mu||^{2}},

for m_{2}. I’ve written m_{2} and g_{2,c} in Python as follows.
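
The exact code isn’t reproduced here, but a minimal sketch of the two profiles might look like the following. I’m assuming the normalized circular forms derived above with the extra amplitude A out front, and the function names are my own.

    import numpy as np

    def g2c(X, A, sigma, mu):
        """Circular Gaussian profile: amplitude A, width sigma, centroid mu."""
        r2 = np.sum((np.asarray(X, float) - np.asarray(mu, float)) ** 2)
        return A * np.exp(-r2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

    def m2(X, A, alpha, beta, mu):
        """Circular Moffat profile: amplitude A, scale alpha, exponent beta."""
        r2 = np.sum((np.asarray(X, float) - np.asarray(mu, float)) ** 2)
        return A * (beta - 1.0) / (np.pi * alpha ** 2) * (1.0 + r2 / alpha ** 2) ** (-beta)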

As per the authors’ earlier suggestion, we might expect

\displaystyle\int_{\mathbb{R}^{n}}(m_{n}(X,\mu;\alpha,100)-g_{n,c}(X;\mu,\Sigma))\,\mathrm{d}X\approx 0.

Using dimension-independent versions of the previous two functions (which only take an additional line of code each), I have computed the following results.

I think the reason these don’t quite match the prediction is roundoff error, or that \beta=100 isn’t large enough; if you think I have actually made a mistake here, feel free to let me know in the comments. Because it isn’t necessary for this program to be dimensionally independent (as much as I would enjoy the challenge), we’ll stick to the previous two functions.

The seven partial derivatives I’ve listed above may be written as follows.
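
Those functions aren’t shown here either, but as an illustration, the Moffat centroid partial written above could be coded as follows (the other six follow the same pattern). The signature is my own choice and it builds on the m2 sketch above.

    def partial_mu_m2(X, A, alpha, beta, mu, i):
        """d(m_2)/d(mu_i) at pixel X, using the expression derived above."""
        X, mu = np.asarray(X, float), np.asarray(mu, float)
        r2 = np.sum((X - mu) ** 2)
        return 2.0 * beta * m2(X, A, alpha, beta, mu) * (X[i] - mu[i]) / (alpha ** 2 + r2)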

We may now piece together a class with methods for performing linear GD.

Most of the work in this class is done in the cost method, which builds \vec{\nabla}\text{cost} by iterating over the entire image subset and computing the partials at each pixel. Also note that I’ve written the code to work with lists as input too, so the class is a bit more flexible.
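
The class itself is too long to reproduce, but a stripped-down sketch of the same loop (accumulate \vec{\nabla}\text{cost} pixel by pixel, then step each parameter by its own rate) is below. Every name here is mine, and the real class is more flexible about its inputs.

    def gradient_descent(image, h, partials, theta, rates, iterations=1000):
        """Minimal GD sketch. `image` is a 2-D NumPy array, h(X, *theta) is the
        model, `partials` is a list of callables d(h)/d(theta_j) with the same
        signature as h, and `theta`/`rates` are equal-length parameter and
        per-parameter learning-rate lists."""
        P = image.size
        ny, nx = image.shape
        for _ in range(iterations):
            grad = [0.0] * len(theta)
            for y in range(ny):
                for x in range(nx):
                    residual = h((x, y), *theta) - image[y, x]
                    for j, dj in enumerate(partials):
                        grad[j] += residual * dj((x, y), *theta) / P
            theta = [t - r * g for t, r, g in zip(theta, rates, grad)]
        return theta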

In the event that you don’t want to print out theta on every iteration (which fills the terminal), the following quick lines (or some variation thereof) may be substituted for that which is highlighted above.

Now before we begin testing this algorithm on real data, let’s test it on a controlled set of examples with and without additive Gaussian noise. To generate these examples I have re-purposed the following class I wrote for UAV club (to convolve images).
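
That class isn’t reproduced here, but generating a synthetic source is simple enough that a small sketch conveys the idea. I’m reading the N in the captions below as the standard deviation of the additive Gaussian noise, and the names and signature are again my own.

    def make_moffat_image(shape, A, alpha, beta, mu, noise=0.0):
        """Evaluate the circular Moffat profile on a pixel grid and add
        zero-mean Gaussian noise with standard deviation `noise`."""
        yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
        r2 = (xx - mu[0]) ** 2 + (yy - mu[1]) ** 2
        img = A * (beta - 1.0) / (np.pi * alpha ** 2) * (1.0 + r2 / alpha ** 2) ** (-beta)
        img += noise * np.random.randn(*shape)
        return img / img.max()  # rescale so the peak is unity, as with the fits below

    example = make_moffat_image((15, 15), 1.0, 3.5, 4.5, (7, 7), noise=0.005)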

Now that we know what the accepted values should be, we can truly test the results of the algorithm. Below is a sample Moffat distribution, generated by the above script.


An example Moffat distribution with additive noise N=0.005, and parameters \alpha=3.5, \beta=4.5.

And another without background noise for visual comparison to that above.


Moffat without additive noise, same parameters as the last.

We may create example PSFs as follows.

And perform a fit:

I could have written the code so the partials are inherent to the class, but I thought it better for the future (in case we want to try a different distribution) to just pass in each partial, as well as the related function. Also note that I’ve passed partials.PartialMuXg2 last – for this class it’s important that the \partial_{\mu_{i}}‘s be in the last position.

I get a convergence for g_{2,c} in ~500 iterations using the above pre-defined rate vector, and ~1200 for m_{2}. Again, these rates aren’t optimal, so it doesn’t make sense to compare. Below are graphs showing the convergence of either function to a generated g_{2,c} with parameters \sigma=1.75, \mu_{1}=7, \mu_{2}=7 (and of course it’s rescaled in kernel.py to [0,1]).


Convergence of the Moffat function to a Gaussian profile. What’s interesting is that \beta (red) converges to a relatively high value, so the function is closer in shape to a Gaussian. In this plot, blue = A, red = \beta, purple and cyan = x_{1} and x_{2}, and green = \alpha.


Convergence of the Gaussian function to a Gaussian profile. Because I sought to initialize \mu_{1} and \mu_{2} (two indistinguishable lines about 7.0) as close to center as possible, they converge extremely fast, whereas A (blue) and \sigma (green) take a few more iterations.

Now for a bunch of noise to see how well the algorithm copes. The centroid shouldn’t vary much, but I’m expecting A, \sigma, \alpha and \beta to change quite a bit.


I didn’t get a decent convergence from m_{2} within 5000 iterations. Increasing the rates caused a bit of chaos (e.g. see the last section). I’m assuming that a larger array may have been necessary, as in the fits with the SDSS data (below), which worked fine.


Convergence of g_{2,c} occurred in about the same number of iterations, probably indicative that it’s a more robust model.

Now that we have definitive results showing that the algorithm has the ability to converge to the approximate expected values, we’ll try it on real data.

Working with FITS Images and Testing on Real Data

We’ll try fitting these two models to single stars in the latest SDSS-III data release, DR9. You can try this on any data you like, however.

Specifically, I have downloaded frame 15 from the g-band. If you would like to use the same data, you may download and uncompress the image as follows.
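
The original snippet isn’t shown, but the same thing can be done in a few lines of Python. The URL below is only a placeholder (grab the real link to the frame you want from the SDSS data server); DR9 frames arrive as bzip2-compressed FITS files.

    import bz2
    try:
        from urllib.request import urlopen  # Python 3
    except ImportError:
        from urllib2 import urlopen         # Python 2

    # Placeholder URL -- substitute the real link to the frame you want.
    url = "http://example.org/frame-g.fits.bz2"

    with open("frame-g.fits", "wb") as f:
        f.write(bz2.decompress(urlopen(url).read()))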

We can open these images with PyFITS in the following function.
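
My function isn’t reproduced here, but a minimal version with PyFITS (which now lives on as astropy.io.fits) would look something like this; the function name is my own.

    import pyfits  # or: from astropy.io import fits as pyfits

    def open_frame(filename, ext=0):
        """Return the image data of one FITS extension as a NumPy array."""
        hdulist = pyfits.open(filename)
        data = hdulist[ext].data
        hdulist.close()
        return data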

This returns a NumPy array containing the image data. Extracting stars and objects from the frame is a huge statistical problem in itself, but here I’m just going to take the array and crop it nicely about a star that I’ve located by eye. By the way, if you’re not already familiar with it, I’ve found that SExtractor is extremely simple to use for this kind of work too.

We can also plot this image and view it as follows.
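
For instance, with matplotlib (the crop indices below are placeholders for wherever your star happens to land; a 40 by 40 cutout matches the 1600-pixel subset mentioned further down):

    import matplotlib.pyplot as plt

    frame = open_frame("frame-g.fits")        # using the sketch above
    star = frame[500:540, 700:740]            # placeholder crop about a star

    plt.imshow(star, origin="lower", interpolation="nearest", cmap="gray")
    plt.colorbar()
    plt.show()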



A single star in the g-band image. I’ve purposely made it off-center as well.

The \text{cost} function required quite a bit of time to compute in either case because the subset above is now 1600 pixels instead of a mere 225.


The values the algorithm arrived at are A=27.06944038, \sigma=2.11533306, x_{1}=20.71500907, and x_{2}=20.69633224.

The Moffat function didn’t take nearly as long to converge as I was expecting, and its parameters do some interesting things too. The only bottleneck is the extra partial derivative it has to compute at each pixel, which becomes more noticeable as the array becomes larger (not to mention the fact that its derivatives are a bit more complex).


GD with m_{2} arrived at A=27.28057256, \alpha=13.05963548, \beta=20.17842481, x_{1}=20.71206149, and x_{2}=20.69690351.

Parameters such as A, x_{1} and x_{2} compare favorably with those of the Gaussian. It also appears that the Moffat function becomes far closer to normal during the descent, before very slowly decreasing \beta again. Altogether I do not think this is a good method for solving for \beta as it’s a “weaker” variable, only having an impact after the distribution is centered and A is closer to its optimal value.


I mentioned above that I would like to implement a method that determines a decent “learning rate” for each parameter. It’s rather easy to drive the parameters too fast, causing unstable results like those shown below.


GD that’s driven too hard makes for chaotic results.

I’m also interested in pursuing one of the many second order GD methods, but again I am still reading the literature and working out the details (especially the computation of the Hessian).

If you can think of any ways to improve this code feel free to let me know in the comments!


[1] The effects of seeing on Sérsic profiles – II. The Moffat PSF, by Trujillo et al. (2001)

[2] A Theoretical Investigation of Focal Stellar Images in the Photographic Emulsion and Application to Photographic Photometry, by A. Moffat (1969)

[3] e as the limit of \left(1+\frac{1}{n}\right)^{n}, by D. Joyce

[4] Stochastic Optimization, by J. Spall, Handbook of Computational Statistics p. 173

[5] Common-Resolution Convolution Kernels for Space- and Ground-Based Telescopes, by G. Aniano et al.

[6] Advanced Calculus, by D. Widder, p. 371

Posted in Algorithms, Astronomy, Mathematics, Python

The RX Algorithm

Over the past few months I have been writing programs to perform basic computer vision tasks for a drone we built in UAV club. To be blunt, I learned a lot – both about what I’m capable of, and what it means to work on a small team.

Before I explain the algorithm and share the code I wrote, I’ll outline our goals. They were to

  1. obtain usable imagery despite any vibrations and attitude variations of the quadcopter,
  2. search a complete frame for anomalies (colored targets),
  3. and classify any anomalies.

My teammate worked a lot on algorithms and methods related to #3, while I was slightly more interested in #1 and #2. I saw the goals as hierarchical: namely, (a) we probably cannot classify anomalies in the data if we have no means to locate them in entire frames, and (b) we cannot search the data for anomalies if the data are not clean and unblurred.

In this post I will talk briefly about #2.

After a lot of research I eventually found several papers on anomaly detection, including [1], [2] and [3]. I also went to the library and found a book on color science – a field for which I now have the utmost respect, because the mathematics involved in color matching and other concepts is quite elegant. Moreover, each of the aforementioned papers covers the RX algorithm, so I thought I’d give it a try.

Consider the multivariate Gaussian distribution

g_{n}\left(X; \mu,\Sigma\right)=\dfrac{1}{\left(2\pi\right)^{n/2}|\Sigma|^{1/2}}\exp\left[-\dfrac{1}{2}\left(X-\mu\right)^{\text{T}}\Sigma^{-1}\left(X-\mu\right)\right],

where |\cdot| denotes the determinant and \Sigma is the covariance matrix. There are a lot of interesting properties of this function that you may read about in [4]. For the RX algorithm we are, for the most part, only interested in the argument of the exponential function

d_{M}\left(X,\mu\right)=\sqrt{\left(X-\mu\right)^{\text{T}}\Sigma^{-1}\left(X-\mu\right)},

known as the Mahalanobis distance. It is basically a measure of how many standard deviations X, a spectral vector in the RGB tristimulus space, lies from the mean spectral vector \mu. For the RX algorithm we hope to compute this distance metric for all pixels in an image (which we may visualize as vectors in the space \text{RGB}=\left[0,255\right]^{3}).

This operation’s complexity is similar to that of convolution (it is bounded below by \Omega\left(N^2\right)), but there are a few steps we may take to reduce the cost of the problem. For instance, to compute the covariance matrix \Sigma for each image it’s not necessary to use every single pixel. In [3] the authors state that \Sigma “is the estimated background covariance matrix.” By extracting a random sample of the image pixels (in practice I found that ~10^4 worked well, and as few as 10^3 yielded reasonable results) this matrix can be estimated quickly.

See the figure below for a comparison (the Frobenius norm, averaged over 100 runs, for each extraction sample size ranging from 10^{2} to 10^{5}).


This process assumes that the background (the ground) is the dominant feature in a given image, and from a high vantage point this is likely the case.

Yet another simplification comes from the properties of the matrix \Sigma. It is commonly defined as

\Sigma_{ij}=\mathbb{E}\left[\left(X_{i}-\mu_{i}\right)\left(X_{j}-\mu_{j}\right)\right]=\Sigma_{ji},

since \Sigma is symmetric. We may use this definition so that we only have to compute p\left(p+1\right)/2 values instead of every entry, where p is the dimension of the spectral vectors (three, for RGB).

Now to compute d_{M} I’ve written the following class, which utilizes many vector and matrix operations available in NumPy.
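
The full class isn’t reproduced here, but the essential computation (estimate \mu and \Sigma from a random sample of pixels, then evaluate d_{M} at every pixel at once) can be sketched as follows; all of the names are mine.

    import numpy as np

    def rx_map(image, nsample=10000):
        """Return an image of Mahalanobis distances d_M for each RGB pixel,
        with mu and Sigma estimated from a random sample of the pixels."""
        h, w, bands = image.shape
        pixels = image.reshape(-1, bands).astype(float)

        # Estimate the background statistics from a random subset of pixels.
        idx = np.random.choice(pixels.shape[0], min(nsample, pixels.shape[0]),
                               replace=False)
        sample = pixels[idx]
        mu = sample.mean(axis=0)
        sigma_inv = np.linalg.inv(np.cov(sample, rowvar=False))

        # d_M^2 = (X - mu)^T Sigma^{-1} (X - mu), for every pixel at once.
        diff = pixels - mu
        d2 = np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)
        return np.sqrt(d2).reshape(h, w)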

This code yields the following results on a sample image.


Artifacts are due to the image compression and not the algorithm itself. As we can see, the objects that are “different” show up far brighter (like the pink target at upper-right), whereas the background, viz. grass and runway, are much darker.


I would like to try writing this algorithm in pure C to get a sense of how expensive it actually is to compute – as of late I’ve felt quite distant from the “happenings” while using Python.


[1] Anomaly Detection based on a parallel kernel RX algorithm for multicore platforms, Molero, J. et al.

[2] Is there a best hyperspectral detection algorithm?, Manolakis, D. et al.

[3] Anomaly Detection for Hyperspectral Imagery Using Analytical Fusion and RX, Zhang, G. and Yang, C.

[4] The Multivariate Gaussian Distribution, by Do, C.

Posted in Algorithms, Mathematics, Programming, Python


I won’t get the chance to write another post on February 29 for four years, so I wrote the following script on my TI-89 to give us some chaos to remember it by [1].

And the output, after about 20 minutes:


The two biggest problems were memory overflow (the Titanium has a whopping 188 kB of RAM) and dimension mismatches. The latter may sound like it has a simple fix, but I found it rather difficult to wrestle the lists to the same length; there are just so many places you can mess up! Not to mention how confusing it is to index lists starting at 1!

If you have suggestions for improvements please let me know in the comments.

Sources/Additional Resources

[1] Nagle, R., Saff, E., and Snider, A. Fundamentals of Differential Equations. 8th ed. Boston: Pearson/Addison-Wesley, 2012. pp. 151–52. Print.

[2] TIGCC Documentation

[3] Rosetta Code TI-89 BASIC

Posted in Programming, Ti 89

The Method of Least Squares

This post is a primer to the method of least squares, which allows one to fit a line to roughly linear data. In my next post I hope to expand upon these results and attempt to fit a nonlinear model to astronomical data, as outlined toward the end of my video in my first post.

Consider the graph I have drawn below of ten points in the real plane.


A bunch of points we’d like to fit a line to.

Explicitly, we have the finite set

P=\left\{\left(x_{k},y_{k}\right)\right\}_{k=1}^{10}\subset\mathbb{R}^{2},

and we wish to fit the continuous function

f\left(x\right)=ax+b

to this plot such that the distance between f and P is minimal, hence “least squares.” To determine the error or cost between the points and f we sum their squared differences at each point as

S\left(a,b\right)=\displaystyle\sum\limits_{k=1}^{n}\left(ax_{k}+b-y_{k}\right)^{2},

where n=\text{card }P. Squaring each distance makes the problem look as though we are minimizing the total area of squares, each with one vertex at a point p_{0}\in P and another vertex on the line. The benefit is that no area can be negative or cancel another.

The equation of a line has two constant parameters, a and b, representing the slope and the y-intercept, which we hope to solve for. In more complex models, an example of which we will see in my next post (I have two computers processing those results as I’m writing this), there are more parameters and more iterations.

When we speak of optimization in calculus we’re usually talking about derivatives. Taking the partial derivatives of S with respect to its parameters yields the following results.

\dfrac{\partial S}{\partial a}=\displaystyle\sum\limits_{k=1}^n 2x_k\left[ax_k+b-y_k\right]

=2a\displaystyle\sum\limits_{k=1}^n x_k^2+2b\displaystyle\sum\limits_{k=1}^n x_k-2\displaystyle\sum\limits_{k=1}^n x_k y_k

\implies a\displaystyle\sum\limits_{k=1}^n x_k^2+b\displaystyle\sum\limits_{k=1}^n x_k-\displaystyle\sum\limits_{k=1}^n x_k y_k=0,

\dfrac{\partial S}{\partial b}=\displaystyle\sum\limits_{k=1}^n 2\left[ax_k+b-y_k\right]

=2a\displaystyle\sum\limits_{k=1}^n x_k+2bn-2\displaystyle\sum\limits_{k=1}^n y_k

\implies a\displaystyle\sum\limits_{k=1}^n x_k+bn-\displaystyle\sum\limits_{k=1}^n y_k=0.

I have set them equal to 0 because, after all, optimal values of functions (relative minima and maxima) occur where \partial_{a}S=0 and \partial_{b}S=0. From the last result we can solve directly for b as

b=\dfrac{\displaystyle\sum\limits_{k=1}^ny_k-a\displaystyle\sum\limits_{k=1}^n x_k}{n}.

Furthermore, by direct substitution of this value of b into the former partial derivative we may solve for a (this is what makes linear regression so “nice”) to obtain

a\displaystyle\sum\limits_{k=1}^n x_k^2+\left(\dfrac{\sum_{k=1}^ny_k-a\sum_{k=1}^n x_k}{n}\right)\displaystyle\sum\limits_{k=1}^n x_k-\displaystyle\sum\limits_{k=1}^n x_k y_k=0

\implies a=\dfrac{n\displaystyle\sum\limits_{k=1}^n x_ky_k-\left(\displaystyle\sum\limits_{k=1}^n y_k\right)\left(\displaystyle\sum\limits_{k=1}^n x_k\right)}{n\displaystyle\sum\limits_{k=1}^n x_k^2 -\left(\displaystyle\sum\limits_{k=1}^n x_k\right)^2}.

These expressions are incredibly simple to program in Python (see the short sketch below). Applying them to the set of points above, we obtain summation results of

10\displaystyle\sum\limits_{k=1}^{10} x_ky_k=407.5,

\displaystyle\sum\limits_{k=1}^{10} y_k=-2.5,

10\displaystyle\sum\limits_{k=1}^{10} x_k^2=500,

\left(\displaystyle\sum\limits_{k=1}^{10} x_k\right)^2=1,

and a\displaystyle\sum\limits_{k=1}^{10}x_k=1\cdot a.

Therefore, a=\dfrac{407.5-\left(-2.5\right)\left(1\right)}{500-\left(1\right)}\approx 0.82164328\cdots and b=\dfrac{-2.5-a}{10}\approx -0.33216432\cdots, giving us a function

y\approx 0.82164328x-0.33216432.
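
As a quick sanity check, the closed-form expressions for a and b can be computed directly with NumPy; the arrays x and y below stand in for the ten points, which I haven’t listed explicitly.

    import numpy as np

    def least_squares_line(x, y):
        """Return (a, b) minimizing S(a, b) = sum_k (a*x_k + b - y_k)^2."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = x.size
        a = (n * np.sum(x * y) - np.sum(y) * np.sum(x)) \
            / (n * np.sum(x * x) - np.sum(x) ** 2)
        b = (np.sum(y) - a * np.sum(x)) / n
        return a, b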

Graphing this against the points above shows that it is indeed a pretty good fit:


The result of fitting a line to this set of points.

Another way to verify that this is the optimal value is to take a look at the Hessian matrix,

\det H_{S}\propto\det\begin{bmatrix}\displaystyle\sum\limits_{k=1}^n x_k^2 & \displaystyle\sum\limits_{k=1}^n x_k\\\displaystyle\sum\limits_{k=1}^n x_k & \displaystyle\sum\limits_{k=1}^n 1\end{bmatrix}

=\left(\displaystyle\sum\limits_{k=1}^n x_k^2\right)\left(\displaystyle\sum\limits_{k=1}^n 1\right)-\left(\displaystyle\sum\limits_{k=1}^n x_k\right)^2

=\dfrac{1}{2}\displaystyle\sum\limits_{j,k=1}^n\left(x_j-x_k\right)^2>0 (by Lagrange’s Identity)

which (together with \sum_{k=1}^{n}x_{k}^{2}>0) shows the critical point is a minimum, verifying that it is in fact the best-fit line.

As I’ve already mentioned, in my next post I hope to expand upon almost everything here with a direct application to fitting functions to discrete stellar profiles in astronomical images. I basically hope to compare Gaussian vs. Moffat function fits to several hundred stars in SDSS data, very similar to the results of Trujillo et al. [1].


[1] The effects of seeing on Sérsic profiles – II. The Moffat PSF, by Trujillo et al.

Posted in Mathematics