Sunday, March 15, 2009

Using the Gaussian Kernel Density Estimation

In scipy.stats we can find a class to estimate and use a Gaussian kernel density estimator, scipy.stats.gaussian_kde.

Until recently, I didn't know how this part of scipy worked, and the following describes roughly how I figured out what it does.

First, we can look at help to see whether the documentation provides enough information for its use. Here is the first part.

>>> help(stats.gaussian_kde)
Help on class gaussian_kde in module scipy.stats.kde:

class gaussian_kde(__builtin__.object)
 |  Representation of a kernel-density estimate using Gaussian kernels.
 |
 |   Parameters
 |   ----------
 |   dataset : (# of dims, # of data)-array
 |       datapoints to estimate from
 |
 |   Members
 |   -------
 |   d : int
 |       number of dimensions
 |   n : int
 |       number of datapoints
 |
 |   Methods
 |   -------
 |   kde.evaluate(points) : array
 |       evaluate the estimated pdf on a provided set of points
 |   kde(points) : array
 |       same as kde.evaluate(points)
 |   kde.integrate_gaussian(mean, cov) : float
 |       multiply pdf with a specified Gaussian and integrate over the whole domain
 |   kde.integrate_box_1d(low, high) : float
 |       integrate pdf (1D only) between two bounds
 |   kde.integrate_box(low_bounds, high_bounds) : float
 |       integrate pdf over a rectangular space between low_bounds and high_bounds
 |   kde.integrate_kde(other_kde) : float
 |       integrate two kernel density estimates multiplied together
 |
 |  Internal Methods
 |  ----------------
 |   kde.covariance_factor() : float
 |       computes the coefficient that multiplies the data covariance matrix to
 |       obtain the kernel covariance matrix. Set this method to
 |       kde.scotts_factor or kde.silverman_factor (or subclass to provide your
 |       own). The default is scotts_factor.
 |
 |  Methods defined here:
 |
 |  __call__ = evaluate(self, points)
 |
 |  __init__(self, dataset)
 |
 |  covariance_factor = scotts_factor(self)
 |
 |  evaluate(self, points)
 |      Evaluate the estimated pdf on a set of points.
 |
 ...

The first two basic methods that I was interested in are the initialization, __init__, which just takes a data set as an argument, and evaluate, or __call__, which takes as its argument the list of points at which we want to evaluate the estimated pdf.

Let’s just try this out:

First, I get the standard imports:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

Then I generate a sample that I can feed to the kde. As usual for continuous variables, I start with a normal distribution:

n_basesample = 1000
np.random.seed(8765678)
xn = np.random.randn(n_basesample)

Now, we create an instance of the gaussian_kde class and feed our sample to it:

gkde = stats.gaussian_kde(xn)

We need some points at which to evaluate the estimated density function:

ind = np.linspace(-7,7,101)
kdepdf = gkde.evaluate(ind)
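
As the help output above already notes, a kde instance is callable, so gkde(ind) gives the same values as gkde.evaluate(ind). As a quick sanity check of the other methods, we can also integrate the estimated pdf over an interval with integrate_box_1d; for a standard normal sample, the probability mass below zero should come out close to 0.5 (the exact number will vary with the sample):

# the instance is callable: this gives the same values as gkde.evaluate(ind)
kdepdf_alt = gkde(ind)
# integrate the estimated pdf over (-7, 0); for a standard normal sample
# this should be approximately 0.5
print gkde.integrate_box_1d(-7, 0)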

Finally, we plot the histogram of our data together with the density that generated our sample, the data generating process (DGP), and the estimated density.

plt.figure()
# plot histogram of sample
plt.hist(xn, bins=20, normed=1)
# plot data generating density
plt.plot(ind, stats.norm.pdf(ind), color="r", label='DGP normal')
# plot estimated density
plt.plot(ind, kdepdf, label='kde', color="g")
plt.title('Kernel Density Estimation')
plt.legend()
#plt.show()

And this is the graph that we get:

kde_normal.png

This graph looks pretty good: when the underlying distribution is the normal distribution, the Gaussian kernel density estimate follows the true distribution very closely, at least for a sample as large as the one we used.

Now, to make it a bit more difficult, we can look at a bimodal distribution and see whether the estimate still fits so well. As an example, I pick a mixture of two normal distributions: 60% of the sample comes from a normal distribution with mean -3 and standard deviation equal to one, and 40% of the observations are from a normal distribution with mean 3 and the same standard deviation.

alpha = 0.6  #weight for (prob of) lower distribution
mlow, mhigh = (-3,3)  #mean locations for gaussian mixture
# the sample sizes passed to randn have to be integers
xn = np.concatenate([mlow + np.random.randn(int(alpha * n_basesample)),
                     mhigh + np.random.randn(int((1-alpha) * n_basesample))])

With the new data set, we can run the same commands as above, except that we have to change the plot of the data generating density to use the mixture density instead:

plt.plot(ind, alpha * stats.norm.pdf(ind, loc=mlow) +
              (1-alpha) * stats.norm.pdf(ind, loc=mhigh),
              color="r", label='normal mix')

The next graph shows what we get in this case.

kde_mixture.png

The estimated density is oversmoothed: the peaks of the estimated pdf are too low. So the automatic selection of the smoothing parameter doesn't work in this case. Now it's time to find out how gaussian_kde actually selects the smoothing parameter, or bandwidth, of the kernel.
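
A first hint is already in the docstring above: covariance_factor returns the factor that scales the data covariance into the kernel covariance, and the documentation suggests subclassing to provide your own rule. As a minimal, hypothetical sketch (the class name and the fixed factor 0.2 are illustrative choices of mine, not a tuned recommendation), forcing a smaller bandwidth could look like this:

class gaussian_kde_smallbw(stats.gaussian_kde):
    '''hypothetical subclass that hard-codes a smaller bandwidth factor'''
    def covariance_factor(self):
        # return a fixed factor instead of Scott's rule;
        # 0.2 is an arbitrary illustrative value
        return 0.2

gkde2 = gaussian_kde_smallbw(xn)
kdepdf2 = gkde2.evaluate(ind)

Whether a fixed factor like this actually gives a better fit for the mixture is exactly the kind of question to look at in more detail.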

But I'm out of time for today. We will dig into this next time.

Saturday, March 14, 2009

Warmup: Fitting Distributions

As a warmup exercise, I generate some random samples, fit two distributions, and plot the results. I also calculate the Kolmogorov-Smirnov test using scipy.stats.kstest.

This is a shortened, simplified version of a script that I wrote to see how the distributions in scipy.stats can be used to automatically fit some data and select the best fitting distribution.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def plothist(x, distfn, args, loc, scale, right=1):
    '''plot histogram and pdf, based on matplotlib doc example'''
    plt.figure()
    # the histogram of the data
    n, bins, patches = plt.hist(x, 25, normed=1, facecolor='green', alpha=0.75)
    maxheight = max([p.get_height() for p in patches])
    axlim = list(plt.axis())
    axlim[-1] = maxheight*1.05

    # add more points for density plot
    pdfpoints = bins + np.diff(bins)[0]*np.linspace(-0.5,0.5,11)[:,np.newaxis]
    pdfpoints = np.sort(pdfpoints.ravel())

    # calculate and plot the estimated density function
    yt = distfn.pdf(pdfpoints, loc=loc, scale=scale, *args)
    yt[yt>maxheight] = maxheight
    lt = plt.plot(pdfpoints, yt, 'r--', linewidth=1, label='estimated')
    # calculate and plot the density function that generated the data
    ys = stats.t.pdf(pdfpoints, 10, scale=10)*right   # DGP parameters (df=10, scale=10) hardcoded
    ls = plt.plot(pdfpoints, ys, 'b-', linewidth=1, label='true')

    plt.legend()
    plt.xlabel('values')
    plt.ylabel('Probability')
    plt.title(r'$\mathrm{Fitting\ Distribution\ %s :}\ \mu=%f,\ \sigma=%f$'%(distfn.name,loc,scale))
    plt.grid(True)
    plt.draw()


n = 500
dgp_arg = 10
dgp_scale = 10
np.random.seed(6543789)
rvs = stats.t.rvs(dgp_arg, scale=dgp_scale, size=n)
sm = rvs.mean()
sstd = np.sqrt(rvs.var())
ssupp = (rvs.min(), rvs.max())

for distr in [stats.norm, stats.t]:
    distname = distr.name
    # estimate parameters
    par_est = distr.fit(rvs,loc=sm, scale=sstd)
    print '\nFitting distribution %s' % distname
    print 'estimated distribution parameters\n', par_est
    arg_est = par_est[:-2]  # get shape parameters if any
    loc_est = par_est[-2]
    scale_est = par_est[-1]
    rvs_normed = (rvs-loc_est)/scale_est
    ks_stat, ks_pval = stats.kstest(rvs_normed,distname, arg_est)
    print 'ks-stat = %f, ks-pval = %f' % (ks_stat, ks_pval)
    plothist(rvs, distr, arg_est, loc_est, scale_est, right = 1)
    plt.savefig('ex_dist1_%s.png'% distname)

#plt.show() # if we want to see it on screen

Output

ex_dist1_norm.png
ex_dist1_t.png

The script produces the following output for the parameter estimates and the Kolmogorov-Smirnov test:

Fitting distribution norm
estimated distribution parameters
[ -0.70287027  11.22248481]
ks-stat = 0.037073, ks-pval = 0.493706

Fitting distribution t
estimated distribution parameters
[ 7.8518085  -0.69695469  9.71315677]
ks-stat = 0.019562, ks-pval = 0.990926

The p-values of the Kolmogorov-Smirnov test are derived under the assumption that we test against a fully known distribution, not against a distribution with estimated parameters. But in this example the numbers look pretty informative: the p-values are large for both distributions, from which I conclude that both distributions fit the sample reasonably well overall. The p-value of the t-distribution is about twice the p-value of the normal distribution, which strongly suggests that the sample was generated by a t-distribution and not by a normal distribution. Although, if we looked only at the p-value of the normal distribution, we would not reject the hypothesis that the sample was generated by a normal distribution.
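
For comparison, since we know the true data generating process here, we could also run the test against the fully specified t-distribution, which is the situation the tabulated p-values are actually valid for. A small snippet, reusing rvs, dgp_arg and dgp_scale from the script above (output not shown):

# test against the fully specified true distribution (no estimated parameters),
# which is the case the kstest p-values are derived for
print stats.kstest(rvs, 't', (dgp_arg, 0, dgp_scale))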

However, I want to emphasize that this is an informal interpretation, one in which we can be quite confident, and which we know to be correct since we generated the data, but it is not the result of a formal statistical test.
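
The longer script mentioned at the beginning also tries to select the best fitting distribution automatically. A minimal, hypothetical sketch of such a selection step, reusing the variables from the script above and simply ranking the candidates by their KS p-value (my own simplification, not the original code), could look like this:

# collect the KS p-values for each candidate distribution and keep the
# candidate with the largest p-value
ks_pvals = {}
for distr in [stats.norm, stats.t]:
    par_est = distr.fit(rvs, loc=sm, scale=sstd)
    rvs_normed = (rvs - par_est[-2]) / par_est[-1]
    ks_pvals[distr.name] = stats.kstest(rvs_normed, distr.name, par_est[:-2])[1]
best = max(ks_pvals, key=ks_pvals.get)
print 'best fitting distribution by KS p-value:', best

With the numbers shown above, this would pick the t-distribution, which matches the data generating process.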

script file ex_dist1.py

Editorial comments:

Sphinx and restructured text work well, and the highlighting with pygments also works without additional intervention, but adding the graphs to blogger requires some manual uploading and editing of the formatting, which still takes too much time. I haven't figured out yet how to upload source files; the upload file button seems to be missing from my blog editor.

Update: some google searches later, it seems that it is not possible to attach regular text files to blogger, so the script needs to be hosted somewhere else.

Introduction

I am trying to improve the statistics coverage in Python using numpy and scipy. Here, I want to post some code examples and comments along the way.

The text and code are written in restructured text and then processed with sphinx, which I hope produces correct formatting and highlighting of the code snippets. There will still be some stumbling until this works correctly.

A note on the copyright: Since numpy and scipy are BSD, I will be looking mostly at BSD or MIT licensed code. My code snippets are public domain, longer code parts are under the same BSD license as numpy and scipy, unless a different license is specified.