Sunday, March 4, 2012

Numerical Accuracy in Linear Least Squares and Rescaling

The Problem

(Warning upfront: there are problems replicating these results, more below.)

A week ago, I stumbled on this Numerical_Comparison_of_Statistical_Software page, which presents some test results for the numerical accuracy of statistical packages.

For linear regression, there is one test, Longley, that we have in the datasets in statsmodels. But I wanted to look at Filip, which sounds like a difficult case: neither SAS nor SPSS produced a solution. Let's see how badly statsmodels and numpy are doing, or maybe not.

The model is a polynomial of degree 10. Description, data, certified values and a plot are on the NIST website here.

1 Predictor Variable
82 Observations
Higher Level of Difficulty
Model: Polynomial, 11 Parameters

I parsed the data into an array dta: the first column is the endogenous variable, y, the second column is the exogenous variable, x. I saved y in endog. I also parsed the main NIST results into params_nist: the first column holds the parameters, the second their standard deviations.
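Roughly, the setup looks like this (a sketch, not the original code; the file names are placeholders for wherever the data block and the certified values from the NIST Filip.dat file were saved):

import numpy as np

# data block of Filip.dat saved as plain text: columns y, x
dta = np.loadtxt('filip_data.txt')
endog = dta[:, 0]

# certified values saved as plain text: columns estimate, standard deviation
params_nist = np.loadtxt('filip_certified.txt')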

Fitting a Polynomial

Since it is a polynomial problem, let us treat it as such and use the polynomials from numpy.

First try: use the old polyfit function.

>>> p_params = np.polyfit(dta[:,1], endog, 10)[::-1]
>>> p_params
array([-1467.48963361, -2772.17962811, -2316.37111156, -1127.97395547,
        -354.47823824,   -75.1242027 ,   -10.87531817,    -1.062215  ,
          -0.06701912,    -0.00246781,    -0.0000403 ])
>>> log_relative_error(p_params, params_nist[:,0])
array([ 7.87929761,  7.88443445,  7.88840683,  7.89138269,  7.89325784,
        7.89395336,  7.89341841,  7.89162977,  7.88859034,  7.88432427,
        7.87887292])

Not bad. Following the description on the Wikipedia page, I wrote a function log_relative_error that tells us how many significant digits of agreement there are between the two arrays. polyfit agrees at 7 to 8 significant digits; that's about the same as S-Plus on the Wikipedia page.
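For reference, here is a minimal sketch of such a function following that definition (my actual version also has to handle zero certified values and exact agreement, which this sketch ignores):

import numpy as np

def log_relative_error(estimate, certified):
    # LRE = -log10(|estimate - certified| / |certified|),
    # roughly the number of significant digits of agreement
    estimate = np.asarray(estimate, dtype=float)
    certified = np.asarray(certified, dtype=float)
    return -np.log10(np.abs(estimate - certified) / np.abs(certified))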

Let's work properly with polynomials and use the new polynomial package in numpy. Charles Harris wrote it and is still expanding and improving it.

>>> poly = np.polynomial.Polynomial.fit(dta[:,1],endog, 10)
>>> poly
Polynomial([ 0.88784146,  0.10879327, -0.53636698,  0.28747072,  2.20567367,
       -1.31072158, -4.21841581,  1.76229897,  3.8096025 , -0.77251557,
       -1.30327368], [-8.78146449, -3.13200249])

Oops, these numbers don't look like the NIST numbers. The last numbers, [-8.78146449, -3.13200249], show the domain of the polynomial: the fit internally maps our x values from this domain to the window (-1, 1), so the coefficients refer to the transformed variable. A bit of introspection, and we figure out how to change the domain. To get the "standard" representation in terms of the original x, we can convert the domain back to (-1, 1):

>>> poly.convert(domain=(-1,1))
Polynomial([-1467.48961423, -2772.17959193, -2316.37108161, -1127.97394098,
        -354.4782337 ,   -75.12420174,   -10.87531804,    -1.06221499,
          -0.06701912,    -0.00246781,    -0.0000403 ], [-1.,  1.])

Now this looks more like NIST; it even agrees at 13 to 14 significant digits:

>>> log_relative_error(poly.convert(domain=(-1,1)).coef, params_nist[:,0])
array([ 13.72347502,  13.84056851,  13.81494335,  13.70878715,
        13.78207216,  13.79374075,  13.6729684 ,  13.71128925,
        13.75445952,  13.68695573,  13.67736436])

Nice job Charles. No problem fitting this polynomial with numpy.
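As a sanity check (not from the original session), the converted coefficients should describe the same function of the original x as the fitted polynomial in its internal representation:

coef_x = poly.convert(domain=(-1, 1)).coef
# poly handles the domain mapping itself when evaluated, so both give the same values
np.allclose(poly(dta[:, 1]),
            np.polynomial.polynomial.polyval(dta[:, 1], coef_x))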

Linear Regression

In the previous part we knew we were fitting a polynomial, but let's treat it just as a standard linear regression problem and use scikits.statsmodels.

First try: just create the design matrix in the simplest way and estimate

>>> exog0 = dta[:,1:]**np.arange(11)
>>> res0 = sm.OLS(endog, exog0).fit()
>>> res0.params
array([ 8.443046917097718,  1.364996059973237, -5.350750046084954,
       -3.34190287892045 , -0.406458629495091,  0.257727311296307,
        0.119771653524165,  0.023140891929892,  0.002403995206457,
        0.000131618839544,  0.000002990001222])
>>> log_relative_error(res0.params, params_nist[:,0])
array([-0.002491507096328, -0.000213790029725,  0.00100436814061 ,
        0.001288615104161,  0.000498264786078, -0.00148737673275 ,
       -0.004756810105056, -0.009359738327099, -0.015305377783833,
       -0.022566206229652, -0.031085341541384])

Bummer, 0 significant digits, way off.

We can print the full summary of the results; maybe we'll see something:

>>> print res0.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.996
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     2390.
Date:                Sat, 03 Mar 2012   Prob (F-statistic):           1.85e-84
Time:                        23:47:45   Log-Likelihood:                 344.73
No. Observations:                  82   AIC:                            -673.5
Df Residuals:                      74   BIC:                            -654.2
Df Model:                           7
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          8.4430     12.864      0.656      0.514       -17.189    34.075
x1             1.3650      6.496      0.210      0.834       -11.578    14.308
x2            -5.3508      9.347     -0.572      0.569       -23.974    13.273
x3            -3.3419     11.702     -0.286      0.776       -26.659    19.975
x4            -0.4065      5.923     -0.069      0.945       -12.209    11.396
x5             0.2577      1.734      0.149      0.882        -3.197     3.712
x6             0.1198      0.321      0.373      0.710        -0.520     0.759
x7             0.0231      0.038      0.604      0.548        -0.053     0.099
x8             0.0024      0.003      0.838      0.405        -0.003     0.008
x9             0.0001      0.000      1.072      0.287        -0.000     0.000
x10          2.99e-06   2.29e-06      1.303      0.197     -1.58e-06  7.56e-06
==============================================================================
Omnibus:                        1.604   Durbin-Watson:                   1.627
Prob(Omnibus):                  0.449   Jarque-Bera (JB):                1.379
Skew:                          -0.317   Prob(JB):                        0.502
Kurtosis:                       2.961   Cond. No.                        -1.#J
==============================================================================

The smallest eigenvalue is  -0.38. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

R squared is 0.996, so we are fitting the curve pretty well, but our polynomial design matrix is numerically not positive definite; there is even a negative eigenvalue. A negative eigenvalue sounds strange, since X'X is positive semidefinite in exact arithmetic and shouldn't have any, so it can only be floating point noise. Just to make sure that this is not a bug, check with numpy:

>>> np.linalg.eigvalsh(np.dot(exog0.T, exog0)).min()
-0.38006444279775781
>>> np.linalg.eigvals(np.dot(exog0.T, exog0)).min()
-0.00011161167956978682

I'm still suspicious, but I delay the detour into numpy's and scipy's linalg modules.

One more check of our regression results: the residual standard error is not very far from the NIST value:

>>> np.sqrt(res0.mse_resid), 0.334801051324544E-02,
(0.0038044343586352601, 0.0033480105132454399)

Conclusion: if you fit a linear regression with a design matrix whose moment matrix is numerically singular (not positive definite), then the parameters are not identified, but we can still get a good fit.
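One way to quantify this (a sketch using the objects from above) is to compare the fitted values of the badly identified regression with the curve implied by the NIST certified parameters:

# fitted curve from the "wrong" parameters versus the curve implied by the
# certified parameters; both fit the data well even though the coefficients differ
fitted_nist = np.dot(exog0, params_nist[:, 0])
max_abs_diff = np.max(np.abs(res0.fittedvalues - fitted_nist))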

(Technical aside: statsmodels uses by default the generalized inverse, pinv, for linear regression. So it just drops the singular values below a threshold close to zero. The parameter estimates will then be closer to a penalized Ridge regression. But don't quote me on the last part, since I don't remember where I read that pinv is the limit of a Ridge problem.)
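Roughly, a pinv based least squares solve works like the following sketch (this is not the statsmodels code, just the idea): singular values below a cutoff are treated as zero, and the corresponding directions are dropped from the solution.

# pseudoinverse solution via SVD with a cutoff on the singular values
u, s, vt = np.linalg.svd(exog0, full_matrices=False)
rcond = 1e-15                               # cutoff relative to the largest singular value
s_inv = np.where(s > rcond * s.max(), 1.0 / s, 0.0)
params_pinv = np.dot(vt.T, s_inv * np.dot(u.T, endog))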

The question for statsmodels is what to do about it.

One solution that works in this case, as we have seen with numpy polynomials, is to rescale the explanatory variables or the design matrix. I show one example below. My working title for this post was: Don't do this, or do we have to do it for you? Is it the responsibility of the user not to use a design matrix that numerically doesn't make much sense, so that we can only warn, or should we automatically transform the design matrix to make it numerically more stable? The latter would be costly and might not be required in 99% of the cases.

Another issue is that there are many different ways to do the linear algebra, and we have not investigated much what might work better or worse in different cases. See Addendum below for the effect that linear algebra can have in numerically unstable problems.

Rescaling the Design Matrix

Our design matrix looks pretty bad: the variables vary over a large range and the correlations are very high.

>>> np.set_printoptions(precision=3)
>>> print np.max(np.abs(exog0),0)
[  1.000e+00   8.781e+00   7.711e+01   6.772e+02   5.947e+03   5.222e+04
   4.586e+05   4.027e+06   3.536e+07   3.105e+08   2.727e+09]
>>> print np.corrcoef(exog0[:,1:], rowvar=0)
[[ 1.    -0.991  0.969 -0.938  0.904 -0.87   0.838 -0.808  0.782 -0.758]
 [-0.991  1.    -0.993  0.975 -0.951  0.925 -0.899  0.874 -0.851  0.83 ]
 [ 0.969 -0.993  1.    -0.994  0.981 -0.963  0.943 -0.924  0.904 -0.886]
 [-0.938  0.975 -0.994  1.    -0.996  0.986 -0.973  0.958 -0.943  0.928]
 [ 0.904 -0.951  0.981 -0.996  1.    -0.997  0.99  -0.98   0.968 -0.957]
 [-0.87   0.925 -0.963  0.986 -0.997  1.    -0.998  0.992 -0.985  0.976]
 [ 0.838 -0.899  0.943 -0.973  0.99  -0.998  1.    -0.998  0.994 -0.988]
 [-0.808  0.874 -0.924  0.958 -0.98   0.992 -0.998  1.    -0.999  0.995]
 [ 0.782 -0.851  0.904 -0.943  0.968 -0.985  0.994 -0.999  1.    -0.999]
 [-0.758  0.83  -0.886  0.928 -0.957  0.976 -0.988  0.995 -0.999  1.   ]]

Now we can use just the simplest transform: limit the maximum absolute value of each column to one:

>>> exog1 = exog0 / np.max(np.abs(exog0), 0)

After running the regression on the rescaled design matrix, we see an agreement with the NIST benchmark results at around 7 to 8 significant digits both for the parameters and for the standard deviation of the parameter estimates, bse in statsmodels:

>>> res1 = sm.OLS(endog, exog1).fit()
>>> params_rescaled = res1.params / np.max(np.abs(exog0), 0)
>>> log_relative_error(params_rescaled, params_nist[:,0])
array([ 7.419,  7.414,  7.409,  7.402,  7.394,  7.384,  7.373,  7.36 ,
        7.346,  7.331,  7.314])
>>> bse_rescaled = res1.bse / np.max(np.abs(exog0),0)
>>> log_relative_error(bse_rescaled, params_nist[:,1])
array([ 8.512,  8.435,  8.368,  8.308,  8.255,  8.207,  8.164,  8.124,
        8.089,  8.057,  8.028])

Also, R squared and the standard deviation of the residuals (using the appropriate degrees of freedom) agree with the NIST numbers at around 10 and 7 digits, respectively.

>>> log_relative_error(res1.rsquared, 0.996727416185620)
10.040156510920081

>>> log_relative_error(np.sqrt(res1.mse_resid), 0.334801051324544E-02)
7.8575015681097371

So we are doing pretty well with just a simple rescaling of the variables, although the comment at the end of print res1.summary() still reports a smallest eigenvalue of -1.51e-15, so essentially zero. But I will worry about this later. I initially looked at another way of rescaling the design matrix, but haven't checked yet how the choice of rescaling affects the results.

Addendum 1: Smallest eigenvalue of ill-conditioned array

Going back to the original design matrix without rescaling, define the moment matrix X'X:

>>> xpx0 = np.dot(exog0.T, exog0)

The eigenvalues, assuming a symmetric matrix, are:

>>> np.sort(np.linalg.eigvalsh(xpx0))
array([ -3.79709e-01,   1.14869e-05,   4.40507e-03,   3.20670e+00,
         7.91804e+02,   1.05833e+03,   3.98410e+05,   2.31485e+08,
         4.28415e+11,   1.93733e+15,   5.17955e+19])

This looks very badly conditioned: the largest eigenvalue is 5e19, the smallest is "around zero".
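The same thing can be summarized in a single number with the condition number (a quick check, not from the original session):

cond_x = np.linalg.cond(exog0)    # 2-norm condition number of the design matrix
cond_xpx = cond_x**2              # roughly that of the moment matrix X'X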

We can compare different algorithms for calculating the smallest eigenvalues (splinalg is scipy.linalg):

>>> np.sort(np.linalg.eigvals(xpx0))[:4]
array([  3.41128e-04,   5.58946e-04,   1.23213e-02,   3.33365e+00])
>>> np.sort(splinalg.eigvalsh(xpx0))[:4]
array([ -2.14363e+03,  -2.00323e-01,   1.26094e-05,   4.40956e-03])
>>> np.sort(splinalg.eigvals(xpx0))[:4]
array([ -3.66973e-05+0.j,   1.61750e-04+0.j,   7.90465e-03+0.j,
         2.01592e+00+0.j])

>>> np.sort(np.linalg.svd(xpx0)[1])[:4]
array([  2.84057e-05,   4.91555e-04,   7.28252e-03,   3.41739e+00])
>>> np.sort(splinalg.svd(xpx0)[1])[:4]
array([  2.19202e-05,   7.11920e-04,   7.00790e-03,   3.28229e+00])

>>> np.sort(np.linalg.svd(exog0)[1]**2)[:4]
array([  1.65709e-11,   3.08225e-08,   2.48138e-05,   1.08036e-02])
>>> np.sort(splinalg.svd(exog0)[1]**2)[:4]
array([  1.65708e-11,   3.08225e-08,   2.48138e-05,   1.08036e-02])

So we see that the estimates of the smallest eigenvalue are pretty much all over the place, ranging from -2.1e3 and -0.38 to small positive values around 2.8e-05. The last version, the singular value decomposition of exog0 itself, is the closest to what statsmodels uses with pinv. It also looks like I picked the worst algorithm for the regression summary in this case.

Warning: calculations at machine precision are not necessarily deterministic, in the sense that if you run them repeatedly you might not always get the same results. There are several reports on the scipy and numpy mailing lists that the results might "randomly" switch between several patterns. And the results won't agree across different operating systems, compilers and versions of the linear algebra libraries. So I don't expect that these results can be replicated exactly.

Addendum 2: scipy.linalg versus numpy.linalg

To avoid getting these changing results every time I re-ran the script while preparing this post, I changed the statsmodels source to use scipy.linalg.pinv instead of numpy.linalg.pinv. I expected more replicable results; however, what I found is:

>>> exog0 = dta[:,1:]**np.arange(11)
>>> res0 = sm.OLS(endog, exog0).fit()
>>> log_relative_error(res0.params, params_nist[:,0])
array([ 5.31146488,  5.7400516 ,  6.53794562,  6.81318335,  6.81855769,
        7.22333339,  8.13319742,  7.38788711,  7.24457806,  7.18580677,
        7.12494176])
>>> log_relative_error(res0.bse, params_nist[:,1])
array([ 2.25861611,  2.25837872,  2.25825903,  2.25822427,  2.2582245 ,
        2.25823174,  2.25823693,  2.25823946,  2.25824058,  2.25824108,
        2.25824165])

Just by changing the algorithm that calculates the generalized inverse, I get agreement with the NIST data at 5 to 7 significant digits for the parameters and 2 digits for the standard error of the parameter estimates even with the very ill-conditioned original design matrix. That doesn't look so bad, much better than when using the numpy.linalg version.

(But I need to write proper tests and look at this again before I can trust the results. I now have two python sessions open, one that imported the original statsmodels source and one that imported it after the change. Also, when I ran this regression repeatedly, the numbers changed once, but remained within the same neighborhood. Besides the different algorithm, there is also rcond, which defines the cutoff in pinv; I didn't check whether that differs between the numpy and scipy versions.)
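As a side note, the two pinv routines can also be compared directly on the design matrix without touching the statsmodels source (a sketch; the default cutoffs of the two routines may differ across versions, and np.linalg.pinv takes an rcond argument for experimenting with the cutoff):

from scipy import linalg as splinalg

params_np = np.dot(np.linalg.pinv(exog0), endog)
params_sp = np.dot(splinalg.pinv(exog0), endog)
lre_np = log_relative_error(params_np, params_nist[:, 0])
lre_sp = log_relative_error(params_sp, params_nist[:, 0])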

Conclusions

I think this test case on the NIST site is very well "cooked" to test the numerical accuracy of a linear regression program. The main lesson is that we shouldn't throw a numerically awful problem at a statistical package unless we know that the package takes care of the basic tricks for making the problem numerically more stable for us. It's safer to make sure our design matrix is numerically sound.

Also, if we just want to estimate a polynomial function, then we should use that information and a specialized algorithm, or, even better, an orthogonal polynomial basis instead of power polynomials.
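For example, numpy's Chebyshev class fits in an orthogonal basis on a rescaled domain, and the result can still be converted to power series coefficients in the original x afterwards if those are needed (a sketch, with dta and endog as above):

import numpy as np

cheb = np.polynomial.Chebyshev.fit(dta[:, 1], endog, 10)
# power basis coefficients in terms of the original x, if they are really needed
coef_power = cheb.convert(domain=(-1, 1), kind=np.polynomial.Polynomial).coef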

What does it mean for statistical analysis?

That, I'm not so sure about. Multicollinearity is a serious issue, and there are various approaches for dealing with it. But my attitude so far has been:

If you work with real data and run into numerical problems, it's not a problem with numerical accuracy but with your data, or with your model.

We should still use basic precautions like scaling our variables appropriately, but if we have high multicollinearity, then it mainly means that the model we specified is asking for information that's not in the data. In certain directions the data is not informative enough to reliably identify some parameters. Given measurement errors, noise in the data and misspecification, there are many other parts to worry about before machine precision becomes important. For a related discussion see this thread on the statsmodels mailing list.

I tried before to come up with a case where standardizing (z-scoring) the design matrix helps to improve the precision of the estimates, but I didn't manage. Whether I z-scored or not, the results were essentially the same. Now I have a test case to add to statsmodels. I am sceptical about automatic rescaling, but I started a while ago to look into how to make it easier for users to apply predefined transforms in statsmodels, instead of having to code them from scratch.

I'm not an expert in numerical analysis and don't intend to become one; my "numerical incompetence" has improved only a bit since this, although I now know a bit more linear algebra.

I put a script with the NIST case in this gist. I haven't yet copied over the parts from the interpreter sessions.

A final comment:

I don't like long interpreter sessions; I usually convert everything to a script as fast as possible. For this post, I copied everything directly from the session. After cleaning up the original script a bit, I'm getting different numbers for the log relative error (LRE). I'm now using scipy.linalg.pinv inside statsmodels, and the LRE in this case is a measure of behavior at machine precision; it bounces anywhere between 5 and 8. This is a good result in that we can still get estimates with reasonable precision, but it makes the LRE unreliable for replicating the results. I will make a proper script and unit test later, so that I can be more certain about how much the numbers change and whether there isn't a bug somewhere in my "live" session.

Saturday, March 3, 2012

Data "Analysis" in Python

I'm catching up with some Twitter feeds and other information on the internet about the PyData Workshop.

There is a big effort in the Python/Numpy/SciPy community to get into the "Big Data" and data processing market.

Even the creator of Python was at the workshop and took note of it.

Guido van Rossum  -  Yesterday 9:05 PM  -  Public
Pandas: a data analysis library for Python, poised to give R a run for its money

I think Python is well suited for this. Python in combination with numpy and scipy has been my favorite language for statistics and econometrics coding for four years. I have been working for several years now on improving "Statistics in Python", both in scipy.stats and statsmodels.

Since the PyData Workshop didn't include anything about statistics or econometrics, it looks like my view is a bit out of the mainstream. The blogosphere is awash with articles about what's hype and what's reality behind BIG DATA. (I can't find the links to the articles I liked, but SAS might have a realistic view: Is big data overhyped)

However, what came to my mind reading the buzz surrounding the PyData Workshop is more personal and specific to software development in Python.

My first thoughts can be roughly summarized with

You know that you are out of date, if

  • you like mailing lists. [1]
  • you signed up for Twitter and never posted anything.
  • you signed up for Google plus and never posted anything.
  • you read the Twitter feed of others once a month.
  • you don't even know how to link to a Twitter message.

You know you don't do the popular things, if

  • you spend two days checking the numerical accuracy of your algorithm for a case with bad data instead of trying to calculate it in the cloud.
  • you spend a week writing test cases verifying your code against other packages, instead of waiting for the bug reports from users.
  • you spend your time figuring out skew, kurtosis and fat tails, and everyone thinks the world is normal, (normally distributed, that is).
  • you think you can do "fancy" econometrics in Python, when users can just use STATA.
  • you think you can do "fancy" statistics in Python, when users can just use R.
  • you think "Data Analysis" requires statistics and econometrics.

You know you are missing the boat (or the point), if

  • "all the best and brightest in the scipy/numpy community are doing a startup" [2], and you are not among them.
  • you are looking for your business plan, and you realize you never came up with one.
  • the "community" of your open source project consists mostly of two developers.
[1] example
[2] from this feed

Friday, March 2, 2012

10 lines of code and it took you 3 weeks?!

A question on the statsmodels mailing list reminded me of a pattern in software development that I seem to run into quite frequently.

The question was about estimating the parameters of a statistical distribution based on the characteristic function, using the generalized method of moments (GMM) with a continuum of moment conditions. I'm not going into details; suffice it to say the theory behind it is "a bit" technical. The suggested paper is a survey paper that summarizes the basic ideas and several applications. Of course, there are not enough details to base an implementation on a paper like this.

But this is not about GMM; in that case I would be reasonably familiar with the material. The case I have in mind is p-value correction for multiple testing.

Close to two years ago there was a question on the statsmodels mailing list about multiple testing corrections in post-hoc tests, specifically Tukey's_range_test.

Never heard of it, what's that? A quick look at Wikipedia, but that's all I know.

But the search also shows that this topic seems to be pretty important. However, Tukey's and the other post-hoc tests are a bit old; searching for "multiple testing" and "multiple comparisons" turns up a large number of papers published in recent years, and there is FDR, the false discovery rate. Now, what's that?

Fast forward:

Reading and skimming articles, especially survey articles and articles with Monte Carlo studies, and some SAS documentation, to figure out what is worth the trouble and what is not, and then finding papers that look like they have enough detail to base an implementation on.

Starting to code and trying to translate the verbal descriptions of various step-up and step-down procedures into code, with lots of false starts and convoluted loops. (Summer break in the coffee shop next to the swimming pool.)

Finally, I figure out the pattern, and find some R packages to compare my results with. (Christmas break.)

The final result is something like

# pvals are sorted in increasing order at this point; ntests is the number of tests
elif method.lower() in ['hs', 'holm-sidak']:
    # step-down Sidak: 1 - (1 - p_(i))**(ntests - i + 1), made monotone from the smallest p up
    pvals_corrected_raw = 1 - np.power((1. - pvals), np.arange(ntests, 0, -1))
    pvals_corrected = np.maximum.accumulate(pvals_corrected_raw)

elif method.lower() in ['sh', 'simes-hochberg']:
    # step-up Hochberg: (ntests - i + 1) * p_(i), made monotone from the largest p down
    pvals_corrected_raw = np.arange(ntests, 0, -1) * pvals
    pvals_corrected = np.minimum.accumulate(pvals_corrected_raw[::-1])[::-1]

available in statsmodels as the function multipletests. I didn't know we could do things like np.minimum.accumulate with numpy before.

The code is actually still located in the sandbox, although it is imported through the main parts of statsmodels, since I ran out of steam after getting it to work and verifying its correctness against R. Actually, I ran out of steam (and Christmas break was over and I had to take care of some tickets for scipy.stats) after getting the cdf for the multivariate t distribution and multiple comparisons based on the full covariance matrix, kind of, to work. But here I found a helpful book on this for R, just published.

So, since then statsmodels has a function where part of the docstring is

method : string
    Method used for testing and adjustment of pvalues. Can be either the
    full name or initial letters. Available methods are ::

    `bonferroni` : one-step correction
    `sidak` : one-step correction
    `holm-sidak` :
    `holm` :
    `simes-hochberg` :
    `hommel` :
    `fdr_bh` : Benjamini/Hochberg
    `fdr_by` : Benjamini/Yekutieli

and some other functions.
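A usage sketch (hedged: the function started out in the sandbox and the import path has changed over time; in current statsmodels it is statsmodels.stats.multitest.multipletests):

import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.008, 0.02, 0.04, 0.3])
reject, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05,
                                              method='holm-sidak')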

PS: The title is exaggerated; it's a few hundred lines of code, and the sandbox module has close to 2000 lines that are partly not cleaned up and verified, but might be useful when I work on this or something similar again. The story is correct in the sense that I spent three weeks or more, with maybe 4 or 5 hours a day, to get the few lines of code that are at the core of these functions, and to understand what I was doing.

PS: This was about code development; multiple testing deserves its own story, and more emphasis and advertising, which I have only done on the statsmodels mailing list. I recently read Joseph P. Romano, Michael Wolf: Stepwise Multiple Testing as Formalized Data Snooping, which is also interesting for its application stories in economics and finance, but I haven't implemented their approach yet.

PS: This would be a lot easier if I were writing GPL instead of BSD licensed code, since then I could just check the R source.

PS: I don't have to work this long just to get started on the right implementation if the problem is more traditional econometrics. There, it's usually just a matter of checking which variation people use and some details.

PS: There are too many PS.