"# Implementing the least squares fit method for linear regression and speeding it up via Cython"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a name=\"sections\"></a>\n",
"<br>\n",
"<br>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Sections"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Introduction](#introduction)\n",
"- [Least squares fit implementations](#implementations)\n",
"- [Generating sample data and benchmarking](#sample_data)\n",
"- [Compiling the Python code via Cython in the IPython notebook](#cython_nb)\n",
"- [Bonus: How to use Cython without the IPython magic](#cython_bonus)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a name=\"introduction\"></a>\n",
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Linear regression via the least squares method is the simplest approach to performing a regression analysis of a dependent and a explanatory variable. The objective is to find the best-fitting straight line through a set of points that minimizes the sum of the squared offsets from the line. \n",
"The offsets come in 2 different flavors: perpendicular and vertical - with respect to the line. \n",
" \"\"\" Computes the least-squares solution to a linear matrix equation. \"\"\"\n",
" x_avg = sum(x)/len(x)\n",
" y_avg = sum(y)/len(y)\n",
" var_x = sum([(x_i - x_avg)**2 for x_i in x])\n",
" cov_xy = sum([(x_i - x_avg)*(y_i - y_avg) for x_i,y_i in zip(x,y)])\n",
" slope = cov_xy / var_x\n",
" y_interc = y_avg - slope*x_avg\n",
" return (slope, y_interc)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Using the lstsq numpy function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For our convenience, `numpy` has a function that can also compute the leat squares solution of a linear matrix equation. For more information, please refer to the [documentation](http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def numpy_lstsqr(x, y):\n",
" \"\"\" Computes the least-squares solution to a linear matrix equation. \"\"\"\n",
" X = np.vstack([x, np.ones(len(x))]).T\n",
" return np.linalg.lstsq(X,y)[0]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Using the linregress scipy function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last approach is using `scipy.stats.linregress()`, which returns a tuple of 5 different attributes, where the 1st value in the tuple is the slope, and the second value is the y-axis intercept, respectively. The documentation for this function can be found [here](http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.linregress.html)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import scipy.stats\n",
"\n",
"def scipy_lstsqr(x,y):\n",
" \"\"\" Computes the least-squares solution to a linear matrix equation. \"\"\"\n",
" return scipy.stats.linregress(x, y)[0:2]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a name='sample_data'></a>\n",
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Generating sample data and benchmarking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to test our different least squares fit implementation, we will generate some sample data:\n",
"- 500 sample points for the x-component within the range [0,500) \n",
"- 500 sample points for the y-component within the range [100,600) \n",
"\n",
"where each sample point is multiplied by a random value within\n",
"the range [0.8, 12)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import random\n",
"random.seed(12345)\n",
"\n",
"x = [x_i*random.randrange(8,12)/10 for x_i in range(500)]\n",
"y = [y_i*random.randrange(8,12)/10 for y_i in range(100,600)]"
"To check how our dataset is distributed, and how the straight line, which we obtain via the least square fit method, we will plot it in a scatter plot. \n",
"Note that we are using our \"matrix approach\" here for simplicity, but after plotting the data, we will check whether all of the four different implementations yield the same parameters."
"#### Comparing the results from the different implementations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, let us confirm that the different implementation computed the same parameters (i.e., slope and y-axis intercept) as solution for the linear equation."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import prettytable\n",
"\n",
"params = [appr(x,y) for appr in [lin_lstsqr_mat, classic_lstsqr, numpy_lstsqr, scipy_lstsqr]]\n",
"For a first impression how the performances of the different least squares implementations compare against each other, let us do a quick benchmark using the `timeit` module via IPython's `%timeit` magic."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import timeit\n",
"\n",
"for lab,appr in zip([\"matrix approach\",\"classic approach\",\n",
" \"numpy function\",\"scipy function\"],\n",
" [lin_lstsqr_mat, classic_lstsqr, \n",
" numpy_lstsqr, scipy_lstsqr]):\n",
" print(\"\\n{}: \".format(lab), end=\"\")\n",
" %timeit appr(x, y)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"matrix approach: 1000 loops, best of 3: 270 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"classic approach: 100 loops, best of 3: 2.83 ms per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"numpy function: 1000 loops, best of 3: 372 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"scipy function: 1000 loops, best of 3: 594 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"The timing above indicates, that the \"classic\" approach (Python's standard library functions only) is significantly worse in performance than the other implemenations - roughly by a magnitude of 10."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>\n",
"<a name=\"cython_nb\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Compiling the Python code via Cython in the IPython notebook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Maybe we can speed things up a little bit via [Cython's C-extensions for Python](http://cython.org). Cython is basically a hybrid between C and Python and can be pictured as compiled Python code with type declarations. \n",
"Since we are working in an IPython notebook here, we can make use of the IPython magic: It will automatically convert it to C code, compile it, and load the function. \n",
"Just to be thorough, let us also compile the other functions, which already use numpy objects."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%load_ext cythonmagic"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%cython\n",
"import numpy as np\n",
"import scipy.stats\n",
"cimport numpy as np\n",
"\n",
"def cy_lin_lstsqr_mat(x, y):\n",
" \"\"\" Computes the least-squares solution to a linear matrix equation. \"\"\"\n",
" \"\"\" Computes the least-squares solution to a linear matrix equation. \"\"\"\n",
" x_avg = sum(x)/len(x)\n",
" y_avg = sum(y)/len(y)\n",
" var_x = sum([(x_i - x_avg)**2 for x_i in x])\n",
" cov_xy = sum([(x_i - x_avg)*(y_i - y_avg) for x_i,y_i in zip(x,y)])\n",
" slope = cov_xy / var_x\n",
" y_interc = y_avg - slope*x_avg\n",
" return (slope, y_interc)\n",
"\n",
"def cy_numpy_lstsqr(x, y):\n",
" \"\"\" Computes the least-squares solution to a linear matrix equation. \"\"\"\n",
" X = np.vstack([x, np.ones(len(x))]).T\n",
" return np.linalg.lstsq(X,y)[0]\n",
"\n",
"def cy_scipy_lstsqr(x,y):\n",
" \"\"\" Computes the least-squares solution to a linear matrix equation. \"\"\"\n",
" return scipy.stats.linregress(x, y)[0:2]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Comparing the compiled Cython code to the original Python code"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import timeit\n",
"\n",
"for lab,appr in zip([\"matrix approach\",\"classic approach\",\n",
" \"numpy function\",\"scipy function\"],\n",
" [(lin_lstsqr_mat, cy_lin_lstsqr_mat), \n",
" (classic_lstsqr, cy_classic_lstsqr),\n",
" (numpy_lstsqr, cy_numpy_lstsqr),\n",
" (scipy_lstsqr, cy_scipy_lstsqr)]):\n",
" print(\"\\n\\n{}: \".format(lab))\n",
" %timeit appr[0](x, y)\n",
" %timeit appr[1](x, y)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"matrix approach: \n",
"1000 loops, best of 3: 274 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"1000 loops, best of 3: 260 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"\n",
"classic approach: \n",
"100 loops, best of 3: 2.82 ms per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"1000 loops, best of 3: 212 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"\n",
"numpy function: \n",
"1000 loops, best of 3: 379 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"1000 loops, best of 3: 370 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"\n",
"\n",
"scipy function: \n",
"1000 loops, best of 3: 608 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"100 loops, best of 3: 613 \u00b5s per loop"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>\n",
"As we've seen before, our \"classic\" implementation of the least square method is pretty slow compared to using numpy's functions. This is not surprising, since numpy is highly optmized and uses compiled C/C++ and Fortran code already. This explains why there is no significant difference if we used Cython to compile the numpy-objects-containing functions. \n",
"However, we were able to speed up the \"classic approach\" quite significantly, roughly by 1500%.\n",
"\n",
"The following 2 code blocks are just to visualize our results in a bar plot."
"## Bonus: How to use Cython without the IPython magic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"IPython's notebook is really great for explanatory analysis and documentation, but what if we want to compile our Python code via Cython without letting IPython's magic doing all the work? \n",
"These are the steps you would need."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 1. Creating a .pyx file containing the the desired code or function."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%%file ccy_classic_lstsqr.pyx\n",
"\n",
"def ccy_classic_lstsqr(x, y):\n",
" \"\"\" Computes the least-squares solution to a linear matrix equation. \"\"\"\n",
" x_avg = sum(x)/len(x)\n",
" y_avg = sum(y)/len(y)\n",
" var_x = sum([(x_i - x_avg)**2 for x_i in x])\n",
" cov_xy = sum([(x_i - x_avg)*(y_i - y_avg) for x_i,y_i in zip(x,y)])\n",