python_reference/benchmarks/pandas_sum_tricks.ipynb

{
"metadata": {
"name": "",
"signature": "sha256:3de4720b58999a1f88844021c43acd1d6d6db6da3315538f9faac86a69424446"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"%load_ext watermark \n",
"%watermark -d -v -a 'Sebastian Raschka' -p numpy,pandas"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"The watermark extension is already loaded. To reload it, use:\n",
" %reload_ext watermark\n",
"Sebastian Raschka 24/12/2014 \n",
"\n",
"CPython 3.4.2\n",
"IPython 2.3.1\n",
"\n",
"numpy 1.9.1\n",
"pandas 0.15.2\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"4 Simple Tricks To Speed up the Sum Calculation in Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I wanted to improve the performance of some passages in my code a little bit and found that some simple tweaks can speed up the `pandas` section significantly. I thought that it might be one useful thing to share -- and no Cython or just-in-time compilation is required! "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In my case, I had a large dataframe where I wanted to calculate the sum of specific columns for different combinations of rows (approx. 100,000,000 of them, that's why I was looking for ways to speed it up). Anyway, below is a simple toy DataFrame to explore the `.sum()` method a little bit."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"df = pd.DataFrame()\n",
"\n",
"for col in ('a', 'b', 'c', 'd'):\n",
" df[col] = pd.Series(range(1000), index=range(1000))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.tail()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" <th>d</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>995</th>\n",
" <td> 995</td>\n",
" <td> 995</td>\n",
" <td> 995</td>\n",
" <td> 995</td>\n",
" </tr>\n",
" <tr>\n",
" <th>996</th>\n",
" <td> 996</td>\n",
" <td> 996</td>\n",
" <td> 996</td>\n",
" <td> 996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>997</th>\n",
" <td> 997</td>\n",
" <td> 997</td>\n",
" <td> 997</td>\n",
" <td> 997</td>\n",
" </tr>\n",
" <tr>\n",
" <th>998</th>\n",
" <td> 998</td>\n",
" <td> 998</td>\n",
" <td> 998</td>\n",
" <td> 998</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999</th>\n",
" <td> 999</td>\n",
" <td> 999</td>\n",
" <td> 999</td>\n",
" <td> 999</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
" a b c d\n",
"995 995 995 995 995\n",
"996 996 996 996 996\n",
"997 997 997 997 997\n",
"998 998 998 998 998\n",
"999 999 999 999 999"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's assume we are interested in calculating the sum of column `a`, `c`, and `d`, which would look like this:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.loc[:, ['a', 'c', 'd']].sum(axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"a 499500\n",
"c 499500\n",
"d 499500\n",
"dtype: int64"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, the `.loc` method is probably the most \"costliest\" one for this kind of operation. Since we are only intersted in the resulting numbers (i.e., the column sums), there is no need to make a copy of the array. Anyway, let's use the method above as a reference for comparison:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 1\n",
"%timeit -n 1000 -r 5 df.loc[:, ['a', 'c', 'd']].sum(axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 1.37 ms per loop\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although this is a rather small DataFrame (1000 x 4), let's see by how much we can speed it up using a different slicing method:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 2\n",
"%timeit -n 1000 -r 5 df[['a', 'c', 'd']].sum(axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 986 \u00b5s per loop\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let us use the Numpy representation of the `NDFrame` via the `.values` attribue:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 3\n",
"%timeit -n 1000 -r 5 df[['a', 'c', 'd']].values.sum(axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 687 \u00b5s per loop\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the speed improvements in #2 and #3 were not really a surprise, the next \"trick\" surprised me a little bit. Here, we are calculating the sum of each column separately rather than slicing the array."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"[df[col].values.sum(axis=0) for col in ('a', 'c', 'd')]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"[499500, 499500, 499500]"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 4\n",
"%timeit -n 1000 -r 5 [df[col].values.sum(axis=0) for col in ('a', 'c', 'd')]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 64.4 \u00b5s per loop\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, this is an almost 10x improvement!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One more thing: Let's try the Einstein summation convention [`einsum`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from numpy import einsum\n",
"[einsum('i->', df[col].values) for col in ('a', 'c', 'd')]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"[499500, 499500, 499500]"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 5\n",
"%timeit -n 1000 -r 5 [einsum('i->', df[col].values) for col in ('a', 'c', 'd')]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 55.7 \u00b5s per loop\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Conclusion:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using some simple tricks, the column sum calculation improved from 1370 to 55.7 \u00b5s per loop (approx. 25x faster!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"What about larger DataFrames?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, what does this trend look like for larger DataFrames?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import timeit\n",
"import random\n",
"from numpy import einsum\n",
"import pandas as pd\n",
"\n",
"def run_loc_sum(df):\n",
" return df.loc[:, ['a', 'c', 'd']].sum(axis=0)\n",
"\n",
"def run_einsum(df):\n",
" return [einsum('i->', df[col].values) for col in ('a', 'c', 'd')]\n",
"\n",
"orders = [10**i for i in range(4, 8)]\n",
"loc_res = []\n",
"einsum_res = []\n",
"\n",
"for n in orders:\n",
"\n",
" df = pd.DataFrame()\n",
" for col in ('a', 'b', 'c', 'd'):\n",
" df[col] = pd.Series(range(n), index=range(n))\n",
" \n",
" print('n=%s (%s of %s)' %(n, orders.index(n)+1, len(orders)))\n",
"\n",
" loc_res.append(min(timeit.Timer('run_loc_sum(df)' , \n",
" 'from __main__ import run_loc_sum, df').repeat(repeat=5, number=1)))\n",
"\n",
" einsum_res.append(min(timeit.Timer('run_einsum(df)' , \n",
" 'from __main__ import run_einsum, df').repeat(repeat=5, number=1)))\n",
"\n",
"print('finished')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"n=10000 (1 of 4)\n",
"n=100000 (2 of 4)"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"n=1000000 (3 of 4)"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"n=10000000 (4 of 4)"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"finished"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 23
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%matplotlib inline"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from matplotlib import pyplot as plt\n",
"\n",
"def plot_1():\n",
" \n",
" fig = plt.figure(figsize=(12,6))\n",
" \n",
" plt.plot(orders, loc_res, \n",
" label=\"df.loc[:, ['a', 'c', 'd']].sum(axis=0)\", \n",
" lw=2, alpha=0.6)\n",
" plt.plot(orders,einsum_res, \n",
" label=\"[einsum('i->', df[col].values) for col in ('a', 'c', 'd')]\", \n",
" lw=2, alpha=0.6)\n",
"\n",
" plt.title('Pandas Column Sums', fontsize=20)\n",
" plt.xlim([min(orders), max(orders)])\n",
" plt.grid()\n",
"\n",
" #plt.xscale('log')\n",
" plt.ticklabel_format(style='plain', axis='x')\n",
" plt.legend(loc='upper left', fontsize=14)\n",
" plt.xlabel('Number of rows', fontsize=16)\n",
" plt.ylabel('time in seconds', fontsize=16)\n",
" \n",
" plt.tight_layout()\n",
" plt.show()\n",
" \n",
"plot_1()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAAA1cAAAGpCAYAAABhxcywAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xd4VFX+x/H3SSihJDSpkhCKVEEIRYUICazYUNdVWYpS\nbKyKLq4FAV2IKOiKLBZsiwKKKLpi/+EqQghBpYsg0oQQJRQFpHfO748zGVMmPZlJ+byeJ0+45965\n99yZM2S+c875HmOtRURERERERAomKNAVEBERERERKQ0UXImIiIiIiBQCBVciIiIiIiKFQMGViIiI\niIhIIVBwJSIiIiIiUggUXImIiIiIiBQCBVciIqWEMWaGMeasMSYi0HXxF2NMvDHmbKDrISIiAgqu\nRESy5QlW0v6cNsb8aoz5yhjTP9D186FYLV5ojAk3xjxpjFlpjNlvjDlpjNltjPnSGHOvMSasEC5T\nrO65KBhjOhtj3jLGbDfGHDfGHDDGbDHGfGKMedAYUznQdRQRESgX6AqIiJQAFojz/Ls80Aq4Fog1\nxnSy1t4fsJplZgJdgVTGmNuAF4AKwHfAW8B+oCbQDZgCPArUDlQdSwJjzE3ATFw7XAC8DxwDIoFo\n4EpP2dYAVVFERDwUXImI5IK19rG028aYnsCXwAhjzHPW2u2BqVnxZIwZCLwK7ANuttbO83HMhcBU\nf9etJPH0SE0FzgCXWWsX+jjmImCvv+smIiKZaVigiEg+WGsXABtxPUWdAIwxfzbGzDLGbDLGHPb8\nrDDG3GOMydSjlGaOVCNjzDBjzFpjzDFjzC5jzCtZDZkzxvzJGLPYGHPEGLPXGPOBMaZlVnU1xgwx\nxrxvjNlqjDnqGVKW6AmAfB3fxBjzqmfY2VHPNb43xrxkjKmZ03NjjAkFnsP1tPTzFVh5nsOlQFcf\nj+9ljPncGLPPMwRuozFmYm6HEHru96wxZnAW+88aYxZmKBvnKe9hjOnved2OGGNSjDHPGGMqeI77\nkzFmkTHmoGeY45u+nhNjTJIxZpsxprIx5mljTLLnXjYbYx7KzX14nA+EAut8BVYA1tpvrbUH0lw7\n0nMv07O4/0zz1IwxMZ7HjDXGdPI8/7977vF9Y0y457imxpg5nqGxR40xC40x7Xxco64xZpLntTvs\nOc8GY8x0Y0zjPNy/iEiJop4rEZH8Sw2YUuf8TMT1MHwD7ACqAb2AZ4HOwKAszvM00Bv4GPgc6Anc\nDjTzPP6PCxpzAzAHOO75vRO4BPga+D6L878IrAPiPcefgxtK9qYxpoW19p9pzl8fWI77QP8Z8B4Q\nAjQBbgKex/VGZecGoAbwjbV2fnYHWmtPZri/YcBLwCHPtfcAscBI4GpjTLe0gUQOspuLldW+e4Ar\ngA+AhcBlwH1AbWPMx8As4FPgZdzQxoFALdzzmfH85YEvgPq45/I0cB3wpDEmJGNvaBZ+8/xuYIyp\nbK09movHpK1DXvd1xj3X8biex3a4Orc1xlwHJAA/ADNwwxL/AnxpjGlirT0C3t62Jbg28wXwEe69\nEglcg3tdt+XhPkRESg5rrX70ox/96CeLH+AscMZH+Z88+04D4Z6yxj6OM7gPomeBLhn2pZYnAQ3T\nlAcDizz7Oqcpr4ob/nUCiMpwrsmpdQUiMuzzVa/ywHzgJNAgTfk9nvPc4+MxlYCQXDxnr3nO8Vge\nn+tGnnv7HWieYd9UzzlfyVAen/H1AYZ4jh2UzWu6IEPZOE/5fqBFmvIKuMD0DC6ovCTDa/uF53EX\nZDhfkqf8U6BimvLanmvsB8rl8nlZ6jnXauAuoD1QIZvjIz3Hv57Ffl/PWYznMWeB/hn2TfOU/w6M\nyrDvEc++e9OUXe0pe8bHtcsBVQv6vtSPfvSjn+L6o2GBIiI5M57hUuOMMU8YY/6L62GywBRr7c8A\n1tpM38Zbay1uiBy43ilfHrPW/pLmMWeA1CFdndMcdy
2uR2i2tXZVhnOMAw76OnkW9TqF69EqR/re\nsdQejeM+HnPMWpup3If6nt+/ZHtUZjfhgr4XrLWbMuwbAxwGbkodoldEnrPWbkzdsK5nbQ4ukPrY\nWrs4zT6L68kC18OTkcUFHSfSPOZXXA9lNaB5Lut0Ay4gugCXIGQVcNgYs9QY85BnGGZhWWytfTtD\n2UzP773Akxn2veH5fYGPc/lqQ6ettYcLVkURkeJLwwJFRHJnrOe3xfU6LAJes9bOTj3AGFMLeBA3\nRKwJkDE99rlZnHuFj7LUwKRGmrIoz+9FGQ+21h40xnwHdM+4z7h1r0bigqhwXA9UWg3S/PtjYAIw\n1RhzGa5nJtFauz6Luhem1PtbkHGHtfZ3Y8xq3BDIlmQ9BLKgfL0WOz2/V/rYl+L53dDHvgPWWl8Z\n/H72/K7hY18mnuC9p2de3aVAR6ALLvDuDNxljImx1ibl5nw5yO7+v/MElGn5uv943LDYh40xUcA8\nINHzeK1JJiKlmoIrEZGcWWttcHYHGGOq4+YqReKGcc3ADSM7jfsQ/XegYhYP/91H2WnP77TXreb5\nvTuL8+zyUa8mwDKgOm6+zOfAAdwwt8bA4LT1stYmG2O64HrCLsfNqcEY8zMwyVr7fBbXTiv1w7iv\ngCM7qfe3M4v9OzMcVxR8zec6nYt95X3s8/W6pn1Mtm0qI2vtBmBD6rYxpgXwOnAx8G/c3KiCytP9\nW2tPG5erpXyaskPGZTCMw82xusyz6zdjzIvA49ba0xnPJSJSGii4EhEpHLfhAqtxNnPa9otxwVVB\npX64rZvF/no+yv6BW1dqiLX2jbQ7jFsEOVNGPc+H+H7GmGDccK8/4eZiPWuMOWKtfT2Hei4GhuJ6\nyv6Zw7Fppd5ffeBHH/vrZzguK6m9I5n+xnmC4FLBWrvRGHMzsAWX9CNVlvfvUeTPgbV2B+49gTGm\nNS5Jy9249hBE3tqFiEiJoTlXIiKFo5nn9/s+9vUopGukDkuLybjDGFMNl+gg47CtZp6yPNfLWnvG\nWrvKWvsvoL+n+Npc1PO/uF67i40xvbI7MMP8qdR5ZDE+jquOu79j+A680trv+R3hY1+nHB5b0qTO\nX0qb6j/1/sMzHuxJZ5/buV6Fwlq73lr7Am5II+SuDYmIlEgKrkRECkdq0oi0PQgYYzoAowrpGh/h\nPjgPMMZ0zLBvHOBrHahtuA/eGet1GZ6ehQzlUZ5ALaPUXrEcU4F7Ehbc69mcY4zxmcjD06O3NE3R\nLOAUcI8xpmmGw8fj0sPP8iTjyM5yXO/NAGOMd36ZZz2qf+VU/+LEs2bVvb7W+DJuPN4Yz2ZCarm1\n9hBu+GC0MaZVmuODcVklQ4q4zq2NMb56V3PdhkRESioNCxQRKRxv4JJZTDHGxOKGap0HXIXrNepX\n0AtYa48YY+7AZa9bbIyZg5tnFQ20wX3AzpjQ4kXcEL33PFkOd+IWpr0MeBf4a4bjBwF3GGMSga24\nYK4pLr32cWBKLus62xPYvAB87km28Y3nfLVw84TaAb+mecx2Y8wIXNr1VcaYd3HrPPUALsL1WI30\ncbl0CzRba3cZY94Cbga+M8b8Hy7wvAKXDKR9bu6hmKiOe87/ZYxZgltj6hBQBzfUrjFuDt79GR73\nNC4l/hLP634cF2AHA2vwnd2vsPQGnjbGfA1sxq1V1hDXY3XGUzcRkVJJwZWISCGw1u40xlyCS1Ud\njQtefgTuBL7Cd3BlyX6hV1/Xed8Yczkue2Ff3IfmBFzwMQqXTS/t8Ws9wd7juECvHPAdLvnBATIH\nV7Nxazt1xWWlq4TLXDgbt25RrrMGWmtfM8b8DxiOGxI2AKiCC7DWASNwCRnSPuYlY8wW4AHgelzG\nxWRcj9MEa23GdPNZPYe344KO/ri1obbjFnOe5OOesztPTvuyktMCvrk933rca9Ub9xr3xc2hO4IL\n4N8CnrXW7k13AW
une3q2/oELmPfhej7H4IL9vN5PXnyOG5LYHZfQIgyXVfB/wGRr7bdFeG0RkYAy\nmbOqFvEFjXkd9wd+j7W2rY/9
"text": [
"<matplotlib.figure.Figure at 0x109989550>"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like that the benefit of calculating the sums separately for each column becomes even larger the more rows the DataFrame has."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another question to ask: How does this scale if we have a growing number of columns?"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import timeit\n",
"import random\n",
"from numpy import einsum\n",
"import pandas as pd\n",
"\n",
"def run_loc_sum(df, n):\n",
" return df.loc[:, 0:n-1].sum(axis=0)\n",
"\n",
"def run_einsum(df, n):\n",
" return [einsum('i->', df[col].values) for col in range(0,n-1)]\n",
"\n",
"orders = [10**i for i in range(2, 5)]\n",
"loc_res = []\n",
"einsum_res = []\n",
"\n",
"for n in orders:\n",
"\n",
" df = pd.DataFrame()\n",
" for col in range(n):\n",
" df[col] = pd.Series(range(1000), index=range(1000))\n",
" \n",
" print('n=%s (%s of %s)' %(n, orders.index(n)+1, len(orders)))\n",
"\n",
" loc_res.append(min(timeit.Timer('run_loc_sum(df, n)' , \n",
" 'from __main__ import run_loc_sum, df, n').repeat(repeat=5, number=1)))\n",
"\n",
" einsum_res.append(min(timeit.Timer('run_einsum(df, n)' , \n",
" 'from __main__ import run_einsum, df, n').repeat(repeat=5, number=1)))\n",
"\n",
"print('finished')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"n=100 (1 of 3)\n",
"n=1000 (2 of 3)"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"n=10000 (3 of 3)"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"finished"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 35
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from matplotlib import pyplot as plt\n",
"\n",
"def plot_2():\n",
" \n",
" fig = plt.figure(figsize=(12,6))\n",
" \n",
" plt.plot(orders, loc_res, \n",
" label=\"df.loc[:, 0:n-1].sum(axis=0)\", \n",
" lw=2, alpha=0.6)\n",
" plt.plot(orders,einsum_res, \n",
" label=\"[einsum('i->', df[col].values) for col in range(0,n-1)]\", \n",
" lw=2, alpha=0.6)\n",
"\n",
" plt.title('Pandas Column Sums', fontsize=20)\n",
" plt.xlim([min(orders), max(orders)])\n",
" plt.grid()\n",
"\n",
" #plt.xscale('log')\n",
" plt.ticklabel_format(style='plain', axis='x')\n",
" plt.legend(loc='upper left', fontsize=14)\n",
" plt.xlabel('Number of columns', fontsize=16)\n",
" plt.ylabel('time in seconds', fontsize=16)\n",
" \n",
" plt.tight_layout()\n",
" plt.show()\n",
" \n",
"plot_2()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAAA1cAAAGpCAYAAABhxcywAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xd8VUX+//HXJIQWQhHpBEJTirTQEZCyYC8sioqCwYar\ngq4rILoKiIKF9YdtXRtlFRYXRWX1qyCGEEBUOiIgQaqEKk16m98fc+9NDzch996U9/PxyCOcOefM\nmXPvhNxP5jNzjLUWERERERERuTBhoW6AiIiIiIhIYaDgSkREREREJA8ouBIREREREckDCq5ERERE\nRETygIIrERERERGRPKDgSkREREREJA8ouBIRKSKMMZONMeeMMbVC3ZZgMcYkGGPOhbodIiJSNCi4\nEhG5AJ5gJfXXGWPMXmPMt8aY20Pdvkzkq4cbGmOijTEvGGOWGWMOGGNOGWN2G2O+McYMMcaUzYPL\n5Kt7DgRjTBtjzFRjzFZjzAljzCFjzEZjzP+MMUONMaVD3UYRkaKgWKgbICJSCFhgtOffEUAj4Eag\nmzGmtbX2byFrWUYm1A3wMsbcC7wBFAdWAlOBA8BFwOXABOBpoFKo2lgQGGPuBKbg+mE88AlwHIgB\nOgHXeMo2haiJIiJFhoIrEZE8YK19NvW2MaY78A3wqDHmNWvt1tC0LH8yxtwBvAPsB/pba7/K5Jh2\nwJvBbltB4hmRehM4C1xprZ2XyTHtgd+D3TYRkaJIaYEiIgFgrY0HfsGNFLUGMMbcZIz50BizwRhz\nxPO11Bgz2BiTYUQp1Ryp2saYQcaYn4wxx40xu4wxb2eVMmeM+ZMxZoEx5qgx5ndjzKfGmIZZtdUY\nE2eM+cQYs8kYc8yTUrbQEwBldnxdY8w7nrSzY55rrDbGvGWMueh8r40xJgp4DTfScltmgZXnNfwB\n6JjJ+T2MMV8bY/Z7UuB+McaM8zeF0HO/54wxd2Wx/5wxZl66slGe8iuMMbd73rejxphkY8w/jDHF\nPcf9yRgz3xhz2JPm+EFmr4kxZosxZrMxprQx5mVjzDbPvSQZY4b5cx8elwFRwJrMAisAa+331tpD\nqa4d47mXSVncf4Z5asaYrp5zRhpjWnte/4Oee/zEGBPtOa6eMeYjT2rsMWPMPGNMs0yuUcUYM97z\n3h3x1LPeGDPJGFMnB/cvIpKvaORKRCRwvAGTd87PONwIw2JgB1AO6AG8CrQBBmRRz8tAL2AW8DXQ\nHbgPqO85P+WCxtwMfASc8HzfCXQGvgNWZ1H/P4E1QILn+ItxqWQfGGMutdY+k6r+asAS3Af6L4EZ\nQEmgLnAn8DpuNCo7NwMVgMXW2rnZHWitPZXu/gYBbwF/eK69B+gGDAeuN8ZcnjqQOI/s5mJltW8w\ncDXwKTAPuBL4K1DJGDML+BD4AvgXLrXxDqAi7vVMX38EMAeohnstzwC9gReMMSXTj4ZmYZ/ne3Vj\nTGlr7TE/zkndhpzua4N7rRNwI4/NcG1uaozpDSQCPwOTcWmJfwa+McbUtdYeBd9o2yJcn5kDfI77\nWYkBbsC9r5tzcB8iIvmHtVZf+tKXvvSVyy/gHHA2k/I/efadAaI9ZXUyOc7gPoieA9qm2+ct3wLU\nTFUeDsz37GuTqrwMLv3rJBCbrq5XvG0FaqXbl1m7IoC5wCmgeqrywZ56BmdyTimgpB+v2fueOp7N\n4Wtd23NvB4FL0u1701Pn2+nKE9K/P0Cc59gB2byn8enKRnnKDwCXpiovjgtMz+KCys7p3ts5nvOa\np6tvi6f8C6BEqvJKnmscAIr5+br84KlrBfAg0AIons3xMZ7jJ2axP7PXrKvnnHPA7en2vecpPwiM\nSLfv7559Q1KVXe8p+0cm1y4GlLnQn0t96Utf+grVl9ICRUQunPGkS40yxjxvjPkYN8JkgQnW2u0A\n1toMf4231lpcihy40anMPGut/S3VOWcBb0pXm1TH3Y
gbEZpmrV2ero5RwOHMKs+iXadxI1rFSDs6\n5h3ROJHJOcettRnKM1HN8/23bI/K6E5c0PeGtXZDun1PAUeAO70pegHymrX2F++GdSNrH+ECqVnW\n2gWp9lncSBa4EZ70LC7oOJnqnL24EcpywCV+tulmXEDUHLdAyHLgiDHmB2PMME8aZl5ZYK39T7qy\nKZ7vvwMvpNv3b8/35pnUlVkfOmOtPXJhTRQRCR2lBYqI5I2Rnu8WN+owH3jfWjvNe4AxpiIwFJci\nVhdIvzx2jSzqXppJmTcwqZCqLNbzfX76g621h40xK4Eu6fcZ99yr4bggKho3ApVa9VT/ngWMBd40\nxlyJG5lZaK1dm0Xb85L3/uLT77DWHjTGrMClQDYk6xTIC5XZe7HT831ZJvuSPd9rZrLvkLU2sxX8\ntnu+V8hkXwae4L27Z15dT6AV0BYXeLcBHjTGdLXWbvGnvvPI7v5XegLK1DK7/wRcWuwTxphY4Ctg\noed8PZNMRAo0BVciIhfOWmvDszvAGFMeN1cpBpfGNRmXRnYG9yH6EaBEFqcfzKTsjOd76uuW83zf\nnUU9uzJpV13gR6A8br7M18AhXJpbHeCu1O2y1m4zxrTFjYRdhZtTgzFmOzDeWvt6FtdOzfthPLOA\nIzve+9uZxf6d6Y4LhMzmc53xY19EJvsye19Tn5Ntn0rPWrseWO/dNsZcCkwEOgD/Dzc36kLl6P6t\ntWeMW6slIlXZH8atYDgaN8fqSs+ufcaYfwLPWWvPpK9LRKQgUHAlIhIc9+ICq1E247LtHXDB1YXy\nfritksX+qpmUPYZ7rlSctfbfqXcY9xDkDCvqeT7E32aMCcele/0JNxfrVWPMUWvtxPO0cwEwEDdS\n9sx5jk3Ne3/VgHWZ7K+W7riseEdHMvwO9ATBhYK19hdjTH9gI27RD68s798j4K+BtXYH7mcCY0xj\n3CItD+H6Qxg56xciIvmG5lyJiARHfc/3TzLZd0UeXcObltY1/Q5jTDncQgfp07bqe8py3C5r7Vlr\n7XJr7UvA7Z7iG/1o58e4UbsOxpge2R2Ybv6Udx5Z10yOK4+7v+NkHnildsDzvVYm+1qf59yCxjt/\nKfVS/977j05/sGc5e3/neuUJa+1aa+0buJRG8K8PiYjkSwquRESCw7toROoRBIwxLYEReXSNz3Ef\nnPsZY1ql2zcKyOw5UJtxH7zTt+tKPCML6cpjPYFaet5RsfMuBe5ZsGCIZ/MjY0ymC3l4RvR+SFX0\nIXAaGGyMqZfu8DG45eE/9CzGkZ0luNGbfsYY3/wyz/OoXjpf+/MTzzOrhmT2jC/j8vGe8mwmesut\ntX/g0gc7GWMapTo+HLeqZMkAt7mxMSaz0VW/+5CISH6ltEARkeD4N24xiwnGmG64VK0GwLW4UaPb\nLvQC1tqjxpj7cavXLTDGfISbZ9UJaIL7gJ1+QYt/4lL0ZnhWOdyJezDtlcB/gVvTHT8AuN8YsxDY\nhAvm6uGW1z4BTPCzrdM8gc0bwNeexTYWe+qriJsn1AzYm+qcrcaYR3HLri83xvwX95ynK4D2uBGr\n4ZlcLs0Dmq21u4wxU4H+wEpjzP/hAs+rcYuBtPDnHvKJ8rjX/CVjzCLcM6b+ACrjUu3q4Obg/S3d\neS/jlsRf5HnfT+AC7HBgFZmv7pdXegEvG2O+A5JwzyqriRuxOutpm4hIgaTgSkQkCKy1O40xnXFL\nVXfCBS/rgL8A35J5cGXJ/kGvmV3nE2PMVbjVC/viPjQn4oKPEbjV9FIf/5Mn2HsOF+gVA1biFj84\nRMbgahru2U4dcavSlcKtXDgN99wiv1cNtNa+b4yZDTyMSwnrB0TiAqw1wKO4BRlSn/OWMWYj8DjQ\nB7fi4jbciNNYa2365eazeg3vwwUdt+OeDbUV9zDn8Zncc3b1nG9fVs73AF9/61uLe6964d7jvrg5\ndEdxAfxU4FVr7e
9pLmDtJM/I1mO4gHk/buTzKVywn9P7yYmvcSmJXXALWpTFrSo4G3jFWvt9AK8t\nIhJQJuOqqQG+oPulPwH317H3
"text": [
"<matplotlib.figure.Figure at 0x109334240>"
]
}
],
"prompt_number": 37
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
}
],
"metadata": {}
}
]
}