python_reference/pandas_sum_tricks.ipynb
2014-12-24 11:01:30 -05:00


{
"metadata": {
"name": "",
"signature": "sha256:8222de4af96dc6569eddec8d75df6855e8bac273e12e8739fffc42aafd712ba2"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"%load_ext watermark \n",
"%watermark -d -v -a 'Sebastian Raschka' -p numpy,pandas"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Sebastian Raschka 23/12/2014 \n",
"\n",
"CPython 3.4.2\n",
"IPython 2.3.1\n",
"\n",
"numpy 1.9.1\n",
"pandas 0.15.2\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"4 Simple Tricks To Speed up the Sum Calculation in Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I wanted to improve the performance of some parts of my code and found that a few simple tweaks can speed up the `pandas` sections significantly. I thought they might be useful to share -- and no Cython or just-in-time compilation is required!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In my case, I had a large DataFrame and wanted to calculate the sum of specific columns for different combinations of rows (approx. 100,000,000 of them, which is why I was looking for ways to speed it up). Below is a simple toy DataFrame to explore the `.sum()` method."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"df = pd.DataFrame()\n",
"\n",
"for col in ('a', 'b', 'c', 'd'):\n",
" df[col] = pd.Series(range(1000), index=range(1000))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.tail()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" <th>d</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>995</th>\n",
" <td> 995</td>\n",
" <td> 995</td>\n",
" <td> 995</td>\n",
" <td> 995</td>\n",
" </tr>\n",
" <tr>\n",
" <th>996</th>\n",
" <td> 996</td>\n",
" <td> 996</td>\n",
" <td> 996</td>\n",
" <td> 996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>997</th>\n",
" <td> 997</td>\n",
" <td> 997</td>\n",
" <td> 997</td>\n",
" <td> 997</td>\n",
" </tr>\n",
" <tr>\n",
" <th>998</th>\n",
" <td> 998</td>\n",
" <td> 998</td>\n",
" <td> 998</td>\n",
" <td> 998</td>\n",
" </tr>\n",
" <tr>\n",
" <th>999</th>\n",
" <td> 999</td>\n",
" <td> 999</td>\n",
" <td> 999</td>\n",
" <td> 999</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
" a b c d\n",
"995 995 995 995 995\n",
"996 996 996 996 996\n",
"997 997 997 997 997\n",
"998 998 998 998 998\n",
"999 999 999 999 999"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's assume we are interested in calculating the sums of columns `a`, `c`, and `d`, which would look like this:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.loc[:, ['a', 'c', 'd']].sum(axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"a 499500\n",
"c 499500\n",
"d 499500\n",
"dtype: int64"
]
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, the `.loc` indexer is probably the costliest option for this kind of operation. Since we are only interested in the resulting numbers (i.e., the column sums), there is no need to make a copy of the array. Anyway, let's use the approach above as a reference for comparison:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 1\n",
"%timeit -n 1000 -r 5 df.loc[:, ['a', 'c', 'd']].sum(axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 1.28 ms per loop\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although this is a rather small DataFrame (1000 x 4), let's see by how much we can speed it up using a different slicing method:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 2\n",
"%timeit -n 1000 -r 5 df[['a', 'c', 'd']].sum(axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 1.03 ms per loop\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let us use the NumPy representation of the `NDFrame` via the `.values` attribute:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 3\n",
"%timeit -n 1000 -r 5 df[['a', 'c', 'd']].values.sum(axis=0)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 721 \u00b5s per loop\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the speed improvements in #2 and #3 were not really a surprise, the next \"trick\" surprised me a little bit. Here, we are calculating the sum of each column separately rather than slicing the array."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"[df[col].values.sum(axis=0) for col in ('a', 'c', 'd')]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"[499500, 499500, 499500]"
]
}
],
"prompt_number": 8
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 4\n",
"%timeit -n 1000 -r 5 [df[col].values.sum(axis=0) for col in ('a', 'c', 'd')]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 64.8 \u00b5s per loop\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, this is a roughly 11x improvement over #3 (721 vs. 64.8 \u00b5s per loop)!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One more thing: let's try NumPy's Einstein summation function, [`einsum`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from numpy import einsum\n",
"[einsum('i->', df[col].values) for col in ('a', 'c', 'd')]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": [
"[499500, 499500, 499500]"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# 5\n",
"%timeit -n 1000 -r 5 [einsum('i->', df[col].values) for col in ('a', 'c', 'd')]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1000 loops, best of 5: 57.2 \u00b5s per loop\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Conclusion:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using some simple tricks, the column sum calculation improved from 1280 to 57.2 \u00b5s per loop (approx. 22x faster!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
}
],
"metadata": {}
}
]
}