{
"metadata": {
"name": "",
"signature": "sha256:3de4720b58999a1f88844021c43acd1d6d6db6da3315538f9faac86a69424446"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"%load_ext watermark \n",
"%watermark -d -v -a 'Sebastian Raschka' -p numpy,pandas"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"The watermark extension is already loaded. To reload it, use:\n",
" %reload_ext watermark\n",
"Sebastian Raschka 24/12/2014 \n",
"\n",
"CPython 3.4.2\n",
"IPython 2.3.1\n",
"\n",
"numpy 1.9.1\n",
"pandas 0.15.2\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"4 Simple Tricks to Speed Up the Sum Calculation in Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I wanted to speed up a few sections of my code and found that some simple tweaks can make the `pandas` parts significantly faster. I thought this might be useful to share, especially since no Cython or just-in-time compilation is required."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In my case, I had a large DataFrame and wanted to calculate the sum of specific columns for different combinations of rows. There were approximately 100,000,000 such combinations, which is why I was looking for ways to speed things up. Below is a simple toy DataFrame for exploring the `.sum()` method."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"df = pd.DataFrame()\n",
"\n",
"# Fill four identical columns, each holding the integers 0 to 999.\n",
"for col in ('a', 'b', 'c', 'd'):\n",
"    df[col] = pd.Series(range(1000), index=range(1000))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df.tail()"
],
"language": "python",
"metadata": {},
"outputs": [
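{
"html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>a</th><th>b</th><th>c</th><th>d</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>995</th><td>995</td><td>995</td><td>995</td><td>995</td></tr>\n",
"    <tr><th>996</th><td>996</td><td>996</td><td>996</td><td>996</td></tr>\n",
"    <tr><th>997</th><td>997</td><td>997</td><td>997</td><td>997</td></tr>\n",
"    <tr><th>998</th><td>998</td><td>998</td><td>998</td><td>998</td></tr>\n",
"    <tr><th>999</th><td>999</td><td>999</td><td>999</td><td>999</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"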