plots

2025-02-18 22:32:09 +00:00 · 2014-12-24 11:01:30 -05:00 · 2014-12-24 11:01:30 -05:00 · 731425d794
commit 731425d794
parent 9fc9d1ee54
2 changed files with 773 additions and 9 deletions
--- a/benchmarks/pandas_sum_tricks.ipynb
+++ b/benchmarks/pandas_sum_tricks.ipynb
--- a/pandas_sum_tricks.ipynb
+++ b/pandas_sum_tricks.ipynb
@@ -0,0 +1,450 @@
+{
+ "metadata": {
+  "name": "",
+  "signature": "sha256:8222de4af96dc6569eddec8d75df6855e8bac273e12e8739fffc42aafd712ba2"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%load_ext watermark \n",
+      "%watermark -d -v -a 'Sebastian Raschka' -p numpy,pandas"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "Sebastian Raschka 23/12/2014 \n",
+        "\n",
+        "CPython 3.4.2\n",
+        "IPython 2.3.1\n",
+        "\n",
+        "numpy 1.9.1\n",
+        "pandas 0.15.2\n"
+       ]
+      }
+     ],
+     "prompt_number": 1
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<br>\n",
+      "<br>"
+     ]
+    },
+    {
+     "cell_type": "heading",
+     "level": 1,
+     "metadata": {},
+     "source": [
+      "4 Simple Tricks To Speed up the Sum Calculation in Pandas"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "I wanted to improve the performance of some passages in my code a little bit and found that some simple tweaks can speed up the `pandas` section significantly. I thought that it might be a useful thing to share -- and no Cython or just-in-time compilation is required!"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<br>\n",
+      "<br>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "In my case, I had a large DataFrame where I wanted to calculate the sum of specific columns for different combinations of rows (approx. 100,000,000 of them, which is why I was looking for ways to speed it up). Anyway, below is a simple toy DataFrame to explore the `.sum()` method a little bit."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import pandas as pd\n",
+      "import numpy as np\n",
+      "\n",
+      "df = pd.DataFrame()\n",
+      "\n",
+      "for col in ('a', 'b', 'c', 'd'):\n",
+      "    df[col] = pd.Series(range(1000), index=range(1000))"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 2
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "df.tail()"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "html": [
+        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
+        "<table border=\"1\" class=\"dataframe\">\n",
+        "  <thead>\n",
+        "    <tr style=\"text-align: right;\">\n",
+        "      <th></th>\n",
+        "      <th>a</th>\n",
+        "      <th>b</th>\n",
+        "      <th>c</th>\n",
+        "      <th>d</th>\n",
+        "    </tr>\n",
+        "  </thead>\n",
+        "  <tbody>\n",
+        "    <tr>\n",
+        "      <th>995</th>\n",
+        "      <td> 995</td>\n",
+        "      <td> 995</td>\n",
+        "      <td> 995</td>\n",
+        "      <td> 995</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>996</th>\n",
+        "      <td> 996</td>\n",
+        "      <td> 996</td>\n",
+        "      <td> 996</td>\n",
+        "      <td> 996</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>997</th>\n",
+        "      <td> 997</td>\n",
+        "      <td> 997</td>\n",
+        "      <td> 997</td>\n",
+        "      <td> 997</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>998</th>\n",
+        "      <td> 998</td>\n",
+        "      <td> 998</td>\n",
+        "      <td> 998</td>\n",
+        "      <td> 998</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>999</th>\n",
+        "      <td> 999</td>\n",
+        "      <td> 999</td>\n",
+        "      <td> 999</td>\n",
+        "      <td> 999</td>\n",
+        "    </tr>\n",
+        "  </tbody>\n",
+        "</table>\n",
+        "</div>"
+       ],
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 3,
+       "text": [
+        "       a    b    c    d\n",
+        "995  995  995  995  995\n",
+        "996  996  996  996  996\n",
+        "997  997  997  997  997\n",
+        "998  998  998  998  998\n",
+        "999  999  999  999  999"
+       ]
+      }
+     ],
+     "prompt_number": 3
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<br>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Let's assume we are interested in calculating the sums of columns `a`, `c`, and `d`, which would look like this:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "df.loc[:, ['a', 'c', 'd']].sum(axis=0)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 4,
+       "text": [
+        "a    499500\n",
+        "c    499500\n",
+        "d    499500\n",
+        "dtype: int64"
+       ]
+      }
+     ],
+     "prompt_number": 4
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Now, `.loc` indexing is probably the costliest approach for this kind of operation. Since we are only interested in the resulting numbers (i.e., the column sums), there is no need to make a copy of the array. Anyway, let's use the approach above as a reference for comparison:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# 1\n",
+      "%timeit -n 1000 -r 5 df.loc[:, ['a', 'c', 'd']].sum(axis=0)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "1000 loops, best of 5: 1.28 ms per loop\n"
+       ]
+      }
+     ],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<br>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Although this is a rather small DataFrame (1000 x 4), let's see by how much we can speed it up using a different slicing method:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# 2\n",
+      "%timeit -n 1000 -r 5 df[['a', 'c', 'd']].sum(axis=0)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "1000 loops, best of 5: 1.03 ms per loop\n"
+       ]
+      }
+     ],
+     "prompt_number": 6
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<br>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Next, let us use the NumPy representation of the `NDFrame` via the `.values` attribute:"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# 3\n",
+      "%timeit -n 1000 -r 5 df[['a', 'c', 'd']].values.sum(axis=0)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "1000 loops, best of 5: 721 \u00b5s per loop\n"
+       ]
+      }
+     ],
+     "prompt_number": 7
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<br>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "While the speed improvements in #2 and #3 were not really a surprise, the next \"trick\" surprised me a little bit. Here, we are calculating the sum of each column separately rather than slicing the array."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "[df[col].values.sum(axis=0) for col in ('a', 'c', 'd')]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 8,
+       "text": [
+        "[499500, 499500, 499500]"
+       ]
+      }
+     ],
+     "prompt_number": 8
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# 4\n",
+      "%timeit -n 1000 -r 5 [df[col].values.sum(axis=0) for col in ('a', 'c', 'd')]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "1000 loops, best of 5: 64.8 \u00b5s per loop\n"
+       ]
+      }
+     ],
+     "prompt_number": 9
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "In this case, this is a roughly 11x improvement over #3 (721 vs. 64.8 \u00b5s per loop)!"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<br>"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "One more thing: Let's try the Einstein summation convention [`einsum`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html)."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "from numpy import einsum\n",
+      "[einsum('i->', df[col].values) for col in ('a', 'c', 'd')]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 10,
+       "text": [
+        "[499500, 499500, 499500]"
+       ]
+      }
+     ],
+     "prompt_number": 10
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# 5\n",
+      "%timeit -n 1000 -r 5 [einsum('i->', df[col].values) for col in ('a', 'c', 'd')]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "1000 loops, best of 5: 57.2 \u00b5s per loop\n"
+       ]
+      }
+     ],
+     "prompt_number": 11
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<br>"
+     ]
+    },
+    {
+     "cell_type": "heading",
+     "level": 3,
+     "metadata": {},
+     "source": [
+      "Conclusion:"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Using some simple tricks, the column sum calculation improved from 1280 to 57.2 \u00b5s per loop (approx. 22x faster!)"
+     ]
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "<br>"
+     ]
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
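
For anyone who wants to rerun the comparison outside IPython, here is a minimal sketch of the five variants from the notebook as a plain script. It assumes a reasonably recent pandas and NumPy. The names `variants` and `best_time` are introduced here for illustration and do not appear in the notebook. Timings will differ by machine and library version, so no expected numbers are given.

```python
import timeit

import numpy as np
import pandas as pd

# Same toy data as in the notebook: four identical columns holding 0..999.
df = pd.DataFrame({col: np.arange(1000) for col in ("a", "b", "c", "d")})
cols = ["a", "c", "d"]

# The five variants from the notebook, keyed by their "# n" labels.
# (The dict name and the labels are illustrative, not from the notebook.)
variants = {
    "1 .loc slice": lambda: df.loc[:, cols].sum(axis=0),
    "2 [] slice": lambda: df[cols].sum(axis=0),
    "3 .values on slice": lambda: df[cols].values.sum(axis=0),
    "4 per-column .values": lambda: [df[c].values.sum(axis=0) for c in cols],
    "5 per-column einsum": lambda: [np.einsum("i->", df[c].values) for c in cols],
}

# Collect each variant's column sums as plain ints so they can be compared.
results = {name: [int(x) for x in fn()] for name, fn in variants.items()}


def best_time(fn, number=1000, repeat=5):
    """Best-of-`repeat` seconds per call, mirroring `%timeit -n 1000 -r 5`."""
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number


if __name__ == "__main__":
    for name, fn in variants.items():
        print(f"{name:<22} {best_time(fn) * 1e6:8.1f} us  sums={results[name]}")
```

Checking that all variants return the same sums before timing them guards against comparing operations that are not actually equivalent.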