diff --git a/README.md b/README.md index cfa5708..8b4e165 100644 --- a/README.md +++ b/README.md @@ -48,6 +48,8 @@ - A collection of useful regular expressions [[IPython nb](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/useful_regex.ipynb)] +- Quick guide for dealing with missing numbers in NumPy [[IPython nb](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/numpy_nan_quickguide.ipynb)] +
diff --git a/tutorials/numpy_nan_quickguide.ipynb b/tutorials/numpy_nan_quickguide.ipynb new file mode 100644 index 0000000..dfc9572 --- /dev/null +++ b/tutorials/numpy_nan_quickguide.ipynb @@ -0,0 +1,770 @@ +{ + "metadata": { + "name": "", + "signature": "sha256:7553ded8e8dc9e6faf09cd22747b33a3ae9039743491e88025fb61ea45203063" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to python_reference](https://github.com/rasbt/python_reference)]" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%load_ext watermark" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%watermark -v -p numpy -d -u" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "Last updated: 30/07/2014 \n", + "\n", + "CPython 3.4.1\n", + "IPython 2.0.0\n", + "\n", + "numpy 1.8.1\n" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[More information](https://github.com/rasbt/watermark) about the `watermark` magic command extension." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "Quick guide for dealing with missing numbers in NumPy" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is just a quick overview of how to deal with missing values (i.e., \"NaN\"s, short for \"Not-a-Number\") in NumPy, and I am happy to expand it over time. And yes, there will also be a separate one for pandas some time!\n", + "\n", + "I would be happy to hear your comments and suggestions. \n", + "Please feel free to drop me a note via\n", + "[twitter](https://twitter.com/rasbt), [email](mailto:bluewoodtree@gmail.com), or [google+](https://plus.google.com/+SebastianRaschka).\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Sections" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- [Sample data from a CSV file](#Sample-data-from-a-CSV-file)\n", + "- [Determining if a value is missing](#Determining-if-a-value-is-missing)\n", + "- [Counting the number of missing values](#Counting-the-number-of-missing-values)\n", + "- [Calculating the sum of an array that contains NaNs](#Calculating-the-sum-of-an-array-that-contains-NaNs)\n", + "- [Removing all rows that contain missing values](#Removing-all-rows-that-contain-missing-values)\n", + "- [Convert missing values to 0](#Convert-missing-values-to-0)\n", + "- [Converting certain numbers to NaN](#Converting-certain-numbers-to-NaN)\n", + "- [Remove all missing elements from an array](#Remove-all-missing-elements-from-an-array)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Sample data from a CSV file" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's assume that we have a CSV file with missing elements like the one shown below." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "
" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%file example.csv\n", + "1,2,3,4\n", + "5,6,,8\n", + "10,11,12," + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "Overwriting example.csv\n" + ] + } + ], + "prompt_number": 3 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default, the `np.genfromtxt` function translates empty fields into `np.nan` when it reads floating-point data (its `missing_values` and `filling_values` parameters control which strings count as missing and what replaces them). This allows us to construct a new NumPy `ndarray` object, even if elements are missing." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "
" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "import numpy as np\n", + "ary = np.genfromtxt('./example.csv', delimiter=',')\n", + "\n", + "print('%s x %s array:\\n' %(ary.shape[0], ary.shape[1]))\n", + "print(ary)" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "3 x 4 array:\n", + "\n", + "[[ 1. 2. 3. 4.]\n", + " [ 5. 6. nan 8.]\n", + " [ 10. 11. 12. nan]]\n" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Determining if a value is missing" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A handy way to test whether a value is `NaN` is the `np.isnan` function." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "np.isnan(np.nan)" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 37, + "text": [ + "True" + ] + } + ], + "prompt_number": 37 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Applied to a whole array, `np.isnan` works element-wise and returns a Boolean mask, which is especially useful for the so-called \"fancy indexing\" of NumPy arrays that we will come back to later." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "np.isnan(ary)" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 5, + "text": [ + "array([[False, False, False, False],\n", + " [False, False, True, False],\n", + " [False, False, False, True]], dtype=bool)" + ] + } + ], + "prompt_number": 5 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
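A brief aside on why `np.isnan` is needed at all: under IEEE 754 floating-point rules, `NaN` compares unequal to everything, including itself, so an `==` test cannot find missing values. The following is a minimal sketch; the small array `vals` is a hypothetical example and not part of the CSV data used in this guide.

```python
import numpy as np

# hypothetical sample values, for illustration only
vals = np.array([1.0, np.nan, 3.0])

# NaN is never equal to anything, not even to itself,
# so an equality test does not flag the missing entry
eq_mask = (vals == np.nan)

# np.isnan flags the missing entry correctly
nan_mask = np.isnan(vals)

print(eq_mask)
print(nan_mask)
```

Here, `eq_mask` contains only `False` values, whereas `nan_mask` is `True` exactly at the position of the missing element.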
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Counting the number of missing values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In order to find out how many elements are missing in our array, we can use the `np.isnan` function that we have seen in the previous section. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "np.count_nonzero(np.isnan(ary))" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 8, + "text": [ + "2" + ] + } + ], + "prompt_number": 8 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we want to determine the number of non-missing elements, we can simply invert the returned `Boolean` mask with the handy `~` (\"tilde\") operator." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "np.count_nonzero(~np.isnan(ary))" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 9, + "text": [ + "10" + ] + } + ], + "prompt_number": 9 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
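The counts above cover the whole array. If we want to know where the gaps are, one possible approach is to sum the Boolean mask along an axis, since `True` counts as 1. This is only a sketch: the array is written out by hand here so that the snippet runs on its own, and the names `nan_per_column` and `nan_per_row` are chosen for illustration.

```python
import numpy as np

# the same 3 x 4 array as in the CSV example, written out by hand
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

mask = np.isnan(ary)

# True counts as 1, so summing the mask counts the NaNs
nan_per_column = mask.sum(axis=0)  # one count per column
nan_per_row = mask.sum(axis=1)     # one count per row

print(nan_per_column)
print(nan_per_row)
```

For this array, the last two columns and the last two rows each contain one missing value.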
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Calculating the sum of an array that contains `NaN`s" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As the following code snippet shows, NumPy's regular `sum` function returns `nan` as soon as the array contains a single `NaN`." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "np.sum(ary)" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 10, + "text": [ + "nan" + ] + } + ], + "prompt_number": 10 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since `np.sum` propagates `NaN`s, use `np.nansum` instead, which treats them as zero:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "print('total sum:', np.nansum(ary))" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "total sum: 62.0\n" + ] + } + ], + "prompt_number": 11 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "print('column sums:', np.nansum(ary, axis=0))" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "column sums: [ 16. 19. 15. 12.]\n" + ] + } + ], + "prompt_number": 12 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "print('row sums:', np.nansum(ary, axis=1))" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "row sums: [ 10. 19. 33.]\n" + ] + } + ], + "prompt_number": 13 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
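To make the behavior of `np.nansum` a bit more tangible, the sketch below checks that it gives the same total as summing only the non-missing elements, and that `axis=0` adds up the values down each column. It also uses the related `np.nanmean` function, which is available in NumPy 1.8 and later; this is an addition for illustration, and the array is written out by hand so that the snippet runs on its own.

```python
import numpy as np

# the same 3 x 4 array as in the CSV example, written out by hand
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# np.nansum ignores the NaNs, so it matches the sum over the
# elements selected by the inverted Boolean mask
total = np.nansum(ary)
masked_total = ary[~np.isnan(ary)].sum()

# axis=0 collapses the rows, which yields one total per column
col_totals = np.nansum(ary, axis=0)

# np.nanmean averages over the non-missing elements of each column
col_means = np.nanmean(ary, axis=0)

print(total, masked_total)
print(col_totals)
print(col_means)
```

Note that the column means divide by the number of non-missing elements in each column, not by the number of rows.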
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Removing all rows that contain missing values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, we will use the `Boolean mask` again to return only those rows that DON'T contain missing values. And if we want to get only the rows that contain `NaN`s, we could simply drop the `~`." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "ary[~np.isnan(ary).any(1)]" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 14, + "text": [ + "array([[ 1., 2., 3., 4.]])" + ] + } + ], + "prompt_number": 14 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Convert missing values to 0" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Certain operations, algorithms, and other analyses might not work with `NaN` objects in our data array. But that's not a problem: the convenient `np.nan_to_num` function converts them to the value 0." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "ary0 = np.nan_to_num(ary)\n", + "ary0" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 15, + "text": [ + "array([[ 1., 2., 3., 4.],\n", + " [ 5., 6., 0., 8.],\n", + " [ 10., 11., 12., 0.]])" + ] + } + ], + "prompt_number": 15 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
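If 0 is not a suitable placeholder for our data, one possible alternative is `np.where` in combination with the Boolean mask. This is only a sketch under the assumption that a different placeholder is wanted; the name `fill_value` and the value -1.0 are hypothetical, and the array is written out by hand so that the snippet runs on its own.

```python
import numpy as np

# the same 3 x 4 array as in the CSV example, written out by hand
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# hypothetical placeholder, chosen only for this illustration
fill_value = -1.0

# np.where takes fill_value where the mask is True and the original
# element everywhere else; it returns a new array
filled = np.where(np.isnan(ary), fill_value, ary)

print(filled)
```

The original `ary` is left untouched, so its two `NaN`s are still available for later steps.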
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Converting certain numbers to NaN" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Vice versa, we can also convert any number to a `np.NaN` object. Here, we use the array that we created in the previous section and convert the `0`s back to `np.nan` objects." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "ary0[ary0==0] = np.nan\n", + "ary0" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 16, + "text": [ + "array([[ 1., 2., 3., 4.],\n", + " [ 5., 6., nan, 8.],\n", + " [ 10., 11., 12., nan]])" + ] + } + ], + "prompt_number": 16 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Remove all missing elements from an array" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This one is a little bit more tricky. We can remove missing values via a combination of the `Boolean` mask and fancy indexing; however, this has the disadvantage that it flattens our array (we can't just punch holes into a NumPy array)." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "ary[~np.isnan(ary)]" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 17, + "text": [ + "array([ 1., 2., 3., 4., 5., 6., 8., 10., 11., 12.])" + ] + } + ], + "prompt_number": 17 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Thus, this method works better on individual rows:" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "x = np.array([1,2,np.nan])\n", + "\n", + "x[~np.isnan(x)]" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 21, + "text": [ + "array([ 1., 2.])" + ] + } + ], + "prompt_number": 21 + } + ], + "metadata": {} + } + ] +} \ No newline at end of file