mirror of
https://github.com/rasbt/python_reference.git
synced 2024-11-30 15:31:12 +00:00
770 lines
16 KiB
Plaintext
770 lines
16 KiB
Plaintext
{
|
|
"metadata": {
|
|
"name": "",
|
|
"signature": "sha256:b2597ea4263c11dd6774b227e7a3a5626197c4863e6895002657fd55d02b55d9"
|
|
},
|
|
"nbformat": 3,
|
|
"nbformat_minor": 0,
|
|
"worksheets": [
|
|
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[[back to python_reference](https://github.com/rasbt/python_reference)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"%load_ext watermark"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"prompt_number": 1
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"%watermark -v -p numpy -d -u"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"output_type": "stream",
|
|
"stream": "stdout",
|
|
"text": [
|
|
"Last updated: 31/07/2014 \n",
|
|
"\n",
|
|
"CPython 3.4.1\n",
|
|
"IPython 2.1.0\n",
|
|
"\n",
|
|
"numpy 1.8.1\n"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 2
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<font size=\"1.5em\">[More information](https://github.com/rasbt/watermark) about the `watermark` magic command extension.</font>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 1,
|
|
"metadata": {},
|
|
"source": [
|
|
"Quick guide for dealing with missing numbers in NumPy"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This is just a quick overview of how to deal with missing values (i.e., \"NaN\"s for \"Not-a-Number\") in NumPy and I am happy to expand it over time. Yes, and there will also be a separate one for pandas some time!\n",
|
|
"\n",
|
|
"I would be happy to hear your comments and suggestions. \n",
|
|
"Please feel free to drop me a note via\n",
|
|
"[twitter](https://twitter.com/rasbt), [email](mailto:bluewoodtree@gmail.com), or [google+](https://plus.google.com/+SebastianRaschka).\n",
|
|
"<hr>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 2,
|
|
"metadata": {},
|
|
"source": [
|
|
"Sections"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"- [Sample data from a CSV file](#Sample-data-from-a-CSV-file)\n",
|
|
"- [Determining if a value is missing](#Determining-if-a-value-is-missing)\n",
|
|
"- [Counting the number of missing values](#Counting-the-number-of-missing-values)\n",
|
|
"- [Calculating the sum of an array that contains NaNs](#Calculating the sum of an array that contains NaNs)\n",
|
|
"- [Removing all rows that contain missing values](#Removing-all-rows-that-contain-missing-values)\n",
|
|
"- [Convert missing values to 0](#Convert-missing-values-to-0)\n",
|
|
"- [Converting certain numbers to NaN](#Converting-certain-numbers-to-NaN)\n",
|
|
"- [Remove all missing elements from an array](#Remove-all-missing-elements-from-an-array)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 2,
|
|
"metadata": {},
|
|
"source": [
|
|
"Sample data from a CSV file"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[[back to top](#Sections)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Let's assume that we have a CSV file with missing elements like the one shown below."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"%%file example.csv\n",
|
|
"1,2,3,4\n",
|
|
"5,6,,8\n",
|
|
"10,11,12,"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"output_type": "stream",
|
|
"stream": "stdout",
|
|
"text": [
|
|
"Writing example.csv\n"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 3
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The `np.genfromtxt` function has a `missing_values` parameters which translates missing values into `np.nan` objects by default. This allows us to construct a new NumPy `ndarray` object, even if elements are missing."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"import numpy as np\n",
|
|
"ary = np.genfromtxt('./example.csv', delimiter=',')\n",
|
|
"\n",
|
|
"print('%s x %s array:\\n' %(ary.shape[0], ary.shape[1]))\n",
|
|
"print(ary)"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"output_type": "stream",
|
|
"stream": "stdout",
|
|
"text": [
|
|
"3 x 4 array:\n",
|
|
"\n",
|
|
"[[ 1. 2. 3. 4.]\n",
|
|
" [ 5. 6. nan 8.]\n",
|
|
" [ 10. 11. 12. nan]]\n"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 4
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 2,
|
|
"metadata": {},
|
|
"source": [
|
|
"Determining if a value is missing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[[back to top](#Sections)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"A handy function to test whether a value is a `NaN` or not is to use the `np.isnan` function."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"np.isnan(np.nan)"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 5,
|
|
"text": [
|
|
"True"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 5
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"It is especially useful to create boolean masks for the so-called \"fancy indexing\" of NumPy arrays, which we will come back to later."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"np.isnan(ary)"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 6,
|
|
"text": [
|
|
"array([[False, False, False, False],\n",
|
|
" [False, False, True, False],\n",
|
|
" [False, False, False, True]], dtype=bool)"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 6
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 2,
|
|
"metadata": {},
|
|
"source": [
|
|
"Counting the number of missing values"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[[back to top](#Sections)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In order to find out how many elements are missing in our array, we can use the `np.isnan` function that we have seen in the previous section. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"np.count_nonzero(np.isnan(ary))"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 7,
|
|
"text": [
|
|
"2"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 7
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"If we want to determine the number of non-missing elements, we can simply revert the returned `Boolean` mask via the handy \"tilde\" sign."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"np.count_nonzero(~np.isnan(ary))"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 8,
|
|
"text": [
|
|
"10"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 8
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 2,
|
|
"metadata": {},
|
|
"source": [
|
|
"Calculating the sum of an array that contains `NaN`s"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[[back to top](#Sections)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As we will find out via the following code snippet, we can't use NumPy's regular `sum` function to calculate the sum of an array."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"np.sum(ary)"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 9,
|
|
"text": [
|
|
"nan"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 9
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Since the `np.sum` function does not work, use `np.nansum` instead:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"print('total sum:', np.nansum(ary))"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"output_type": "stream",
|
|
"stream": "stdout",
|
|
"text": [
|
|
"total sum: 62.0\n"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 10
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"print('column sums:', np.nansum(ary, axis=0))"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"output_type": "stream",
|
|
"stream": "stdout",
|
|
"text": [
|
|
"column sums: [ 16. 19. 15. 12.]\n"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 11
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"print('row sums:', np.nansum(ary, axis=1))"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"output_type": "stream",
|
|
"stream": "stdout",
|
|
"text": [
|
|
"row sums: [ 10. 19. 33.]\n"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 12
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 2,
|
|
"metadata": {},
|
|
"source": [
|
|
"Removing all rows that contain missing values"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[[back to top](#Sections)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Here, we will use the `Boolean mask` again to return only those rows that DON'T contain missing values. And if we want to get only the rows that contain `NaN`s, we could simply drop the `~`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"ary[~np.isnan(ary).any(1)]"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 14,
|
|
"text": [
|
|
"array([[ 1., 2., 3., 4.]])"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 14
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 2,
|
|
"metadata": {},
|
|
"source": [
|
|
"Convert missing values to 0"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[[back to top](#Sections)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Certain operations, algorithms, and other analyses might not work with `NaN` objects in our data array. But that's not a problem: We can use the convenient `np.nan_to_num` function will convert it to the value 0."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"ary0 = np.nan_to_num(ary)\n",
|
|
"ary0"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 15,
|
|
"text": [
|
|
"array([[ 1., 2., 3., 4.],\n",
|
|
" [ 5., 6., 0., 8.],\n",
|
|
" [ 10., 11., 12., 0.]])"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 15
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 2,
|
|
"metadata": {},
|
|
"source": [
|
|
"Converting certain numbers to NaN"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[[back to top](#Sections)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Vice versa, we can also convert any number to a `np.NaN` object. Here, we use the array that we created in the previous section and convert the `0`s back to `np.nan` objects."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"ary0[ary0==0] = np.nan\n",
|
|
"ary0"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 16,
|
|
"text": [
|
|
"array([[ 1., 2., 3., 4.],\n",
|
|
" [ 5., 6., nan, 8.],\n",
|
|
" [ 10., 11., 12., nan]])"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 16
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"<br>\n",
|
|
"<br>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "heading",
|
|
"level": 2,
|
|
"metadata": {},
|
|
"source": [
|
|
"Remove all missing elements from an array"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[[back to top](#Sections)]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This is one is a little bit more tricky. We can remove missing values via a combination of the `Boolean` mask and fancy indexing, however, this will have the disadvantage that it will flatten our array (we can't just punch holes into a NumPy array)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"ary[~np.isnan(ary)]"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 17,
|
|
"text": [
|
|
"array([ 1., 2., 3., 4., 5., 6., 8., 10., 11., 12.])"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 17
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Thus, this is a method that would better work on individual rows:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"collapsed": false,
|
|
"input": [
|
|
"x = np.array([1,2,np.nan])\n",
|
|
"\n",
|
|
"x[~np.isnan(np.array(x))]"
|
|
],
|
|
"language": "python",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"metadata": {},
|
|
"output_type": "pyout",
|
|
"prompt_number": 21,
|
|
"text": [
|
|
"array([ 1., 2.])"
|
|
]
|
|
}
|
|
],
|
|
"prompt_number": 21
|
|
}
|
|
],
|
|
"metadata": {}
|
|
}
|
|
]
|
|
} |