"<font size=\"1.5em\">[More information](https://github.com/rasbt/watermark) about the `watermark` magic command extension.</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Quick guide for dealing with missing numbers in NumPy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is just a quick overview of how to deal with missing values (i.e., \"NaN\"s for \"Not-a-Number\") in NumPy and I am happy to expand it over time. Yes, and there will also be a separate one for pandas some time!\n",
"\n",
"I would be happy to hear your comments and suggestions. \n",
"Please feel free to drop me a note via\n",
"[twitter](https://twitter.com/rasbt), [email](mailto:bluewoodtree@gmail.com), or [google+](https://plus.google.com/+SebastianRaschka).\n",
"<hr>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Sections"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Sample data from a CSV file](#Sample-data-from-a-CSV-file)\n",
"- [Determining if a value is missing](#Determining-if-a-value-is-missing)\n",
"- [Counting the number of missing values](#Counting-the-number-of-missing-values)\n",
"- [Calculating the sum of an array that contains NaNs](#Calculating the sum of an array that contains NaNs)\n",
"- [Removing all rows that contain missing values](#Removing-all-rows-that-contain-missing-values)\n",
"- [Convert missing values to 0](#Convert-missing-values-to-0)\n",
"- [Converting certain numbers to NaN](#Converting-certain-numbers-to-NaN)\n",
"- [Remove all missing elements from an array](#Remove-all-missing-elements-from-an-array)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Sample data from a CSV file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's assume that we have a CSV file with missing elements like the one shown below."
"The `np.genfromtxt` function has a `missing_values` parameters which translates missing values into `np.nan` objects by default. This allows us to construct a new NumPy `ndarray` object, even if elements are missing."
"Here, we will use the `Boolean mask` again to return only those rows that DON'T contain missing values. And if we want to get only the rows that contain `NaN`s, we could simply drop the `~`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ary[~np.isnan(ary).any(1)]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"array([[ 1., 2., 3., 4.]])"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Convert missing values to 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Certain operations, algorithms, and other analyses might not work with `NaN` objects in our data array. But that's not a problem: We can use the convenient `np.nan_to_num` function will convert it to the value 0."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ary0 = np.nan_to_num(ary)\n",
"ary0"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": [
"array([[ 1., 2., 3., 4.],\n",
" [ 5., 6., 0., 8.],\n",
" [ 10., 11., 12., 0.]])"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Converting certain numbers to NaN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Vice versa, we can also convert any number to a `np.NaN` object. Here, we use the array that we created in the previous section and convert the `0`s back to `np.nan` objects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ary0[ary0==0] = np.nan\n",
"ary0"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": [
"array([[ 1., 2., 3., 4.],\n",
" [ 5., 6., nan, 8.],\n",
" [ 10., 11., 12., nan]])"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Remove all missing elements from an array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top](#Sections)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is one is a little bit more tricky. We can remove missing values via a combination of the `Boolean` mask and fancy indexing, however, this will have the disadvantage that it will flatten our array (we can't just punch holes into a NumPy array)."