python_reference/tutorials/python_data_entry_point.ipynb

{
"metadata": {
"name": "",
"signature": "sha256:7417613f49b14e98fba46fa1e285f4e3d46728b4798e853cfb103caef077b452"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Sebastian Raschka](http://sebastianraschka.com) \n",
"\n",
"- [Open in IPython nbviewer](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/python_data_entry_point.ipynb?create=1) \n",
"\n",
"- [Link to this IPython notebook on Github](https://github.com/rasbt/python_reference/blob/master/tutorials/python_data_entry_point.ipynb) \n",
"\n",
"- [Link to the pattern_classification repository's tutorials](http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/python_howtos/)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%load_ext watermark"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%watermark -a 'Sebastian Raschka' -v -d -p numpy,scipy,matplotlib,scikit-learn"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Sebastian Raschka 04/07/2014 \n",
"\n",
"CPython 3.4.1\n",
"IPython 2.1.0\n",
"\n",
"numpy 1.8.1\n",
"scipy 0.14.0\n",
"matplotlib 1.3.1\n",
"scikit-learn 0.15.0b1\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<font size=\"1.5em\">[More information](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/ipython_magic/watermark.ipynb) about the `watermark` magic command extension.</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"I would be happy to hear your comments and suggestions. \n",
"Please feel free to drop me a note via\n",
"[twitter](https://twitter.com/rasbt), [email](mailto:bluewoodtree@gmail.com), or [google+](https://plus.google.com/+SebastianRaschka).\n",
"<hr>"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Entry point: Data "
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"- Using Python's sci-packages to prepare data for Machine Learning tasks and other data analyses"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this short tutorial I want to provide a short overview of some of my favorite Python tools for common procedures as entry points for general pattern classification and machine learning tasks, and various other data analyses. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Sections"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- [Installing Python packages](#Installing-Python-packages)\n",
"\n",
"- [About the dataset](#About-the-dataset)\n",
"\n",
"- [Downloading and saving CSV data files from the web](#Downloading-and-saving-CSV-data-files-from-the-web)\n",
"\n",
"- [Reading in a dataset from a CSV file](#Reading-in-a-dataset-from-a-CSV-file)\n",
"\n",
"- [Visualization of a dataset](#Visualization-of-a-dataset)\n",
"\n",
" - [Histograms](#Histograms)\n",
"\n",
" - [Scatterplots](#Scatterplots)\n",
"\n",
"- [Splitting into training and test dataset](#Splitting-into-training-and-test-dataset)\n",
"\n",
"- [Feature Scaling](#Feature-Scaling)\n",
"\n",
"- [Linear Transformation: Principal Component Analysis (PCA)](#PCA)\n",
"\n",
"- [Linear Transformation: Linear Discriminant Analysis (LDA)](#MDA)\n",
"\n",
"- [Simple Supervised Classification](#Simple-Supervised-Classification)\n",
"\n",
" - [Linear Discriminant Analysis as simple linear classifier](#Linear-Discriminant-Analysis-as-simple-linear-classifier)\n",
" \n",
" - [Classification via Stochastic Gradient Descent (SGD)](#SGD)\n",
"\n",
"- [Saving the processed datasets](#Saving-the-processed-datasets)\n",
"\n",
" - [Pickle](#Pickle)\n",
"\n",
" - [Comma Separated Values (CSV)](#Comma-Separated-Values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Installing Python packages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**In this section, I want to recommend a way of installing the required Python packages if you have not done so yet. Otherwise, you can skip this part.**\n",
"\n",
"The packages we will be using in this tutorial are:\n",
"\n",
"- [NumPy](http://www.numpy.org)\n",
"- [SciPy](http://www.scipy.org)\n",
"- [matplotlib](http://matplotlib.org)\n",
"- [scikit-learn](http://scikit-learn.org/stable/)\n",
"\n",
"Although they can be installed step by step \"manually\", I highly recommend taking a look at the [Anaconda](https://store.continuum.io/cshop/anaconda/) Python distribution for scientific computing.\n",
"\n",
"Anaconda is distributed by Continuum Analytics, but it is completely free and includes more than 195 packages for science and data analysis as of today.\n",
"The installation procedure is nicely summarized here: http://docs.continuum.io/anaconda/install.html\n",
"\n",
"If this is too much, [Miniconda](http://conda.pydata.org/miniconda.html) might be right for you. Miniconda is basically just a Python distribution with the Conda package manager, which lets us install a list of Python packages into a specified `conda` environment from the shell, e.g.,\n",
"\n",
"<pre>$[bash]> conda create -n myenv python=3\n",
"$[bash]> source activate myenv\n",
"$[bash]> conda install -n myenv numpy scipy matplotlib scikit-learn</pre>\n",
"\n",
"When we start \"python\" in the current shell session now, it will use the Python distribution from the virtual environment \"myenv\" that we have just created. To detach from the virtual environment, you can simply use\n",
"<pre>$[bash]> source deactivate myenv</pre>\n",
"\n",
"**Note:** Environments will be created in ROOT_DIR/envs by default; you can use the `-p` flag instead of the `-n` flag in the conda commands above in order to specify a custom path.\n",
"\n",
"**I find this procedure very convenient, especially when you are working with different Python distributions and versions with different modules and packages installed. It is also extremely useful for testing your own modules.**"
]
},
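{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whether the right environment is active can also be checked from within Python itself. Below is a small sketch (the exact interpreter path and version numbers will of course differ from machine to machine):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import sys\n",
"import importlib\n",
"\n",
"# the interpreter path reveals which (conda) environment is active\n",
"print(sys.executable)\n",
"\n",
"# report the versions of the packages used in this tutorial\n",
"for name in ('numpy', 'scipy', 'matplotlib', 'sklearn'):\n",
"    try:\n",
"        module = importlib.import_module(name)\n",
"        print(name, module.__version__)\n",
"    except ImportError:\n",
"        print(name, 'is not installed')"
],
"language": "python",
"metadata": {},
"outputs": []
},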
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"About the dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the following tutorial, we will be working with the free \"Wine\" dataset that is deposited in the UCI Machine Learning Repository \n",
"(http://archive.ics.uci.edu/ml/datasets/Wine).\n",
"\n",
"<br>\n",
"\n",
"<font size=\"1\">\n",
"**Reference:** \n",
"Forina, M. et al, PARVUS - An Extendible Package for Data\n",
"Exploration, Classification and Correlation. Institute of Pharmaceutical\n",
"and Food Analysis and Technologies, Via Brigata Salerno, \n",
"16147 Genoa, Italy.\n",
"\n",
"Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.\n",
"\n",
"</font>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Wine dataset consists of 3 different classes, where each row corresponds to a particular wine sample.\n",
"\n",
"The class labels (1, 2, 3) are listed in the first column, and the columns 2-14 correspond to the following 13 attributes (features):\n",
"\n",
"1) Alcohol \n",
"2) Malic acid \n",
"3) Ash \n",
"4) Alcalinity of ash \n",
"5) Magnesium \n",
"6) Total phenols \n",
"7) Flavanoids \n",
"8) Nonflavanoid phenols \n",
"9) Proanthocyanins \n",
"10) Color intensity \n",
"11) Hue \n",
"12) OD280/OD315 of diluted wines \n",
"13) Proline \n",
"\n",
"An excerpt from the wine_data.csv dataset:\n",
" \n",
"<pre>1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065\n",
"1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050\n",
"[...]\n",
"2,12.37,.94,1.36,10.6,88,1.98,.57,.28,.42,1.95,1.05,1.82,520\n",
"2,12.33,1.1,2.28,16,101,2.05,1.09,.63,.41,3.27,1.25,1.67,680\n",
"[...]\n",
"3,12.86,1.35,2.32,18,122,1.51,1.25,.21,.94,4.1,.76,1.29,630\n",
"3,12.88,2.99,2.4,20,104,1.3,1.22,.24,.83,5.4,.74,1.42,530</pre>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Downloading and saving CSV data files from the web"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Usually, we have our data stored locally on disk as a common text (or CSV) file with comma-, tab-, or whitespace-separated rows. Below is an example of how you can read a CSV data file from a website directly into Python and optionally save it locally."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import csv\n",
"import urllib.request\n",
"\n",
"url = 'https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/wine_data.csv'\n",
"csv_cont = urllib.request.urlopen(url)\n",
"csv_cont = csv_cont.read() #.decode('utf-8')\n",
"\n",
"# Optional: saving the data to your local drive\n",
"with open('./wine_data.csv', 'wb') as out:\n",
" out.write(csv_cont)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If you'd rather work with the data directly as a string, you can apply the `.decode('utf-8')` method to the data, which is read in byte format by default.\n"
]
},
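{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, here is a minimal, self-contained sketch of decoding and splitting such byte data into rows (using a small 2-row byte string in place of the downloaded file contents):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# a small byte string standing in for the data returned by urlopen().read()\n",
"sample_bytes = b'1,14.23,1.71\\n1,13.2,1.78\\n'\n",
"\n",
"# decode the bytes into a regular string and split it into non-empty rows\n",
"sample_str = sample_bytes.decode('utf-8')\n",
"rows = [row for row in sample_str.split('\\n') if row]\n",
"print(rows)"
],
"language": "python",
"metadata": {},
"outputs": []
},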
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Reading in a dataset from a CSV file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since it is quite typical to have the input data stored locally, as mentioned above, we will now use the [`numpy.loadtxt`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html) function to read in the data from the CSV file. \n",
"(Alternatively, [`np.genfromtxt()`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html) could be used in a similar way; it provides some additional options.)"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"# reading in all data into a NumPy array\n",
"all_data = np.loadtxt(open(\"./wine_data.csv\",\"r\"),\n",
" delimiter=\",\", \n",
" skiprows=0, \n",
" dtype=np.float64\n",
" )\n",
"\n",
"# load class labels from column 1\n",
"y_wine = all_data[:,0]\n",
"\n",
"# conversion of the class labels to integer-type array\n",
"y_wine = y_wine.astype(np.int64, copy=False)\n",
"\n",
"# load the 13 features\n",
"X_wine = all_data[:,1:]\n",
"\n",
"# printing some general information about the data\n",
"print('\\ntotal number of samples (rows):', X_wine.shape[0])\n",
"print('total number of features (columns):', X_wine.shape[1])\n",
"\n",
"# printing the 1st wine sample\n",
"float_formatter = lambda x: '{:.2f}'.format(x)\n",
"np.set_printoptions(formatter={'float_kind':float_formatter})\n",
"print('\\n1st sample (i.e., 1st row):\\nClass label: {:d}\\n{:}\\n'\n",
" .format(int(y_wine[0]), X_wine[0]))\n",
"\n",
"# printing the rel.frequency of the class labels\n",
"print('Class label frequencies')\n",
"print('Class 1 samples: {:.2%}'.format(list(y_wine).count(1)/y_wine.shape[0]))\n",
"print('Class 2 samples: {:.2%}'.format(list(y_wine).count(2)/y_wine.shape[0]))\n",
"print('Class 3 samples: {:.2%}'.format(list(y_wine).count(3)/y_wine.shape[0]))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"total number of samples (rows): 178\n",
"total number of features (columns): 13\n",
"\n",
"1st sample (i.e., 1st row):\n",
"Class label: 1\n",
"[14.23 1.71 2.43 15.60 127.00 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.00]\n",
"\n",
"Class label frequencies\n",
"Class 1 samples: 33.15%\n",
"Class 2 samples: 39.89%\n",
"Class 3 samples: 26.97%\n"
]
}
],
"prompt_number": 3
},
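{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, the alternative [`np.genfromtxt()`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html) mentioned above can be used in much the same way. Below is a minimal sketch that reads a 2-row in-memory sample (standing in for the CSV file) and separates the labels from the features:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"from io import StringIO\n",
"\n",
"# two sample rows standing in for the contents of wine_data.csv\n",
"sample = StringIO('1,14.23,1.71\\n2,13.2,1.78\\n')\n",
"\n",
"# genfromtxt works like loadtxt here, but also handles missing values\n",
"data = np.genfromtxt(sample, delimiter=',')\n",
"\n",
"y_sample = data[:, 0].astype(np.int64)  # class labels from column 1\n",
"X_sample = data[:, 1:]                  # features from the remaining columns\n",
"print(y_sample, X_sample.shape)"
],
"language": "python",
"metadata": {},
"outputs": []
},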
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Visualization of a dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are endless ways to visualize a dataset in order to get an initial idea of what the data looks like. The most common ones are probably histograms and scatter plots."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Histograms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Histograms are a useful tool for exploring the distribution of each feature across the different classes. This can provide us with intuitive insights into which features have good and not-so-good inter-class separation. Below, we will plot a sample histogram for the \"Alcohol content\" feature for the three wine classes."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%matplotlib inline"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from matplotlib import pyplot as plt\n",
"from math import floor, ceil # for rounding up and down\n",
"\n",
"plt.figure(figsize=(10,8))\n",
"\n",
"# bin width of the histogram in steps of 0.15\n",
"bins = np.arange(floor(min(X_wine[:,0])), ceil(max(X_wine[:,0])), 0.15)\n",
"\n",
"# get the max count for a particular bin for all classes combined\n",
"max_bin = max(np.histogram(X_wine[:,0], bins=bins)[0])\n",
"\n",
"# the order of the colors for each histogram\n",
"colors = ('blue', 'red', 'green')\n",
"\n",
"for label,color in zip(\n",
" range(1,4), colors):\n",
"\n",
" mean = np.mean(X_wine[:,0][y_wine == label]) # class sample mean\n",
" stdev = np.std(X_wine[:,0][y_wine == label]) # class standard deviation\n",
" plt.hist(X_wine[:,0][y_wine == label], \n",
" bins=bins, \n",
" alpha=0.3, # opacity level\n",
" label='class {} ($\\mu={:.2f}$, $\\sigma={:.2f}$)'.format(label, mean, stdev), \n",
" color=color)\n",
"\n",
"plt.ylim([0, max_bin*1.3])\n",
"plt.title('Wine data set - Distribution of alcohol contents')\n",
"plt.xlabel('alcohol by volume', fontsize=14)\n",
"plt.ylabel('count', fontsize=14)\n",
"plt.legend(loc='upper right')\n",
"\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"png": "[base64-encoded PNG figure data truncated in source]",
"text": [
"<matplotlib.figure.Figure at 0x1056b3b00>"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Scatterplots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scatter plots are useful for visualizing features in more than just one dimension, for example, to get a feeling for the correlation between particular features. \n",
"Unfortunately, we can't plot all 13 features here at once, since our visual perception is limited to a maximum of three dimensions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we will create an example 2D-Scatter plot from the features \"Alcohol content\" and \"Malic acid content\". \n",
"Additionally, we will use the [`scipy.stats.pearsonr`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) function to calculate a Pearson correlation coefficient between these two features.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from scipy.stats import pearsonr\n",
"\n",
"plt.figure(figsize=(10,8))\n",
"\n",
"for label,marker,color in zip(\n",
" range(1,4),('x', 'o', '^'),('blue', 'red', 'green')):\n",
"\n",
" # Calculate Pearson correlation coefficient\n",
" R = pearsonr(X_wine[:,0][y_wine == label], X_wine[:,1][y_wine == label])\n",
" plt.scatter(x=X_wine[:,0][y_wine == label], # x-axis: feat. from col. 1\n",
" y=X_wine[:,1][y_wine == label], # y-axis: feat. from col. 2\n",
" marker=marker, # data point symbol for the scatter plot\n",
" color=color,\n",
" alpha=0.7, \n",
" label='class {:}, R={:.2f}'.format(label, R[0]) # label for the legend\n",
" )\n",
" \n",
"plt.title('Wine Dataset')\n",
"plt.xlabel('alcohol by volume in percent')\n",
"plt.ylabel('malic acid in g/l')\n",
"plt.legend(loc='upper right')\n",
"\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"png": "[base64-encoded PNG figure data truncated in source]",
"text": [
"<matplotlib.figure.Figure at 0x1059cba58>"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to pack 3 different features into one scatter plot at once, we can also do the same thing in 3D:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from mpl_toolkits.mplot3d import Axes3D\n",
"\n",
"fig = plt.figure(figsize=(8,8))\n",
"ax = fig.add_subplot(111, projection='3d')\n",
" \n",
"for label,marker,color in zip(\n",
" range(1,4),('x', 'o', '^'),('blue','red','green')):\n",
" \n",
" ax.scatter(X_wine[:,0][y_wine == label], \n",
" X_wine[:,1][y_wine == label], \n",
" X_wine[:,2][y_wine == label], \n",
" marker=marker, \n",
" color=color, \n",
" s=40, \n",
" alpha=0.7,\n",
" label='class {}'.format(label))\n",
"\n",
"ax.set_xlabel('alcohol by volume in percent')\n",
"ax.set_ylabel('malic acid in g/l')\n",
"ax.set_zlabel('ash content in g/l')\n",
"\n",
"plt.title('Wine dataset')\n",
" \n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"png": "[base64-encoded PNG figure data truncated in source]",
"text": [
"<matplotlib.figure.Figure at 0x106e96b00>"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Splitting into training and test dataset "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is a typical procedure in machine learning and pattern classification tasks to split one dataset into two: a training dataset and a test dataset. \n",
"The training dataset is then used to train our algorithm or classifier, while the test dataset gives us a relatively objective way to validate the outcome before we apply the model to \"new, real world data\".\n",
"\n",
"Here, we will split the dataset randomly so that 70% of the total dataset becomes our training dataset and 30% becomes our test dataset."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.cross_validation import train_test_split\n",
"from sklearn import preprocessing\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X_wine, y_wine,\n",
" test_size=0.30, random_state=123)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that since this is a random assignment, the original relative frequencies for each class label are not necessarily maintained."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print('Class label frequencies')\n",
" \n",
"print('\\nTraining Dataset:') \n",
"for l in range(1,4):\n",
" print('Class {:} samples: {:.2%}'.format(l, list(y_train).count(l)/y_train.shape[0]))\n",
" \n",
"print('\\nTest Dataset:') \n",
"for l in range(1,4):\n",
" print('Class {:} samples: {:.2%}'.format(l, list(y_test).count(l)/y_test.shape[0]))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Class label frequencies\n",
"\n",
"Training Dataset:\n",
"Class 1 samples: 36.29%\n",
"Class 2 samples: 42.74%\n",
"Class 3 samples: 20.97%\n",
"\n",
"Test Dataset:\n",
"Class 1 samples: 25.93%\n",
"Class 2 samples: 33.33%\n",
"Class 3 samples: 40.74%\n"
]
}
],
"prompt_number": 9
},
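{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note: if we wanted to preserve the original class proportions in both datasets, we could use a stratified split instead of the purely random split above. The following cell is only a minimal sketch of this idea via `sklearn.cross_validation.StratifiedShuffleSplit`; the `*_strat` variable names are chosen just for illustration."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.cross_validation import StratifiedShuffleSplit\n",
"\n",
"# one stratified 70/30 split that (approximately) keeps the class proportions\n",
"sss = StratifiedShuffleSplit(y_wine, n_iter=1, test_size=0.30, random_state=123)\n",
"for train_idx, test_idx in sss:\n",
"    X_train_strat, X_test_strat = X_wine[train_idx], X_wine[test_idx]\n",
"    y_train_strat, y_test_strat = y_wine[train_idx], y_wine[test_idx]"
],
"language": "python",
"metadata": {},
"outputs": []
},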
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Feature Scaling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another popular procedure is to standardize the data prior to fitting the model and other analyses, so that each feature is centered and scaled to \n",
"\n",
"$\\mu = 0$ and $\\sigma = 1$\n",
"\n",
"where $\\mu$ is the mean (average) and $\\sigma$ is the standard deviation from the mean, so that the standard scores of the samples are calculated as follows:\n",
"\n",
"\\begin{equation} z = \\frac{x - \\mu}{\\sigma}\\end{equation} "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"std_scale = preprocessing.StandardScaler().fit(X_train)\n",
"X_train = std_scale.transform(X_train)\n",
"X_test = std_scale.transform(X_test)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 10
},
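{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check, every feature column of the standardized training dataset should now have a mean close to 0 and a standard deviation close to 1 (the test dataset is only approximately standardized, since it was scaled with the parameters estimated from the training dataset):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# means and standard deviations of the standardized training features\n",
"print('train means:', X_train.mean(axis=0).round(2))\n",
"print('train stds: ', X_train.std(axis=0).round(2))"
],
"language": "python",
"metadata": {},
"outputs": []
},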
{
"cell_type": "code",
"collapsed": false,
"input": [
"f, ax = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10,5))\n",
"\n",
"for a,x_dat, y_lab in zip(ax, (X_train, X_test), (y_train, y_test)):\n",
"\n",
" for label,marker,color in zip(\n",
" range(1,4),('x', 'o', '^'),('blue','red','green')):\n",
"\n",
" a.scatter(x=x_dat[:,0][y_lab == label], \n",
" y=x_dat[:,1][y_lab == label], \n",
" marker=marker, \n",
" color=color, \n",
" alpha=0.7, \n",
" label='class {}'.format(label)\n",
" )\n",
"\n",
" a.legend(loc='upper right')\n",
"\n",
"ax[0].set_title('Training Dataset')\n",
"ax[1].set_title('Test Dataset')\n",
"f.text(0.5, 0.04, 'malic acid (standardized)', ha='center', va='center')\n",
"f.text(0.08, 0.5, 'alcohol (standardized)', ha='center', va='center', rotation='vertical')\n",
"\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAA...[truncated base64 PNG data]",
"text": [
"<matplotlib.figure.Figure at 0x106fe0c50>"
]
}
],
"prompt_number": 11
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"PCA\"></a>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Linear Transformation: Principal Component Analysis (PCA)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The main purpose of a principal component analysis is to identify patterns in the data and to use those patterns to reduce the dimensionality of the dataset with minimal loss of information.\n",
"\n",
"Here, our desired outcome of the principal component analysis is to project a feature space (our dataset consisting of n d-dimensional samples) onto a smaller subspace that represents our data \"well\". A possible application would be a pattern classification task, where we want to reduce the computational costs and the error of parameter estimation by reducing the number of dimensions of our feature space, i.e., by extracting a subspace that describes our data \"best\".\n",
"\n",
"If you are interested in the Principal Component Analysis in more detail, I have outlined the procedure in a separate article, \n",
"[\"Implementing a Principal Component Analysis (PCA) in Python step by step\"](http://sebastianraschka.com/Articles/2014_pca_step_by_step.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we will use the [`sklearn.decomposition.PCA`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) class to transform our training data onto a 2-dimensional subspace:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.decomposition import PCA\n",
"sklearn_pca = PCA(n_components=2) # number of components to keep\n",
"sklearn_transf = sklearn_pca.fit_transform(X_train)\n",
"\n",
"plt.figure(figsize=(10,8))\n",
"\n",
"for label,marker,color in zip(\n",
" range(1,4),('x', 'o', '^'),('blue', 'red', 'green')):\n",
"\n",
" plt.scatter(x=sklearn_transf[:,0][y_train == label],\n",
" y=sklearn_transf[:,1][y_train == label], \n",
" marker=marker, \n",
" color=color,\n",
" alpha=0.7, \n",
" label='class {}'.format(label)\n",
" )\n",
"\n",
"plt.xlabel('vector 1')\n",
"plt.ylabel('vector 2')\n",
"\n",
"plt.legend()\n",
"plt.title('Most significant singular vectors after linear transformation via PCA')\n",
"\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAA...[truncated base64 PNG data]",
"text": [
"<matplotlib.figure.Figure at 0x106fe0780>"
]
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"PCA for feature extraction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned in the short introduction above (and in more detail in my separate [PCA article](http://sebastianraschka.com/Articles/2014_pca_step_by_step.html)), PCA is commonly used in the field of pattern classification for feature extraction (or dimensionality reduction). \n",
"By default, the transformed data will be ordered by the components with the maximum variance (in descending order). \n",
"\n",
"In the example above, I only kept the top 2 components (the 2 components with the maximum variance along the axes): the sample space was projected onto a 2-dimensional subspace, which was sufficient for plotting the data in a 2D scatter plot.\n",
"\n",
"However, if we want to use PCA for feature extraction, we probably don't want to reduce the dimensionality that drastically. By default, the `PCA` class (`PCA(n_components=None)`) keeps all the components in ranked order. So we could either set `n_components` to a number smaller than the dimensionality of the input dataset, or we could extract the top **n** components later from the returned NumPy array.\n",
"\n",
"To get an idea about how much of the variance each component (relatively) \"explains\", we can use the `explained_variance_ratio_` attribute, which also confirms that the components are ordered from most explanatory to least explanatory (the ratios sum up to 1.0)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sklearn_pca = PCA(n_components=None)\n",
"sklearn_transf = sklearn_pca.fit_transform(X_train)\n",
"sklearn_pca.explained_variance_ratio_"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 13,
2014-06-26 21:41:00 +00:00
"text": [
"array([0.36, 0.21, 0.10, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02, 0.02, 0.01,\n",
" 0.01, 0.01])"
]
}
],
"prompt_number": 13
},
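{
"cell_type": "markdown",
"metadata": {},
"source": [
"Building on that, one common rule of thumb is to look at the cumulative sum of the `explained_variance_ratio_` values and keep just enough components to retain a certain fraction of the total variance, e.g., about 90%. The following cell is only a sketch of this idea:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"cum_var = np.cumsum(sklearn_pca.explained_variance_ratio_)\n",
"n_keep = np.argmax(cum_var >= 0.9) + 1  # smallest number of components with >= 90% variance\n",
"print('cumulative explained variance:', cum_var.round(2))\n",
"print('components to keep for ~90% variance:', n_keep)"
],
"language": "python",
"metadata": {},
"outputs": []
},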
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='MDA'></a>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Linear Transformation: Linear Discriminant Analysis (LDA)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The main purpose of a Linear Discriminant Analysis (LDA) is to analyze the data to identify patterns and to project the dataset onto a subspace that yields a better separation of the classes, while also reducing the dimensionality of the dataset with minimal loss of information.\n",
"\n",
"**The approach is very similar to a Principal Component Analysis (PCA), but in addition to finding the component axes that maximize the variance of our data, we are additionally interested in the axes that maximize the separation of our classes (e.g., in a supervised pattern classification problem)**\n",
"\n",
"Here, our desired outcome of the Linear Discriminant Analysis is to project a feature space (our dataset consisting of n d-dimensional samples) onto a smaller subspace that represents our data \"well\" and has a good class separation. A possible application would be a pattern classification task, where we want to reduce the computational costs and the error of parameter estimation by reducing the number of dimensions of our feature space, i.e., by extracting a subspace that describes our data \"best\"."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Principal Component Analysis (PCA) Vs. Linear Discriminant Analysis (LDA)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both Linear Discriminant Analysis and Principal Component Analysis are linear transformation methods and closely related to each other. In PCA, we are interested in finding the directions (components) that maximize the variance in our dataset, whereas in LDA, we are additionally interested in finding the directions that maximize the separation (or discrimination) between different classes (for example, in pattern classification problems where our dataset consists of multiple classes). In contrast to LDA, PCA ignores the class labels.\n",
"\n",
"**In other words, via PCA, we are projecting the entire set of data (without class labels) onto a different subspace, and in LDA, we are trying to determine a suitable subspace to distinguish between patterns that belong to different classes. Or, roughly speaking: in PCA we are trying to find the axes with maximum variance where the data is most spread out (within a class, since PCA treats the whole dataset as one class), and in LDA we are additionally maximizing the spread between classes.**\n",
"\n",
"In typical pattern recognition problems, a PCA is often followed by an LDA."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](../Images/lda_overview.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are interested, you can find more information about the LDA in my IPython notebook \n",
"[Stepping through a Linear Discriminant Analysis - using Python's NumPy and matplotlib](http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/dimensionality_reduction/projection/minear_discriminant_analysis.ipynb?create=1)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Like we did in the PCA section above, we will use a `scikit-learn` function, [`sklearn.lda.LDA`](http://scikit-learn.org/stable/modules/generated/sklearn.lda.LDA.html), in order to transform our training data onto a 2-dimensional subspace:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.lda import LDA\n",
"sklearn_lda = LDA(n_components=2)\n",
"transf_lda = sklearn_lda.fit_transform(X_train, y_train)\n",
"\n",
"plt.figure(figsize=(10,8))\n",
"\n",
"for label,marker,color in zip(\n",
" range(1,4),('x', 'o', '^'),('blue', 'red', 'green')):\n",
"\n",
"\n",
" plt.scatter(x=transf_lda[:,0][y_train == label],\n",
" y=transf_lda[:,1][y_train == label], \n",
" marker=marker, \n",
" color=color,\n",
" alpha=0.7, \n",
" label='class {}'.format(label)\n",
" )\n",
"\n",
"plt.xlabel('vector 1')\n",
"plt.ylabel('vector 2')\n",
"\n",
"plt.legend()\n",
"plt.title('Most significant singular vectors after linear transformation via LDA')\n",
"\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "display_data",
"png": "iVBORw0KGgoAAAANSUhEUgAA...[truncated base64 PNG data]",
"text": [
"<matplotlib.figure.Figure at 0x107d32e80>"
]
}
],
"prompt_number": 14
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"LDA for feature extraction"
2014-06-26 21:41:00 +00:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to use LDA for projecting our data onto a smaller subspace (i.e., for dimensionality reduction), we can directly set the number of components to keep via `LDA(n_components=...)`; this is analogous to the [PCA function](#PCA-for-feature-extraction), which we have seen above.\n"
2014-06-26 21:41:00 +00:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Simple Supervised Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Linear Discriminant Analysis as simple linear classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The LDA that we've just used in the section above can also be used as a simple linear classifier."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# fit model\n",
"lda_clf = LDA()\n",
"lda_clf.fit(X_train, y_train)\n",
"\n",
"# prediction\n",
"print('1st sample from test dataset classified as:', lda_clf.predict(X_test[0,:]))\n",
"print('actual class label:', y_test[0])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1st sample from test dataset classified as: [3]\n",
"actual class label: 3\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another handy subpackage of sklearn is `metrics`. The [`metrics.accuracy_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), for example, is quite useful to evaluate how many samples can be classified correctly:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn import metrics\n",
"pred_train_lda = lda_clf.predict(X_train)\n",
"\n",
"print('Prediction accuracy for the training dataset')\n",
"print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train_lda)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Prediction accuracy for the training dataset\n",
"100.00%\n"
]
}
],
"prompt_number": 17
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To verify that our model was not overfitted to the training dataset, let us evaluate the classifier's accuracy on the test dataset:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pred_test_lda = lda_clf.predict(X_test)\n",
"\n",
"print('Prediction accuracy for the test dataset')\n",
"print('{:.2%}'.format(metrics.accuracy_score(y_test, pred_test_lda)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Prediction accuracy for the test dataset\n",
"98.15%\n"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Confusion Matrix** \n",
"As we can see above, the misclassification rate was very low when we applied the classifier to the test dataset. A confusion matrix can tell us in more detail which particular classes could not be classified correctly.\n",
"\n",
"<table cellspacing=\"0\" border=\"0\">\n",
"\t<colgroup width=\"60\"></colgroup>\n",
"\t<colgroup span=\"4\" width=\"82\"></colgroup>\n",
"\t<tr>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" colspan=2 rowspan=2 height=\"44\" align=\"center\" bgcolor=\"#FFFFFF\"><b><font face=\"Helvetica\" size=4><br></font></b></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" colspan=3 align=\"center\" bgcolor=\"#FFFFFF\"><b><font face=\"Helvetica\" size=4>predicted class</font></b></td>\n",
"\t\t</tr>\n",
"\t<tr>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#EEEEEE\"><font face=\"Helvetica\" size=4>class 1</font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#EEEEEE\"><font face=\"Helvetica\" size=4>class 2</font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#EEEEEE\"><font face=\"Helvetica\" size=4>class 3</font></td>\n",
"\t</tr>\n",
"\t<tr>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" rowspan=3 height=\"116\" align=\"center\" bgcolor=\"#F6F6F6\"><b><font face=\"Helvetica\" size=4>actual class</font></b></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#EEEEEE\"><font face=\"Helvetica\" size=4>class 1</font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#99FFCC\"><font face=\"Helvetica\" size=4>True positives</font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#F6F6F6\"><font face=\"Helvetica\" size=4><br></font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#F6F6F6\"><font face=\"Helvetica\" size=4><br></font></td>\n",
"\t</tr>\n",
"\t<tr>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#EEEEEE\"><font face=\"Helvetica\" size=4>class 2</font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#FFFFFF\"><font face=\"Helvetica\" size=4><br></font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#99FFCC\"><font face=\"Helvetica\" size=4>True positives</font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#FFFFFF\"><font face=\"Helvetica\" size=4><br></font></td>\n",
"\t</tr>\n",
"\t<tr>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#EEEEEE\"><font face=\"Helvetica\" size=4>class 3</font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#F6F6F6\"><font face=\"Helvetica\" size=4><br></font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#F6F6F6\"><font face=\"Helvetica\" size=4><br></font></td>\n",
"\t\t<td style=\"border-top: 1px solid #c1c1c1; border-bottom: 1px solid #c1c1c1; border-left: 1px solid #c1c1c1; border-right: 1px solid #c1c1c1\" align=\"left\" bgcolor=\"#99FFCC\"><font face=\"Helvetica\" size=4>True positives</font></td>\n",
"\t</tr>\n",
"</table>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print('Confusion Matrix of the LDA-classifier')\n",
"print(metrics.confusion_matrix(y_test, lda_clf.predict(X_test)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Confusion Matrix of the LDA-classifier\n",
"[[14 0 0]\n",
" [ 1 17 0]\n",
" [ 0 0 22]]\n"
]
}
],
"prompt_number": 19
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, one sample from class 2 was incorrectly labeled as class 1. From the perspective of class 1, this is a \"False Positive\", and from the perspective of class 2, it is a \"False Negative\"."
]
},
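{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The per-class precision, recall, and F1-score implied by this confusion matrix can also be printed directly; a minimal sketch, assuming `metrics`, `y_test`, and `lda_clf` from the cells above:"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "# Precision, recall, and F1-score per class, derived from the same predictions\n",
  "print(metrics.classification_report(y_test, lda_clf.predict(X_test)))"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},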
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id='SGD'></a>"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Classification via Stochastic Gradient Descent (SGD)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us now compare the classification accuracy of the LDA classifier to that of a simple classifier trained via stochastic gradient descent, an algorithm that minimizes a linear objective function (here, too, we use the default settings, which are probably not ideal). \n",
"More information about the `sklearn.linear_model.SGDClassifier` can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.linear_model import SGDClassifier\n",
"\n",
"sgd_clf = SGDClassifier()\n",
"sgd_clf.fit(X_train, y_train)\n",
"\n",
"pred_train_sgd = sgd_clf.predict(X_train)\n",
"pred_test_sgd = sgd_clf.predict(X_test)\n",
"\n",
"print('\\nPrediction accuracy for the training dataset')\n",
"print('{:.2%}\\n'.format(metrics.accuracy_score(y_train, pred_train_sgd)))\n",
"\n",
"print('Prediction accuracy for the test dataset')\n",
"print('{:.2%}\\n'.format(metrics.accuracy_score(y_test, pred_test_sgd)))\n",
"\n",
"print('Confusion Matrix of the SGD-classifier')\n",
"print(metrics.confusion_matrix(y_test, sgd_clf.predict(X_test)))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"Prediction accuracy for the training dataset\n",
"99.19%\n",
"\n",
"Prediction accuracy for the test dataset\n",
"100.00%\n",
"\n",
"Confusion Matrix of the SGD-classifier\n",
"[[14 0 0]\n",
" [ 0 18 0]\n",
" [ 0 0 22]]\n"
]
}
],
"prompt_number": 22
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Quite impressively, we achieved 100% prediction accuracy on the test dataset without any additional effort of tweaking parameters or settings."
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Decision Regions"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sgd_clf2 = SGDClassifier()\n",
"sgd_clf2.fit(X_train[:, :2], y_train)\n",
"\n",
"x_min = X_test[:, 0].min() \n",
"x_max = X_test[:, 0].max() \n",
"y_min = X_test[:, 1].min() \n",
"y_max = X_test[:, 1].max() \n",
"\n",
"step = 0.01\n",
"X, Y = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))\n",
"\n",
"Z = sgd_clf2.predict(np.c_[X.ravel(), Y.ravel()])\n",
"Z = Z.reshape(X.shape)\n",
"\n",
"# Plots decision regions\n",
"plt.contourf(X, Y, Z)\n",
"\n",
"\n",
"# Plots samples from training data set\n",
"plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Saving the processed datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Pickle"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The built-in [`pickle`](https://docs.python.org/3.4/library/pickle.html) module is a convenient tool in Python's standard library for saving Python objects in byte format. This allows us, for example, to save our NumPy arrays and classifiers so that we can load them in a later or different Python session to continue working with our data, e.g., to train a classifier."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# export objects via pickle\n",
"\n",
"import pickle\n",
"\n",
"pickle_out = open('standardized_data.pkl', 'wb')\n",
"pickle.dump([X_train, X_test, y_train, y_test], pickle_out)\n",
"pickle_out.close()\n",
"\n",
"pickle_out = open('classifiers.pkl', 'wb')\n",
"pickle.dump([lda_clf, sgd_clf], pickle_out)\n",
"pickle_out.close()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
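{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a side note, the same export can be written with context managers, which close the files automatically even if an error occurs; a sketch equivalent to the cell above:"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "# Equivalent export using 'with' blocks, which close the files automatically\n",
  "with open('standardized_data.pkl', 'wb') as pickle_out:\n",
  "    pickle.dump([X_train, X_test, y_train, y_test], pickle_out)\n",
  "\n",
  "with open('classifiers.pkl', 'wb') as pickle_out:\n",
  "    pickle.dump([lda_clf, sgd_clf], pickle_out)"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
},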
{
"cell_type": "code",
"collapsed": false,
"input": [
"# import objects via pickle\n",
"\n",
"my_object_file = open('standardized_data.pkl', 'rb')\n",
"X_train, X_test, y_train, y_test = pickle.load(my_object_file)\n",
"my_object_file.close()\n",
"\n",
"my_object_file = open('classifiers.pkl', 'rb')\n",
"lda_clf, sgd_clf = pickle.load(my_object_file)\n",
"my_object_file.close()\n",
"\n",
"print('Confusion Matrix of the SGD-classifier')\n",
"print(metrics.confusion_matrix(y_test, sgd_clf.predict(X_test)))"
2014-06-25 22:06:16 +00:00
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Confusion Matrix of the SGD-classifier\n",
"[[14 0 0]\n",
" [ 0 18 0]\n",
" [ 0 0 22]]\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>"
]
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Comma-Separated-Values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[[back to top]](#Sections)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also a good idea to save your data in common text formats, such as the CSV format that we started with. But first, let us prepend the class labels as the first column of the training and test data sets."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"training_data = np.hstack((y_train.reshape(y_train.shape[0], 1), X_train))\n",
"test_data = np.hstack((y_test.reshape(y_test.shape[0], 1), X_test))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 21
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can save our test and training datasets as 2 separate CSV files using the [`numpy.savetxt`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html) function."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"np.savetxt('./training_set.csv', training_data, delimiter=',')\n",
"np.savetxt('./test_set.csv', test_data, delimiter=',')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 22
}
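,
{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "To verify the round trip, the CSV files can be read back with [`numpy.loadtxt`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html); a sketch, assuming the files written in the cell above:"
 ]
},
{
 "cell_type": "code",
 "collapsed": false,
 "input": [
  "# Read the CSV files back into NumPy arrays; the first column holds\n",
  "# the class labels, the remaining columns hold the features\n",
  "training_data2 = np.loadtxt('./training_set.csv', delimiter=',')\n",
  "test_data2 = np.loadtxt('./test_set.csv', delimiter=',')\n",
  "\n",
  "assert training_data2.shape == training_data.shape\n",
  "assert test_data2.shape == test_data.shape"
 ],
 "language": "python",
 "metadata": {},
 "outputs": []
}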
],
"metadata": {}
}
]
}