python_reference/useful_scripts/fix_tab_csv.ipynb

{
 "metadata": {
  "name": "",
  "signature": "sha256:996358a25da6fc77c66d183e79209307af06bd2f9abb0656d3bb70cfc2fe597a"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Sebastian Raschka 05/09/2014"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Fixing CSV files"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We have a directory `../CSV_files_raw/` with CSV files where some of them have 'tab-separated' and some of them 'comma-separated' columns.  \n",
      "Here, we will 'fix' them, i.e., have them all comma-separated, and save them to a new directory `../CSV_fixed`."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, we create a dictionary with the file basenames as keys. The values are lists of the file paths to the raw and new fixed CSV files. e.g., \n",
      "\n",
      "    {\n",
      "    'abc.csv': ['../CSV_files_raw/abc.csv', '../CSV_fixed/abc.csv'], \n",
      "    'def.csv': ['../CSV_files_raw/def.csv', '../CSV_fixed/def.csv'], \n",
      "    ...\n",
      "    }"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import sys\n",
      "import os\n",
      "\n",
      "raw_dir = '../CSV_files_raw/'\n",
      "fixed_dir = '../CSV_fixed'\n",
      "\n",
      "if not os.path.exists(fixed_dir):\n",
      "    os.mkdir(fixed_dir)\n",
      "\n",
      "f_dict = {os.path.basename(f):[os.path.join(raw_dir, f),\n",
      "                                  os.path.join(fixed_dir, f)]\n",
      "             for f in os.listdir(raw_dir) if f.endswith('.csv')} "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we can replace the tabs with commas for the new files very easily:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for f in f_dict.keys():\n",
      "    with open(f_dict[f][0], 'r') as raw, open(f_dict[f][1], 'w') as fixed:\n",
      "        for line in raw:\n",
      "            line = line.strip().split('\\t')\n",
      "            fixed.write(','.join(line) + '\\n')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    }
   ],
   "metadata": {}
  }
 ]
}