python_reference/useful_scripts/fix_tab_csv.ipynb
2014-05-12 15:25:31 -04:00

{
"metadata": {
"name": "",
"signature": "sha256:996358a25da6fc77c66d183e79209307af06bd2f9abb0656d3bb70cfc2fe597a"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sebastian Raschka 05/09/2014"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fixing CSV files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The directory `../CSV_files_raw/` contains CSV files, some with tab-separated and some with comma-separated columns. \n",
"Here, we 'fix' them, i.e., convert them all to comma-separated columns, and save the results to a new directory `../CSV_fixed`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we create a dictionary with the file basenames as keys. Each value is a list holding the path to the raw file and the path to the new, fixed file, e.g., \n",
"\n",
"    {\n",
"        'abc.csv': ['../CSV_files_raw/abc.csv', '../CSV_fixed/abc.csv'], \n",
"        'def.csv': ['../CSV_files_raw/def.csv', '../CSV_fixed/def.csv'], \n",
"        ...\n",
"    }"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import os\n",
"\n",
"raw_dir = '../CSV_files_raw/'\n",
"fixed_dir = '../CSV_fixed'\n",
"\n",
"# Create the output directory if it does not exist yet.\n",
"if not os.path.exists(fixed_dir):\n",
"    os.mkdir(fixed_dir)\n",
"\n",
"# Map each CSV basename to [path of raw file, path of fixed file].\n",
"f_dict = {f: [os.path.join(raw_dir, f),\n",
"              os.path.join(fixed_dir, f)]\n",
"          for f in os.listdir(raw_dir) if f.endswith('.csv')}"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can read each raw file line by line and write it to the new file with the tabs replaced by commas:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
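"# Rewrite each raw file line by line, replacing tab delimiters with commas.\n",
"for raw_path, fixed_path in f_dict.values():\n",
"    with open(raw_path, 'r') as raw, open(fixed_path, 'w') as fixed:\n",
"        for line in raw:\n",
"            # Strip only the line ending so empty leading/trailing fields are kept.\n",
"            fields = line.rstrip('\\r\\n').split('\\t')\n",
"            fixed.write(','.join(fields) + '\\n')"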
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 11
}
],
"metadata": {}
}
]
}