python_reference/tutorials/useful_regex.ipynb

{
 "metadata": {
  "name": "",
  "signature": "sha256:9fd7d5201ce5b97fadad65f2c30cfec993fc83907e04418b032bd1bbdac05ff4"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[Sebastian Raschka](http://sebastianraschka.com)  \n",
      "\n",
      "- [Link to this IPython notebook on Github](https://github.com/rasbt/python_reference/blob/master/tutorials/useful_regex.ipynb)  "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%load_ext watermark"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%watermark -d -v -u -t -z"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Last updated: 06/07/2014 10:07:02 EDT\n",
        "\n",
        "CPython 3.4.1\n",
        "IPython 2.1.0\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<font size=\"1.5em\">[More information](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/ipython_magic/watermark.ipynb) about the `watermark` magic command extension.</font>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<hr>\n",
      "I would be happy to hear your comments and suggestions.  \n",
      "Please feel free to drop me a note via\n",
      "[twitter](https://twitter.com/rasbt), [email](mailto:bluewoodtree@gmail.com), or [google+](https://plus.google.com/+SebastianRaschka).\n",
      "<hr>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "A collection of useful regular expressions"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<br>\n",
      "<br>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Sections"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "- [About the `re` module](#About-the-re-module)\n",
      "- [Identify files via file extensions](#Identify-files-via-file-extensions)\n",
      "- [Username validation](#Username-validation)\n",
      "- [Checking for valid email addresses](#Checking-for-valid-email-addresses)\n",
      "- [Check for a valid URL](#Check-for-a-valid-URL)\n",
      "- [Checking for integers](#Checking-for-integers)\n",
      "- [Validating dates](#Validating-dates)\n",
      "- [Time](#Time)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<br>\n",
      "<br>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "About the `re` module"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[[back to top](#Sections)]"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The purpose of this IPython notebook is not to rewrite a detailed tutorial about regular expressions or the built-in Python `re` module, but to collect some useful regular expressions for copy-and-paste purposes."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The Regular Expression HOWTO for the Python `re` module can be found at [https://docs.python.org/3.4/howto/regex.html](https://docs.python.org/3.4/howto/regex.html). Below, I just want to list the most important functions for convenience:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "- `re.match()`  : Determine if the RE matches at the beginning of the string.\n",
      "- `re.search()` : Scan through a string, looking for any location where this RE matches.\n",
      "- `re.findall()` : Find all substrings where the RE matches, and return them as a list.\n",
      "- `re.finditer()` : Find all substrings where the RE matches, and return them as an iterator."
     ]
    },
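    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The differences between these four functions can be illustrated with a short sketch. The sample string and the patterns below are made up for illustration and are not part of the original collection."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "\n",
      "text = 'cat bat rat'       # hypothetical sample string\n",
      "pattern = r'[a-z]at'\n",
      "\n",
      "# re.match only tries to match at the beginning of the string\n",
      "print(re.match(pattern, text).group())      # cat\n",
      "\n",
      "# re.search returns the first match found anywhere in the string\n",
      "print(re.search(r'[br]at', text).group())   # bat\n",
      "\n",
      "# re.findall returns all non-overlapping matches as a list of strings\n",
      "print(re.findall(pattern, text))            # ['cat', 'bat', 'rat']\n",
      "\n",
      "# re.finditer yields match objects, which also carry the match positions\n",
      "print([m.start() for m in re.finditer(pattern, text)])  # [0, 4, 8]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },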
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you are using the same regular expression multiple times, it is recommended to compile it for improved performance.\n",
      "\n",
      "    compiled_re = re.compile(r'some_regexpr')\n",
      "    for word in text:\n",
      "        match = compiled_re.search(word)\n",
      "        # do something with the match\n",
      "    \n",
      "**For example, to check whether a string ends with one of several substrings:**"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "\n",
      "needle = 'needlers'\n",
      "\n",
      "# Python approach\n",
      "print(bool(any([needle.endswith(e) for e in ('ly', 'ed', 'ing', 'ers')])))\n",
      "\n",
      "# On-the-fly Regular expression in Python\n",
      "print(bool(re.search(r'(?:ly|ed|ing|ers)$', needle)))\n",
      "\n",
      "# Compiled Regular expression in Python\n",
      "comp = re.compile(r'(?:ly|ed|ing|ers)$') \n",
      "print(bool(comp.search(needle)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "True\n",
        "True\n",
        "True\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit -n 10000 -r 50 bool(any([needle.endswith(e) for e in ('ly', 'ed', 'ing', 'ers')]))\n",
      "%timeit -n 10000 -r 50 bool(re.search(r'(?:ly|ed|ing|ers)$', needle))\n",
      "%timeit -n 10000 -r 50 bool(comp.search(needle))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "10000 loops, best of 50: 2.74 \u00b5s per loop\n",
        "10000 loops, best of 50: 2.93 \u00b5s per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "10000 loops, best of 50: 1.28 \u00b5s per loop"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<br>\n",
      "<br>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Identify files via file extensions"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[[back to top](#Sections)]"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A regular expression to check for file extensions."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'(?i)(\\w+)\\.(jpeg|jpg|png|gif|tif|svg)$'\n",
      "\n",
      "# remove `(?i)` to make regexpr case-sensitive\n",
      "\n",
      "str_true = ('test.gif', \n",
      "            'image.jpeg', \n",
      "            'image.jpg',\n",
      "            'image.TIF'\n",
      "            )\n",
      "\n",
      "str_false = ('test.pdf',\n",
      "             'test.gif.pdf',\n",
      "             )\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
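    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Because the pattern contains two capture groups, the match object can also be used to split a file name into its stem and extension. A minimal sketch, reusing the pattern from above with a made-up file name:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "\n",
      "pattern = r'(?i)(\\w+)\\.(jpeg|jpg|png|gif|tif|svg)$'\n",
      "\n",
      "m = re.match(pattern, 'image.TIF')   # hypothetical file name\n",
      "\n",
      "# group(1) is the file stem, group(2) the extension as written\n",
      "print(m.group(1))            # image\n",
      "print(m.group(2).lower())    # tif"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },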
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<br>\n",
      "<br>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Username validation"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[[back to top](#Sections)]"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Checking for a valid user name that has a certain minimum and maximum length.\n",
      "\n",
      "Allowed characters:\n",
      "- letters (upper- and lower-case)\n",
      "- numbers\n",
      "- dashes\n",
      "- underscores"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "min_len = 5 # minimum length for a valid username\n",
      "max_len = 15 # maximum length for a valid username\n",
      "\n",
      "pattern = r\"(?i)^[a-z0-9_-]{%s,%s}$\" %(min_len, max_len)\n",
      "\n",
      "# remove `(?i)` to only allow lower-case letters\n",
      "\n",
      "\n",
      "\n",
      "str_true = ('user123', '123_user', 'Username')\n",
      "            \n",
      "str_false = ('user', 'username1234_is-way-too-long', 'user$34354')\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<br>\n",
      "<br>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Checking for valid email addresses"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[[back to top](#Sections)]"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A regular expression that captures most email addresses."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r\"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$)\"\n",
      "\n",
      "str_true = ('test@mail.com',)\n",
      "            \n",
      "str_false = ('testmail.com', '@testmail.com', 'test@mailcom')\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<font size=\"1px\">source: [http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address)</font>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<br>\n",
      "<br>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Check for a valid URL"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[[back to top](#Sections)]"
     ]
    },
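    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Checks whether a string looks like a URL, i.e., whether it ...\n",
      "\n",
      "- optionally starts with `https://`, `http://`, or `www.`\n",
      "- and contains a domain followed by a dot extension (e.g., `.com`), optionally followed by a path"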
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'^(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?$'\n",
      "\n",
      "str_true = ('https://github.com', \n",
      "            'http://github.com',\n",
      "            'www.github.com',\n",
      "            'github.com',\n",
      "            'test.de',\n",
      "            'https://github.com/rasbt',\n",
      "            'test.jpeg' # !!! \n",
      "            )\n",
      "            \n",
      "str_false = ('testmailcom', 'http:testmailcom', )\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<font size=\"1px\">source: [http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149](http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149)</font>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<br>\n",
      "<br>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Checking for integers"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[[back to top](#Sections)]"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Positive integers"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'^\\d+$'\n",
      "\n",
      "str_true = ('123', '1', )\n",
      "            \n",
      "str_false = ('abc', '1.1', )\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Negative integers"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'^-\\d+$'\n",
      "\n",
      "str_true = ('-123', '-1', )\n",
      "            \n",
      "str_false = ('123', '-abc', '-1.1', )\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 10
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "All integers"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'^-{0,1}\\d+$'\n",
      "\n",
      "str_true = ('-123', '-1', '1', '123',)\n",
      "            \n",
      "str_false = ('123.0', '-abc', '-1.1', )\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Positive numbers"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'^\\d*\\.{0,1}\\d+$'\n",
      "\n",
      "str_true = ('1', '123', '1.234', )\n",
      "            \n",
      "str_false = ('-abc', '-123', '-123.0')\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 12
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Negative numbers"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'^-\\d*\\.{0,1}\\d+$'\n",
      "\n",
      "str_true = ('-1', '-123', '-123.0', )\n",
      "            \n",
      "str_false = ('-abc', '1', '123', '1.234', )\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "All numbers"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'^-{0,1}\\d*\\.{0,1}\\d+$'\n",
      "\n",
      "str_true = ('1', '123', '1.234', '-123', '-123.0')\n",
      "            \n",
      "str_false = ('-abc', )  # trailing comma makes this a tuple, not a plain string\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<font size=\"1px\">source: [http://stackoverflow.com/questions/1449817/what-are-some-of-the-most-useful-regular-expressions-for-programmers](http://stackoverflow.com/questions/1449817/what-are-some-of-the-most-useful-regular-expressions-for-programmers)</font>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<br>\n",
      "<br>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Validating dates"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[[back to top](#Sections)]"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Validates dates in `mm/dd/yyyy` format."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'^(0[1-9]|1[0-2])\\/(0[1-9]|1\\d|2\\d|3[01])\\/(19|20)\\d{2}$'\n",
      "\n",
      "str_true = ('01/08/2014', '12/30/2014', )\n",
      "            \n",
      "str_false = ('22/08/2014', '-123', '1/8/2014', '1/08/2014', '01/8/2014')\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
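    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Note that the pattern above only checks the format: it would also accept an impossible date such as `02/31/2014`. As a minimal sketch (not part of the original collection), the format check could be combined with `datetime.strptime` from the standard library, which raises a `ValueError` for dates that do not exist. The helper name `is_real_date` is hypothetical."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import re\n",
      "from datetime import datetime\n",
      "\n",
      "date_pattern = r'^(0[1-9]|1[0-2])/(0[1-9]|1\\d|2\\d|3[01])/(19|20)\\d{2}$'\n",
      "\n",
      "def is_real_date(s):\n",
      "    # hypothetical helper: format check first, then calendar check\n",
      "    if not re.match(date_pattern, s):\n",
      "        return False\n",
      "    try:\n",
      "        datetime.strptime(s, '%m/%d/%Y')\n",
      "        return True\n",
      "    except ValueError:\n",
      "        return False\n",
      "\n",
      "assert is_real_date('01/08/2014')\n",
      "assert not is_real_date('02/31/2014')  # matches the regex, but no such day\n",
      "assert not is_real_date('1/8/2014')    # rejected by the format check"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },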
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<br>\n",
      "<br>"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Time"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "[[back to top](#Sections)]"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "12-Hour format"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'(?i)^(1[012]|[1-9]):[0-5][0-9](\\s)?(am|pm)$'\n",
      "\n",
      "str_true = ('2:00pm', '7:30 AM', '12:05 am', )\n",
      "            \n",
      "str_false = ('22:00pm', '14:00', '3:12', '03:12pm', )\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 29
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "24-Hour format"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pattern = r'^([0-1]{1}[0-9]{1}|20|21|22|23):[0-5]{1}[0-9]{1}$'\n",
      "\n",
      "str_true = ('14:00', '00:30', )\n",
      "            \n",
      "str_false = ('22:00pm', '4:00', )\n",
      "\n",
      "for t in str_true:\n",
      "    assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n",
      "\n",
      "for f in str_false:\n",
      "    assert(bool(re.match(pattern, f)) == False), '%s is not False' %f"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 33
    }
   ],
   "metadata": {}
  }
 ]
}