diff --git a/README.md b/README.md index 0995fe8..99b97ec 100755 --- a/README.md +++ b/README.md @@ -42,6 +42,8 @@ - Awesome things that you can do in IPython Notebooks (in progress) [[IPython nb](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/awesome_things_ipynb.ipynb)] +- A collection of useful regular expressions [[IPython nb](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/tutorials/useful_regex.ipynb)] +
@@ -92,7 +94,7 @@ GitHub repository [One-Python-benchmark-per-day](https://github.com/rasbt/One-Py - Numeric matrix manipulation - The cheat sheet for MATLAB, Python NumPy, R, and Julia [[Markdown](./tutorials/matrix_cheatsheet.md)] -- [Python Book Reviews](./other/python_book_reviews.md) +- Python Book Reviews [[Markdown](./other/python_book_reviews.md)]
diff --git a/tutorials/useful_regex.ipynb b/tutorials/useful_regex.ipynb new file mode 100644 index 0000000..1f4c880 --- /dev/null +++ b/tutorials/useful_regex.ipynb @@ -0,0 +1,826 @@ +{ + "metadata": { + "name": "", + "signature": "sha256:9fd7d5201ce5b97fadad65f2c30cfec993fc83907e04418b032bd1bbdac05ff4" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[Sebastian Raschka](http://sebastianraschka.com) \n", + "\n", + "- [Link to this IPython notebook on Github](https://github.com/rasbt/python_reference/blob/master/tutorials/useful_regex.ipynb) " + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%load_ext watermark" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 1 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%watermark -d -v -u -t -z" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "Last updated: 06/07/2014 10:07:02 EDT\n", + "\n", + "CPython 3.4.1\n", + "IPython 2.1.0\n" + ] + } + ], + "prompt_number": 2 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[More information](http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/ipython_magic/watermark.ipynb) about the `watermark` magic command extension." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "I would be happy to hear your comments and suggestions. \n", + "Please feel free to drop me a note via\n", + "[twitter](https://twitter.com/rasbt), [email](mailto:bluewoodtree@gmail.com), or [google+](https://plus.google.com/+SebastianRaschka).\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 1, + "metadata": {}, + "source": [ + "A collection of useful regular expressions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Sections" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- [About the `re` module](#About-the-re-module)\n", + "- [Identify files via file extensions](#Identify-files-via-file-extensions)\n", + "- [Username validation](#Username-validation)\n", + "- [Checking for valid email addresses](#Checking-for-valid-email-addresses)\n", + "- [Check for a valid URL](#Check-for-a-valid-URL)\n", + "- [Checking for integers](#Checking-for-integers)\n", + "- [Validating dates](#Validating-dates)\n", + "- [Time](#Time)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "About the `re` module" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The purpose of this IPython notebook is not to write yet another detailed tutorial about regular expressions or the built-in Python `re` module, but to collect some useful regular expressions for copy-and-paste purposes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A detailed introduction to the Python `re` module (the Regular Expression HOWTO) can be found here: [https://docs.python.org/3.4/howto/regex.html](https://docs.python.org/3.4/howto/regex.html). Below, I just want to list the most important functions for convenience:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- `re.match()` : Determine if the RE matches at the beginning of the string.\n", + "- `re.search()` : Scan through a string, looking for any location where this RE matches.\n", + "- `re.findall()` : Find all substrings where the RE matches, and return them as a list.\n", + "- `re.finditer()` : Find all substrings where the RE matches, and return them as an iterator." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you are using the same regular expression multiple times, it is recommended to compile it for improved performance.\n", + "\n", + " compiled_re = re.compile(r'some_regexpr') \n", + " for word in text:\n", + " match = compiled_re.search(word)\n", + " # do something with the match\n", + " \n", + "**E.g., if we want to check if a string ends with one of several substrings:**" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "import re\n", + "\n", + "needle = 'needlers'\n", + "\n", + "# Python approach\n", + "print(bool(any([needle.endswith(e) for e in ('ly', 'ed', 'ing', 'ers')])))\n", + "\n", + "# On-the-fly Regular expression in Python\n", + "print(bool(re.search(r'(?:ly|ed|ing|ers)$', needle)))\n", + "\n", + "# Compiled Regular expression in Python\n", + "comp = re.compile(r'(?:ly|ed|ing|ers)$') \n", + "print(bool(comp.search(needle)))" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "True\n", + "True\n", + "True\n" + ] + } + ], + "prompt_number": 3 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%timeit -n 10000 -r 50 bool(any([needle.endswith(e) for e in ('ly', 'ed', 'ing', 'ers')]))\n", + "%timeit -n 10000 -r 50 bool(re.search(r'(?:ly|ed|ing|ers)$', needle))\n", + "%timeit -n 10000 -r 50 bool(comp.search(needle))" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "10000 loops, best of 50: 2.74 \u00b5s per loop\n", + "10000 loops, best of 50: 2.93 \u00b5s per loop" + ] + }, + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "\n", + "10000 loops, best of 50: 1.28 \u00b5s per loop" + ] + }, + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "\n" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Identify files via file extensions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A regular expression to check for file extensions." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = r'(?i)(\\w+)\\.(jpeg|jpg|png|gif|tif|svg)$'\n", + "\n", + "# remove `(?i)` to make regexpr case-sensitive\n", + "\n", + "str_true = ('test.gif', \n", + " 'image.jpeg', \n", + " 'image.jpg',\n", + " 'image.TIF'\n", + " )\n", + "\n", + "str_false = ('test.pdf',\n", + " 'test.gif.pdf',\n", + " )\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 5 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Username validation" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Checking for a valid user name that has a certain minimum and maximum length.\n", + "\n", + "Allowed characters:\n", + "- letters (upper- and lower-case)\n", + "- numbers\n", + "- dashes\n", + "- underscores" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "min_len = 5 # minimum length for a valid username\n", + "max_len = 15 # maximum length for a valid username\n", + "\n", + "pattern = r\"^(?i)[a-z0-9_-]{%s,%s}$\" %(min_len, max_len)\n", + "\n", + "# remove `(?i)` to only allow lower-case letters\n", + "\n", + "\n", + "\n", + "str_true = ('user123', '123_user', 'Username')\n", + " \n", + "str_false = ('user', 'username1234_is-way-too-long', 'user$34354')\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 6 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Checking for valid email addresses" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A regular expression that captures most email addresses." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = r\"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$)\"\n", + "\n", + "str_true = ('test@mail.com',)\n", + " \n", + "str_false = ('testmail.com', '@testmail.com', 'test@mailcom')\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 7 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "source: [http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Check for a valid URL" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Checks whether a string looks like a URL, i.e., whether it ...\n", + "\n", + "- optionally starts with `https://` or `http://` (a leading `www.` is matched as part of the domain)\n", + "- contains a domain name followed by a dot extension (so file names such as `test.jpeg` also match)" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = '^(https?:\\/\\/)?([\\da-z\\.-]+)\\.([a-z\\.]{2,6})([\\/\\w \\.-]*)*\\/?$'\n", + "\n", + "str_true = ('https://github.com', \n", + " 'http://github.com',\n", + " 'www.github.com',\n", + " 'github.com',\n", + " 'test.de',\n", + " 'https://github.com/rasbt',\n", + " 'test.jpeg' # !!! \n", + " )\n", + " \n", + "str_false = ('testmailcom', 'http:testmailcom', )\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 8 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "source: [http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149](http://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Checking for integers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "heading", + "level": 3, + "metadata": {}, + "source": [ + "Positive integers" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = '^\\d+$'\n", + "\n", + "str_true = ('123', '1', )\n", + " \n", + "str_false = ('abc', '1.1', )\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 9 + }, + { + "cell_type": "heading", + "level": 3, + "metadata": {}, + "source": [ + "Negative integers" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = '^-\\d+$'\n", + "\n", + "str_true = ('-123', '-1', )\n", + " \n", + "str_false = ('123', '-abc', '-1.1', )\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 10 + }, + { + "cell_type": "heading", + "level": 3, + "metadata": {}, + "source": [ + "All integers" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = '^-{0,1}\\d+$'\n", + "\n", + "str_true = ('-123', '-1', '1', '123',)\n", + " \n", + "str_false = ('123.0', '-abc', '-1.1', )\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + 
"prompt_number": 11 + }, + { + "cell_type": "heading", + "level": 3, + "metadata": {}, + "source": [ + "Positive numbers" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = '^\\d*\\.{0,1}\\d+$'\n", + "\n", + "str_true = ('1', '123', '1.234', )\n", + " \n", + "str_false = ('-abc', '-123', '-123.0')\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 12 + }, + { + "cell_type": "heading", + "level": 3, + "metadata": {}, + "source": [ + "Negative numbers" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = '^-\\d*\\.{0,1}\\d+$'\n", + "\n", + "str_true = ('-1', '-123', '-123.0', )\n", + " \n", + "str_false = ('-abc', '1', '123', '1.234', )\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 13 + }, + { + "cell_type": "heading", + "level": 3, + "metadata": {}, + "source": [ + "All numbers" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = '^-{0,1}\\d*\\.{0,1}\\d+$'\n", + "\n", + "str_true = ('1', '123', '1.234', '-123', '-123.0')\n", + " \n", + "str_false = ('-abc', )\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 14 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "source: 
[http://stackoverflow.com/questions/1449817/what-are-some-of-the-most-useful-regular-expressions-for-programmers](http://stackoverflow.com/questions/1449817/what-are-some-of-the-most-useful-regular-expressions-for-programmers)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Validating dates" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Validates dates in `mm/dd/yyyy` format." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = '^(0[1-9]|1[0-2])\\/(0[1-9]|1\\d|2\\d|3[01])\\/(19|20)\\d{2}$'\n", + "\n", + "str_true = ('01/08/2014', '12/30/2014', )\n", + " \n", + "str_false = ('22/08/2014', '-123', '1/8/2014', '1/08/2014', '01/8/2014')\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 15 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "
" + ] + }, + { + "cell_type": "heading", + "level": 2, + "metadata": {}, + "source": [ + "Time" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "[[back to top](#Sections)]" + ] + }, + { + "cell_type": "heading", + "level": 3, + "metadata": {}, + "source": [ + "12-Hour format" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = r'^(1[012]|[1-9]):[0-5][0-9](\\s)?(?i)(am|pm)$'\n", + "\n", + "str_true = ('2:00pm', '7:30 AM', '12:05 am', )\n", + " \n", + "str_false = ('22:00pm', '14:00', '3:12', '03:12pm', )\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 29 + }, + { + "cell_type": "heading", + "level": 3, + "metadata": {}, + "source": [ + "24-Hour format" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "pattern = r'^([0-1]{1}[0-9]{1}|20|21|22|23):[0-5]{1}[0-9]{1}$'\n", + "\n", + "str_true = ('14:00', '00:30', )\n", + " \n", + "str_false = ('22:00pm', '4:00', )\n", + "\n", + "for t in str_true:\n", + " assert(bool(re.match(pattern, t)) == True), '%s is not True' %t\n", + "\n", + "for f in str_false:\n", + " assert(bool(re.match(pattern, f)) == False), '%s is not False' %f" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 33 + } + ], + "metadata": {} + } + ] +} \ No newline at end of file