python_reference

mirror of https://github.com/rasbt/python_reference.git synced 2025-05-13 04:45:39 +00:00

History

There are small typos in:
- tutorials/installing_scientific_packages.md
- tutorials/sqlite3_howto/README.md
- tutorials/sqlite3_howto/code/update_or_insert_records.py
- useful_scripts/conc_gzip_files.py

Fixes:
- Should read `existing` rather than `exisiting`.
- Should read `conveniently` rather than `conviniently`.
- Should read `calculate` rather than `calulate`.
- Should read `accommodate` rather than `accomodate`.

2022-06-27 13:19:05 +10:00

code

docs: Fix a few typos

2022-06-27 13:19:05 +10:00

LICENSE

reorganization

2014-06-01 12:59:24 -04:00

README.md

docs: Fix a few typos

2022-06-27 13:19:05 +10:00

README.md

A thorough guide to SQLite database operations in Python

-- written by Sebastian Raschka on March 7, 2014

Sections

• Connecting to an SQLite database
• Creating a new SQLite database
- Overview of SQLite data types
- A quick word on PRIMARY KEYS:
• Adding new columns
• Inserting and updating rows
• Creating unique indexes
• Querying the database - Selecting rows
• Security and injection attacks
• Date and time operations
• Printing a database summary
• Conclusion

The complete Python code that I am using in this tutorial can be downloaded from my GitHub repository: https://github.com/rasbt/python_reference/tree/master/tutorials/sqlite3_howto

Connecting to an SQLite database

The sqlite3 that we will be using throughout this tutorial is part of the Python Standard Library and is a nice and easy interface to SQLite databases: There are no server processes involved, no configurations required, and no other obstacles we have to worry about.

In general, the only thing that needs to be done before we can perform any operation on a SQLite database via Python's sqlite3 module, is to open a connection to an SQLite database file:

import sqlite3
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

where the database file (sqlite_file) can reside anywhere on our disk, e.g.,

sqlite_file = '/Users/Sebastian/Desktop/my_db.sqlite'

Conveniently, a new database file (.sqlite file) will be created automatically the first time we try to connect to a database. However, we have to be aware that it won't have a table, yet. In the following section, we will take a look at some example code of how to create a new SQLite database files with tables for storing some data.

To round up this section about connecting to a SQLite database file, there are two more operations that are worth mentioning. If we are finished with our operations on the database file, we have to close the connection via the .close() method:

conn.close()

And if we performed any operation on the database other than sending queries, we need to commit those changes via the .commit() method before we close the connection:

conn.commit()
conn.close()

Creating a new SQLite database

Let us have a look at some example code to create a new SQLite database file with two tables: One with and one without a PRIMARY KEY column (don't worry, there is more information about PRIMARY KEYs further down in this section).

import sqlite3

sqlite_file = 'my_first_db.sqlite'    # name of the sqlite database file
table_name1 = 'my_table_1'  # name of the table to be created
table_name2 = 'my_table_2'  # name of the table to be created
new_field = 'my_1st_column' # name of the column
field_type = 'INTEGER'  # column data type

# Connecting to the database file
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

# Creating a new SQLite table with 1 column
c.execute('CREATE TABLE {tn} ({nf} {ft})'\
        .format(tn=table_name1, nf=new_field, ft=field_type))

# Creating a second table with 1 column and set it as PRIMARY KEY
# note that PRIMARY KEY column must consist of unique values!
c.execute('CREATE TABLE {tn} ({nf} {ft} PRIMARY KEY)'\
        .format(tn=table_name2, nf=new_field, ft=field_type))

# Committing changes and closing the connection to the database file
conn.commit()
conn.close()

Download the script: create_new_db.py

Tip: A handy tool to visualize and access SQLite databases is the free FireFox SQLite Manager add-on. Throughout this article, I will use this tool to provide screenshots of the database structures that we created below the corresponding code sections.

Using the code above, we created a new .sqlite database file with 2 tables. Each table consists of currently one column only, which is of type INTEGER.

Here is a quick overview of all data types that are supported by SQLite 3:

INTEGER: A signed integer up to 8 bytes depending on the magnitude of the value.
REAL: An 8-byte floating point value.
TEXT: A text string, typically UTF-8 encoded (depending on the database encoding).
BLOB: A blob of data (binary large object) for storing binary data.
NULL: A NULL value, represents missing data or an empty cell.

Looking at the table above, You might have noticed that SQLite 3 has no designated Boolean data type. However, this should not be an issue, since we could simply re-purpose the INTEGER type to represent Boolean values (0 = false, 1 = true).

A quick word on PRIMARY KEYS:
In our example code above, we set our 1 column in the second table to PRIMARY KEY. The advantage of a PRIMARY KEY index is a significant performance gain if we use the PRIMARY KEY column as query for accessing rows in the table. Every table can only have max. 1 PRIMARY KEY (single or multiple column(s)), and the values in this column MUST be unique! But more on column indexing in the a later section.

Adding new columns

If we want to add a new column to an existing SQLite database table, we can either leave the cells for each row empty (NULL value), or we can set a default value for each cell, which is pretty convenient for certain applications.
Let's have a look at some code:

import sqlite3

sqlite_file = 'my_first_db.sqlite'    # name of the sqlite database file
table_name = 'my_table_2'   # name of the table to be created
id_column = 'my_1st_column' # name of the PRIMARY KEY column
new_column1 = 'my_2nd_column'  # name of the new column
new_column2 = 'my_3nd_column'  # name of the new column
column_type = 'TEXT' # E.g., INTEGER, TEXT, NULL, REAL, BLOB
default_val = 'Hello World' # a default value for the new column rows

# Connecting to the database file
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

# A) Adding a new column without a row value
c.execute("ALTER TABLE {tn} ADD COLUMN '{cn}' {ct}"\
        .format(tn=table_name, cn=new_column1, ct=column_type))

# B) Adding a new column with a default row value
c.execute("ALTER TABLE {tn} ADD COLUMN '{cn}' {ct} DEFAULT '{df}'"\
        .format(tn=table_name, cn=new_column2, ct=column_type, df=default_val))

# Committing changes and closing the connection to the database file
conn.commit()
conn.close()

Download the script: add_new_column.py

We just added 2 more columns (my_2nd_column and my_3rd_column) to my_table_2 of our SQLite database next to the PRIMARY KEY column my_1st_column.
The difference between the two new columns is that we initialized my_3rd_column with a default value (here:'Hello World'), which will be inserted for every existing cell under this column and for every new row that we are going to add to the table if we don't insert or update it with a different value.

Inserting and updating rows

Inserting and updating rows into an existing SQLite database table - next to sending queries - is probably the most common database operation. The Structured Query Language has a convenient UPSERT function, which is basically just a merge between UPDATE and INSERT: It inserts new rows into a database table with a value for the PRIMARY KEY column if it does not exist yet, or updates a row for an existing PRIMARY KEY value. Unfortunately, this convenient syntax is not supported by the more compact SQLite database implementation that we are using here. However, there are some workarounds. But let us first have a look at the example code:

import sqlite3

sqlite_file = 'my_first_db.sqlite'
table_name = 'my_table_2'
id_column = 'my_1st_column'
column_name = 'my_2nd_column'

# Connecting to the database file
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

# A) Inserts an ID with a specific value in a second column 
try:
    c.execute("INSERT INTO {tn} ({idf}, {cn}) VALUES (123456, 'test')".\
        format(tn=table_name, idf=id_column, cn=column_name))
except sqlite3.IntegrityError:
    print('ERROR: ID already exists in PRIMARY KEY column {}'.format(id_column))

# B) Tries to insert an ID (if it does not exist yet)
# with a specific value in a second column 
c.execute("INSERT OR IGNORE INTO {tn} ({idf}, {cn}) VALUES (123456, 'test')".\
        format(tn=table_name, idf=id_column, cn=column_name))

# C) Updates the newly inserted or pre-existing entry            
c.execute("UPDATE {tn} SET {cn}=('Hi World') WHERE {idf}=(123456)".\
        format(tn=table_name, cn=column_name, idf=id_column))

conn.commit()
conn.close()

Download the script: update_or_insert_records.py

Both A) INSERT and B) INSERT OR IGNORE have in common that they append new rows to the database if a given PRIMARY KEY does not exist in the database table, yet. However, if we'd try to append a PRIMARY KEY value that is not unique, a simple INSERT would raise an sqlite3.IntegrityError exception, which can be either captured via a try-except statement (case A) or circumvented by the SQLite call INSERT OR IGNORE (case B). This can be pretty useful if we want to construct an UPSERT equivalent in SQLite. E.g., if we want to add a dataset to an existing database table that contains a mix between existing and new IDs for our PRIMARY KEY column.

Creating unique indexes

Just like hashtable-datastructures, indexes function as direct pointers to our data in a table for a particular column (i.e., the indexed column). For example, the PRIMARY KEY column would have such an index by default. The downside of indexes is that every row value in the column must be unique. However, it is recommended and pretty useful to index certain columns if possible, since it rewards us with a significant performance gain for the data retrieval.
The example code below shows how to add such an unique index to an existing column in an SQLite database table. And if we should decide to insert non- unique values into a indexed column later, there is also a convenient way to drop the index, which is also shown in the code below.

import sqlite3

sqlite_file = 'my_first_db.sqlite'    # name of the sqlite database file
table_name = 'my_table_2'   # name of the table to be created
id_column = 'my_1st_column' # name of the PRIMARY KEY column
new_column = 'unique_names'  # name of the new column
column_type = 'TEXT' # E.g., INTEGER, TEXT, NULL, REAL, BLOB
index_name = 'my_unique_index'  # name for the new unique index

# Connecting to the database file
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

# Adding a new column and update some record
c.execute("ALTER TABLE {tn} ADD COLUMN '{cn}' {ct}"\
        .format(tn=table_name, cn=new_column, ct=column_type))
c.execute("UPDATE {tn} SET {cn}='sebastian_r' WHERE {idf}=123456".\
        format(tn=table_name, idf=id_column, cn=new_column))

# Creating an unique index
c.execute('CREATE INDEX {ix} on {tn}({cn})'\
        .format(ix=index_name, tn=table_name, cn=new_column))

# Dropping the unique index
# E.g., to avoid future conflicts with update/insert functions
c.execute('DROP INDEX {ix}'.format(ix=index_name))

# Committing changes and closing the connection to the database file
conn.commit()
conn.close()

Download the script: create_unique_index.py

Querying the database - Selecting rows

After we learned about how to create and modify SQLite databases, it's about time for some data retrieval. The code below illustrates how we can retrieve row entries for all or some columns if they match certain criteria.

import sqlite3

sqlite_file = 'my_first_db.sqlite'    # name of the sqlite database file
table_name = 'my_table_2'   # name of the table to be queried
id_column = 'my_1st_column'
some_id = 123456
column_2 = 'my_2nd_column'
column_3 = 'my_3rd_column'    

# Connecting to the database file
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

# 1) Contents of all columns for row that match a certain value in 1 column
c.execute('SELECT * FROM {tn} WHERE {cn}="Hi World"'.\
        format(tn=table_name, cn=column_2))
all_rows = c.fetchall()
print('1):', all_rows)

# 2) Value of a particular column for rows that match a certain value in column_1 
c.execute('SELECT ({coi}) FROM {tn} WHERE {cn}="Hi World"'.\
        format(coi=column_2, tn=table_name, cn=column_2))
all_rows = c.fetchall()
print('2):', all_rows)

# 3) Value of 2 particular columns for rows that match a certain value in 1 column
c.execute('SELECT {coi1},{coi2} FROM {tn} WHERE {coi1}="Hi World"'.\
        format(coi1=column_2, coi2=column_3, tn=table_name, cn=column_2))
all_rows = c.fetchall()
print('3):', all_rows)

# 4) Selecting only up to 10 rows that match a certain value in 1 column
c.execute('SELECT * FROM {tn} WHERE {cn}="Hi World" LIMIT 10'.\
        format(tn=table_name, cn=column_2))
ten_rows = c.fetchall()
print('4):', ten_rows)

# 5) Check if a certain ID exists and print its column contents
c.execute("SELECT * FROM {tn} WHERE {idf}={my_id}".\
        format(tn=table_name, cn=column_2, idf=id_column, my_id=some_id))
id_exists = c.fetchone()
if id_exists:
    print('5): {}'.format(id_exists))
else:
    print('5): {} does not exist'.format(some_id))

# Closing the connection to the database file
conn.close()

Download the script: selecting_entries.py

if we use the .fetchall() method, we return a list of tuples from the database query, where each tuple represents one row entry. The print output for the 5 different cases shown in the code above would look like this (note that we only have a table with 1 row here):

Security and injection attacks

So far, we have been using Python's string formatting method to insert parameters like table and column names into the c.execute() functions. This is fine if we just want to use the database for ourselves. However, this leaves our database vulnerable to injection attacks. For example, if our database would be part of a web application, it would allow hackers to directly communicate with the database in order to bypass login and password verification and steal data.
In order to prevent this, it is recommended to use ? place holders in the SQLite commands instead of the % formatting expression or the .format() method, which we have been using in this tutorial.
For example, instead of using

# 5) Check if a certain ID exists and print its column contents
c.execute("SELECT * FROM {tn} WHERE {idf}={my_id}".\
        format(tn=table_name, cn=column_2, idf=id_column, my_id=some_id))

in the Querying the database - Selecting rows section above, we would want to use the ? placeholder for the queried column value and include the variable(s) (here: 123456), which we want to insert, as tuple at the end of the c.execute() string.

# 5) Check if a certain ID exists and print its column contents
c.execute("SELECT * FROM {tn} WHERE {idf}=?".\
        format(tn=table_name, cn=column_2, idf=id_column), (123456,))

However, the problem with this approach is that it would only work for values, not for column or table names. So what are we supposed to do with the rest of the string if we want to protect ourselves from injection attacks? The easy solution would be to refrain from using variables in SQLite queries whenever possible, and if it cannot be avoided, we would want to use a function that strips all non-alphanumerical characters from the stored content of the variable, e.g.,

def clean_name(some_var):
    return ''.join(char for char in some_var if char.isalnum())

Date and time operations

SQLite inherited the convenient date and time operations from SQL, which are one of my favorite features of the Structured Query Language: It does not only allow us to insert dates and times in various different formats, but we can also perform simple + and - arithmetic, for example to look up entries that have been added xxx days ago.

import sqlite3

sqlite_file = 'my_first_db.sqlite'    # name of the sqlite database file
table_name = 'my_table_3'   # name of the table to be created
id_field = 'id' # name of the ID column
date_col = 'date' # name of the date column
time_col = 'time'# name of the time column
date_time_col = 'date_time' # name of the date & time column
field_type = 'TEXT'  # column data type

# Connecting to the database file
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

# Creating a new SQLite table with 1 column
c.execute('CREATE TABLE {tn} ({fn} {ft} PRIMARY KEY)'\
        .format(tn=table_name, fn=id_field, ft=field_type))

# A) Adding a new column to save date insert a row with the current date 
# in the following format: YYYY-MM-DD
# e.g., 2014-03-06
c.execute("ALTER TABLE {tn} ADD COLUMN '{cn}'"\
         .format(tn=table_name, cn=date_col))
# insert a new row with the current date and time, e.g., 2014-03-06
c.execute("INSERT INTO {tn} ({idf}, {cn}) VALUES('some_id1', DATE('now'))"\
         .format(tn=table_name, idf=id_field, cn=date_col))

# B) Adding a new column to save date and time and update with the current time
# in the following format: HH:MM:SS
# e.g., 16:26:37
c.execute("ALTER TABLE {tn} ADD COLUMN '{cn}'"\
         .format(tn=table_name, cn=time_col))
# update row for the new current date and time column, e.g., 2014-03-06 16:26:37
c.execute("UPDATE {tn} SET {cn}=TIME('now') WHERE {idf}='some_id1'"\
         .format(tn=table_name, idf=id_field, cn=time_col))

# C) Adding a new column to save date and time and update with current date-time
# in the following format: YYYY-MM-DD HH:MM:SS
# e.g., 2014-03-06 16:26:37
c.execute("ALTER TABLE {tn} ADD COLUMN '{cn}'"\
         .format(tn=table_name, cn=date_time_col))
# update row for the new current date and time column, e.g., 2014-03-06 16:26:37
c.execute("UPDATE {tn} SET {cn}=(CURRENT_TIMESTAMP) WHERE {idf}='some_id1'"\
         .format(tn=table_name, idf=id_field, cn=date_time_col))

# The database should now look like this:
# id         date           time        date_time
# "some_id1" "2014-03-06"   "16:42:30"  "2014-03-06 16:42:30"

# 4) Retrieve all IDs of entries between 2 date_times
c.execute("SELECT {idf} FROM {tn} WHERE {cn} BETWEEN '2013-03-06 10:10:10' AND '2015-03-06 10:10:10'".\
    format(idf=id_field, tn=table_name, cn=date_time_col))
all_date_times = c.fetchall()
print('4) all entries between ~2013 - 2015:', all_date_times)

# 5) Retrieve all IDs of entries between that are older than 1 day and 12 hrs
c.execute("SELECT {idf} FROM {tn} WHERE DATE('now') - {dc} >= 1 AND DATE('now') - {tc} >= 12".\
    format(idf=id_field, tn=table_name, dc=date_col, tc=time_col))
all_1day12hrs_entries = c.fetchall()
print('5) entries older than 1 day:', all_1day12hrs_entries)

# Committing changes and closing the connection to the database file
conn.commit()
conn.close()

Download the script: date_time_ops.py

Some of the really convenient functions that return the current time and date are:

DATE('now') # returns current date, e.g., 2014-03-06
TIME('now') # returns current time, e.g., 10:10:10
CURRENT_TIMESTAMP # returns current date and time, e.g., 2014-03-06 16:42:30
#  (or alternatively: DATETIME('now'))

The screenshot below shows the print outputs of the code that we used to query for entries that lie between a specified date interval using

BETWEEN '2013-03-06 10:10:10' AND '2015-03-06 10:10:10'

and entries that are older than 1 day via

WHERE DATE('now') - some_date

Note that we don't have to provide the complete time stamps here, the same syntax applies to simple dates or simple times only, too.

Update Mar 16, 2014:

If'd we are interested to calculate the hours between two DATETIME() timestamps, we can could use the handy STRFTIME() function like this

SELECT (STRFTIME('%s','2014-03-14 14:51:00') - STRFTIME('%s','2014-03-16 14:51:00'))  
 / -3600

which would calculate the difference in hours between two dates in this particular example above (here: 48) in this case.
And to calculate the difference in hours between the current DATETIME and a given DATETIME string, we could use the following SQLite syntax:

SELECT (STRFTIME('%s',DATETIME('now')) - STRFTIME('%s','2014-03-15 14:51:00')) / 3600

Retrieving column names

In the previous two sections we have seen how we query SQLite databases for data contents. Now let us have a look at how we retrieve its metadata (here: column names):

import sqlite3

sqlite_file = 'my_first_db.sqlite'
table_name = 'my_table_3'

# Connecting to the database file
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

# Retrieve column information
# Every column will be represented by a tuple with the following attributes:
# (id, name, type, notnull, default_value, primary_key)
c.execute('PRAGMA TABLE_INFO({})'.format(table_name))

# collect names in a list
names = [tup[1] for tup in c.fetchall()]
print(names)
# e.g., ['id', 'date', 'time', 'date_time']

# Closing the connection to the database file
conn.close()

Download the script: get_columnnames.py

Since we haven't created a PRIMARY KEY column for my_table_3, SQLite automatically provides an indexed rowid column with unique ascending integer values, which will be ignored in our case. Using the PRAGMA TABLE_INFO() function on our table, we return a list of tuples, where each tuple contains the following information about every column in the table: (id, name, type, notnull, default_value, primary_key).
So, in order to get the names of every column in our table, we only have to grab the 2nd value in each tuple of the returned list, which can be done by

names = [tup[1] for tup in c.fetchall()]

after the PRAGMA TABLE_INFO() call. If we would print the contents of the variable names now, the output would look like this:

Printing a database summary

I hope we covered most of the basics about SQLite database operations in the previous sections, and by now we should be well equipped to get some serious work done using SQLite in Python.
Let me conclude this tutorial with an obligatory "last but not least" and a convenient script to print a nice overview of SQLite database tables:

import sqlite3


def connect(sqlite_file):
    """ Make connection to an SQLite database file """
    conn = sqlite3.connect(sqlite_file)
    c = conn.cursor()
    return conn, c


def close(conn):
    """ Commit changes and close connection to the database """
    # conn.commit()
    conn.close()


def total_rows(cursor, table_name, print_out=False):
    """ Returns the total number of rows in the database """
    cursor.execute('SELECT COUNT(*) FROM {}'.format(table_name))
    count = cursor.fetchall()
    if print_out:
        print('\nTotal rows: {}'.format(count[0][0]))
    return count[0][0]


def table_col_info(cursor, table_name, print_out=False):
    """ Returns a list of tuples with column informations:
    (id, name, type, notnull, default_value, primary_key)
    """
    cursor.execute('PRAGMA TABLE_INFO({})'.format(table_name))
    info = cursor.fetchall()

    if print_out:
        print("\nColumn Info:\nID, Name, Type, NotNull, DefaultVal, PrimaryKey")
        for col in info:
            print(col)
    return info


def values_in_col(cursor, table_name, print_out=True):
    """ Returns a dictionary with columns as keys
    and the number of not-null entries as associated values.
    """
    cursor.execute('PRAGMA TABLE_INFO({})'.format(table_name))
    info = cursor.fetchall()
    col_dict = dict()
    for col in info:
        col_dict[col[1]] = 0
    for col in col_dict:
        c.execute('SELECT ({0}) FROM {1} '
                  'WHERE {0} IS NOT NULL'.format(col, table_name))
        # In my case this approach resulted in a
        # better performance than using COUNT
        number_rows = len(c.fetchall())
        col_dict[col] = number_rows
    if print_out:
        print("\nNumber of entries per column:")
        for i in col_dict.items():
            print('{}: {}'.format(i[0], i[1]))
    return col_dict


if __name__ == '__main__':

    sqlite_file = 'my_first_db.sqlite'
    table_name = 'my_table_3'

    conn, c = connect(sqlite_file)
    total_rows(c, table_name, print_out=True)
    table_col_info(c, table_name, print_out=True)
    # next line might be slow on large databases
    values_in_col(c, table_name, print_out=True)

    close(conn)

Download the script: print_db_info.py

Conclusion

I really hope this tutorial was helpful to you to get started with SQLite database operations via Python. I have been using the sqlite3 module a lot recently, and it has found its way into most of my programs for larger data analyses.
Currently, I am working on a novel drug screening software that requires me to store 3D structures and other functional data for ~13 million chemical compounds, and SQLite has been an invaluable part of my program to quickly store, query, analyze, and share my data.
Another smaller project that uses sqlite3 in Python would be smilite, a module to retrieve and compare SMILE strings of chemical compounds from the free ZINC online database. If you are interested, you can check it out at: https://github.com/rasbt/smilite.

If you have any suggestions or questions, please don't hesitate to write me an email or leave a comment in the comment section below! I am looking forward to your opinions and ideas, and I hope I can improve and extend this tutorial in future.