Setting Up a PostgreSQL Database

2019-02-23
4 min read

I’ve been doing some pro bono data and statistics work for a nonprofit focused on keeping constituents informed of how well their elected representatives are representing them between elections. The goal is to allow users to vote along with their representatives and to receive information on how aligned their views are with the votes of their representatives.

We are trying to maintain and analyze data from a number of different sources: about the users (e.g. congressional district, join date); about the bills being voted on; about user engagement; and about numerous other topics. In most of my past work, I was given a dataset to analyze. Now I’m faced with the challenge of building up a database to hopefully maintain and use well into the future.

To that end, I’ve set up a (local) PostgreSQL database. I’ve had passing encounters with SQL in the past, but this is my first time working with it in any serious capacity. I’m starting from scratch. I mostly followed the Official PostgreSQL Tutorial, which provides something of a “quick start” guide. I pulled information and tips from a bunch of other sources over the course of setting up the database. This is my attempt to organize that information so I can reproduce it in the future if needed.

Installation

I am using Arch Linux, so the installation and setup notes are specific to that context. I mostly followed the materials posted on the Arch Linux wiki.

  1. First, install the PostgreSQL package

    sudo pacman -S PostgreSQL
    
  2. Switch to the PostgreSQL user

    $ su
    $ su -l postgres
    
  3. Initialize the database cluster

    $ initdb -D /var/lib/postgres/data
    
  4. Return to regular user (with exit or <C-d>)

  5. start and enable the postgresql.service

    systemctl start postgresql.service
    systemctl enable postgresql.service
    
  6. Make a user

    documentation

    createuser --interactive
    
  7. Make a database

    documentation

    We use the -O flag to assign the owner of the database. In the previous step, I created a user named dan. I assigned the new database (<database-name>) to dan.

    createdb <database-name> -O dan
    

Basic Operations with psql

The postgresql package comes with psql, a PostgreSQL interactive terminal. psql allows us to interactively work with our PostgreSQL databases – we can create, modify, and query databases from the psql terminal. It also provides various tools for working with scripts and for automating commonly-used procedures.

We can access a given database, from bash, as follows:

psql -d <database-name> -U dan

psql provides many useful tools that are not part of the SQL language. A sampling of useful utilities include:

  • \help: get help
  • \c <database>: connect to database
  • \du: list users and permissions
  • \q or <C-d>: exit the psql shell
  • \?: list other meta-commands

Creating a Table

Tables can be created as follows (the below is written in SQL and is not a psql terminal command, though it can be entered directly into the psql terminal).

CREATE TABLE <table-name> (
col1            varchar(80),  -- variable-length character
col2            int,          -- integer
col3            timestamp,    -- date/time
col4            real,         -- real number
col5            date          -- date
 );

Each line above includes the column name (e.g. col1), the variable type (e.g. varchar(80)), and a comment (text following --). This basic syntax creates an empty table within our database.

Adding Some Data to the Table

There are numerous ways to enter data into the table. We can make use of the SQL INSERT INTO commands:

INSERT INTO <table-name> VALUES ('San Francisco', 46, '2014-11-17 01:22:50', 0.25, '1997-01-13');

This will fill in one row of data. It assumes we have positioned the data according to the order of the columns in the table. We can also specify exactly which columns we are filling (for example, if we don’t know the order):

INSERT INTO <table-name> (col1, col2, col3, col4, col5)
  VALUES ('San Francisco', 46, '2014-11-17 01:22:50', 0.25, '1997-01-13');

In my case, I had some CSV files I wanted to load into the database. This can be accomplished with the SQL COPY command or with the psql \copy command. I opted to use the \copy approach, following a lot of advice I found from various sources online.

\copy <table-name>(col1, col2, col3, col4, col5) FROM '/path/to/file/' DELIMITER ',' CSV HEADER;

We specify the (already-created but as-of-yet unpopulated) table in our database to which we want to import the data first, and the file source second. By noting that the CSV has a header, we tell the \copy command to omit the first row.

Concluding Notes

The above represents my very first steps into the world of database design and SQL. Much more will follow. I have set up several tables in this database have connected them with primary and foreign keys – a concept I will explore further later. I plan, in the near future, to figure out how to interact with this database using python and/or R; how to set up easily-reproducible queries to answer common questions of interest; and how, in general, to make the best use of the database in a data science and statistics workflow.