I’ve been doing some pro bono data and statistics work for a nonprofit focused on keeping constituents informed of how well their elected representatives are representing them between elections. The goal is to allow users to vote along with their representatives and to receive information on how aligned their views are with the votes of their representatives.
We are trying to maintain and analyze data from a number of different sources: about the users (e.g. congressional district, join date); about the bills being voted on; about user engagement; and about numerous other topics. In most of my past work, I was given a dataset to analyze. Now I’m faced with the challenge of building up a database to hopefully maintain and use well into the future.
To that end, I’ve set up a (local) PostgreSQL database. I’ve had passing encounters with SQL in the past, but this is my first time working with it in any serious capacity. I’m starting from scratch. I mostly followed the Official PostgreSQL Tutorial, which provides something of a “quick start” guide. I pulled information and tips from a bunch of other sources over the course of setting up the database. This is my attempt to organize that information so I can reproduce it in the future if needed.
Installation
I am using Arch Linux, so the installation and setup notes are specific to that context. I mostly followed the materials posted on the Arch Linux wiki.
-
First, install the
PostgreSQLpackagesudo pacman -S PostgreSQL -
Switch to the PostgreSQL user
$ su $ su -l postgres -
Initialize the database cluster
$ initdb -D /var/lib/postgres/data -
Return to regular user (with
exitor<C-d>) -
startandenablethepostgresql.servicesystemctl start postgresql.service systemctl enable postgresql.service -
Make a user
createuser --interactive -
Make a database
We use the
-Oflag to assign the owner of the database. In the previous step, I created a user nameddan. I assigned the new database (<database-name>) todan.createdb <database-name> -O dan
Basic Operations with psql
The postgresql package comes with psql, a PostgreSQL interactive terminal. psql allows us to interactively work with our PostgreSQL databases – we can create, modify, and query databases from the psql terminal. It also provides various tools for working with scripts and for automating commonly-used procedures.
We can access a given database, from bash, as follows:
psql -d <database-name> -U dan
psql provides many useful tools that are not part of the SQL language. A sampling of useful utilities include:
\help: get help\c <database>: connect to database\du: list users and permissions\qor<C-d>: exit thepsqlshell\?: list other meta-commands
Creating a Table
Tables can be created as follows (the below is written in SQL and is not a psql terminal command, though it can be entered directly into the psql terminal).
CREATE TABLE <table-name> (
col1 varchar(80), -- variable-length character
col2 int, -- integer
col3 timestamp, -- date/time
col4 real, -- real number
col5 date -- date
);
Each line above includes the column name (e.g. col1), the variable type (e.g. varchar(80)), and a comment (text following --). This basic syntax creates an empty table within our database.
Adding Some Data to the Table
There are numerous ways to enter data into the table. We can make use of the SQL INSERT INTO commands:
INSERT INTO <table-name> VALUES ('San Francisco', 46, '2014-11-17 01:22:50', 0.25, '1997-01-13');
This will fill in one row of data. It assumes we have positioned the data according to the order of the columns in the table. We can also specify exactly which columns we are filling (for example, if we don’t know the order):
INSERT INTO <table-name> (col1, col2, col3, col4, col5)
VALUES ('San Francisco', 46, '2014-11-17 01:22:50', 0.25, '1997-01-13');
In my case, I had some CSV files I wanted to load into the database. This can be accomplished with the SQL COPY command or with the psql \copy command. I opted to use the \copy approach, following a lot of advice I found from various sources online.
\copy <table-name>(col1, col2, col3, col4, col5) FROM '/path/to/file/' DELIMITER ',' CSV HEADER;
We specify the (already-created but as-of-yet unpopulated) table in our database to which we want to import the data first, and the file source second. By noting that the CSV has a header, we tell the \copy command to omit the first row.
Concluding Notes
The above represents my very first steps into the world of database design and SQL. Much more will follow. I have set up several tables in this database have connected them with primary and foreign keys – a concept I will explore further later. I plan, in the near future, to figure out how to interact with this database using python and/or R; how to set up easily-reproducible queries to answer common questions of interest; and how, in general, to make the best use of the database in a data science and statistics workflow.