I’ve been doing some pro bono data and statistics work for a nonprofit focused on keeping constituents informed of how well their elected representatives are representing them between elections. The goal is to allow users to vote along with their representatives and to receive information on how aligned their views are with the votes of their representatives.
We are trying to maintain and analyze data from a number of different sources: about the users (e.g. congressional district, join date); about the bills being voted on; about user engagement; and about numerous other topics. In most of my past work, I was given a dataset to analyze. Now I’m faced with the challenge of building up a database to hopefully maintain and use well into the future.
To that end, I’ve set up a (local) PostgreSQL
database. I’ve had passing encounters with SQL
in the past, but this is my first time working with it in any serious capacity. I’m starting from scratch. I mostly followed the Official PostgreSQL Tutorial, which provides something of a “quick start” guide. I pulled information and tips from a bunch of other sources over the course of setting up the database. This is my attempt to organize that information so I can reproduce it in the future if needed.
Installation
I am using Arch Linux, so the installation and setup notes are specific to that context. I mostly followed the materials posted on the Arch Linux wiki.
-
First, install the
PostgreSQL
packagesudo pacman -S PostgreSQL
-
Switch to the PostgreSQL user
$ su $ su -l postgres
-
Initialize the database cluster
$ initdb -D /var/lib/postgres/data
-
Return to regular user (with
exit
or<C-d>
) -
start
andenable
thepostgresql.service
systemctl start postgresql.service systemctl enable postgresql.service
-
Make a user
createuser --interactive
-
Make a database
We use the
-O
flag to assign the owner of the database. In the previous step, I created a user nameddan
. I assigned the new database (<database-name>
) todan
.createdb <database-name> -O dan
Basic Operations with psql
The postgresql
package comes with psql
, a PostgreSQL interactive terminal. psql
allows us to interactively work with our PostgreSQL databases – we can create, modify, and query databases from the psql
terminal. It also provides various tools for working with scripts and for automating commonly-used procedures.
We can access a given database, from bash, as follows:
psql -d <database-name> -U dan
psql
provides many useful tools that are not part of the SQL
language. A sampling of useful utilities include:
\help
: get help\c <database>
: connect to database\du
: list users and permissions\q
or<C-d>
: exit thepsql
shell\?
: list other meta-commands
Creating a Table
Tables can be created as follows (the below is written in SQL
and is not a psql
terminal command, though it can be entered directly into the psql
terminal).
CREATE TABLE <table-name> (
col1 varchar(80), -- variable-length character
col2 int, -- integer
col3 timestamp, -- date/time
col4 real, -- real number
col5 date -- date
);
Each line above includes the column name (e.g. col1
), the variable type (e.g. varchar(80)
), and a comment (text following --
). This basic syntax creates an empty table within our database.
Adding Some Data to the Table
There are numerous ways to enter data into the table. We can make use of the SQL
INSERT INTO
commands:
INSERT INTO <table-name> VALUES ('San Francisco', 46, '2014-11-17 01:22:50', 0.25, '1997-01-13');
This will fill in one row of data. It assumes we have positioned the data according to the order of the columns in the table. We can also specify exactly which columns we are filling (for example, if we don’t know the order):
INSERT INTO <table-name> (col1, col2, col3, col4, col5)
VALUES ('San Francisco', 46, '2014-11-17 01:22:50', 0.25, '1997-01-13');
In my case, I had some CSV
files I wanted to load into the database. This can be accomplished with the SQL
COPY
command or with the psql
\copy
command. I opted to use the \copy
approach, following a lot of advice I found from various sources online.
\copy <table-name>(col1, col2, col3, col4, col5) FROM '/path/to/file/' DELIMITER ',' CSV HEADER;
We specify the (already-created but as-of-yet unpopulated) table in our database to which we want to import the data first, and the file source second. By noting that the CSV
has a header, we tell the \copy
command to omit the first row.
Concluding Notes
The above represents my very first steps into the world of database design and SQL
. Much more will follow. I have set up several tables in this database have connected them with primary and foreign keys – a concept I will explore further later. I plan, in the near future, to figure out how to interact with this database using python
and/or R
; how to set up easily-reproducible queries to answer common questions of interest; and how, in general, to make the best use of the database in a data science and statistics workflow.