PostgreSQL

Created: 2015-08-11 12:17:29 -0700 Modified: 2023-12-17 16:20:55 -0800

Getting started

I wanted to test PostgreSQL out. I did the following steps:

Save this as compose.yml

version: '3.8'
services:
  db:
    image: postgres:16.1
    restart: always
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: bar
      POSTGRES_DB: foo
    volumes:
      - $HOME/tmp/postgres:/var/lib/postgresql/data
    ports:
      - 5432:5432

Run this command: docker compose -f ./compose.yml up
Install pgcli for a command-line client with autocomplete
Run this command to connect to the database: pgcli foo -p 5432 -h localhost -u postgres

Quick examples

Case-insensitive queries:
- select * from command_names where name ilike ('Addcom');
- select * from command_names where LOWER(name) = LOWER('Addcom');

Dumping the database

Use pg_dump (reference). To get the location on macOS, do brew info libpq, e.g. /opt/homebrew/Cellar/libpq/16.1_1/bin/pg_dump (you can add that to your PATH if you want (reference)). Example usage:

pg_dump -d postgres://postgres:password@localhost/dbname > dump.sql

Restoring from a backup: psql -h localhost -p 5432 -U postgres -d foo < dump.sql

If you already have the foo table:
- \c postgres (to switch away from foo so that you don’t get “cannot drop the currently open database”)
- drop database foo;
- create database foo

Old notes (from like 2015 😂)

========================================================================================

Simple syntax examples

Create table: CREATE TABLE movie(id SERIAL PRIMARY KEY, name VARCHAR(100) NOT NULL);

Insert row: INSERT INTO movie (name) VALUES(‘Good Will Hunting’);

The “(name)” is specifying that you’re only inserting based on the name, and then the value provide is the name. If inserting datetimes, use this format: ‘1970-01-01 00:00:00’

Delete row: DELETE FROM movie WHERE id = 1;

Delete everything from a table: TRUNCATE users;

Modifying tables (reference) (adding/removing columns, constraints, renaming, etc.)

Update row: UPDATE accounts SET balance = balance + 100.00, something_else = 5 WHERE acctnum = 12345;

Create index: CREATE INDEX user_region_index ON users(region);

========================================================================================

Basics

PostgreSQL is a client/server database, so unless someone has created an embedded version by now, you would have to install the server on any target machine you use. This could be done with a script.

To stop PostgreSQL from starting up on Windows, go to services.msc and set the startup to manual (reference). You can also stop the service to make the ~6-8 related executables stop immediately (even when “pg_ctl stop -m fast” does nothing).

Pronunciation: post gress Q L (I looked up the Wikipedia IPA!)

Connection strings are in this form:

postgres://username:password @localhost/database

The default username is “postgres”, and the default database created is also “postgres”, so your basic link ends up looking like this:

postgres://postgres:fakepassword @localhost/postgres

On Windows, Postgres binaries will attempt to run as your username if you don’t specify anything, so you should probably create your user with “createuser.exe”:

createuser -P -U postgres -s agd13_000

The “-P” is so that you can provide a password for the new user. If you don’t do this, I believe you’ll be forced to change the password via pgAdmin or “alter user”, because it doesn’t seem to allow a user with no password to connect. The password is hashed (and salted with the username or something) and saved directly into the database.

A hint for the password I’m using is 0131 134 4e

The “-U” is to connect with your existing “postgres” user. If you don’t know which user you already have made, use pgAdmin III to find out. You’ll need to type in the password for this user after typing the password for your new user.

The “-s” is to make your new user a super user. The only privilege this does NOT give you is “—replication”.

To create a database: “createdb testpg” (note: this calls createdb.exe from the install location).

The tutorial that they mention has to be obtained through source. Do “select version();” from psql to find your version, then go download the source here.

Never mind, don’t follow the tutorial. Building on Windows is very tough.

A Schema contains tables, data types, functions, and operators, and is itself sort of a namespace (e.g. you can have two tables named “mytable” as long as they’re in different Schemas).

Aggregate functions are things like “max”, “min”, and “count” (reference). WHERE selects input rows before groups and aggregates are computed, and HAVING selects group rows after groups and aggregates are computed. That’s why WHERE clauses cannot contain aggregate functions. Likewise, HAVING clauses almost always contain aggregate functions.

PostgreSQL provides views (reference), but they are for organization, not for performance (reference).

Unlike MongoDB, a table appears as soon as you create it; you don’t need to actually add rows.

PostgreSQL has reserved words (reference), so “create table user (id int);” will fail with “syntax error at or near user”.

If you have a table with defaults, you can insert into it without specifying any arguments by doing this:

CREATE TABLE made_up(has_default int DEFAULT 5);

INSERT INTO made_up DEFAULT VALUES;

There will now be a single row in made_up whose has_default is set to 5.

Table aliases:

SELECT u.id FROM users AS u WHERE u.id = 1;

Note: the “AS” is optional, so this is the same as SELECT u.id FROM users u WHERE u.id = 1;

Table aliases are only necessary when joining a table to itself (or in some subqueries), but they’re convenient.

Column aliases:

SELECT id AS user_id, name AS user_name FROM users WHERE id = 1;

user_id | user_name

---------+-----------

1 | Adam

(1 row)

The PostgreSQL documentation suggests that you always use “AS” for column aliases (reference).

There is a lot of good information in the indexes chapter (reference). For example, you can make an index on (x,y), but then if you frequently do searches using operators on just ‘y’, you won’t be able to make full use of the index. The query planner figures out whether to even use an index at all, as sometimes a sequential search will be faster.

Use partial indexes when you want to ignore common values (e.g. IP address access from ‘192.168..’) that account for more than a few percent of the total number of rows (reference).

You should use EXPLAIN/ANALYZE to figure out which indexes are actually being used (reference). You can add an index in realtime, but doing so is discouraged since it prevents writes by default (see CONCURRENTLY for how to still allow writes).

Always test your indexes with something like 1000 rows out of 100000 rows (of real world data, not just test data). Fetching 1 row out of 100 will probably ignore the index entirely.

You can SELECT “FOR” something, which is called a “locking clause” (reference). E.g. “SELECT FOR UPDATE” causes rows retrieved by SELECT to be locked as though for update. This prevents them from being locked, modified, or deleted by other transactions until the current transaction ends. It does NOT prevent querying, only writing/locking! Also, apparently it may not even ensure that concurrent updates/deletes don’t happen with a selected row unless you actually update the row (reference).

Foreign and Primary keys

You can have multiple foreign keys in a table. For example, if you had a really naïve friend list that could contain exactly three friends, you would do this:

CREATE TABLE users (

id int primary key,

friend1 int references users(id),

friend2 int references users(id),

friend3 int references users(id)

);

Also, your primary key or your foreign key could refer to a group of columns:

CREATE TABLE example (

a integer,

b integer,

c integer,

PRIMARY KEY (a, c)

);

Let’s say you delete a row in Table X that was acting as a foreign key in Table Y (reference). For example, you delete a Movie row despite that some people had that listed as their favorite movie in the movie_preferences table. You could mark the foreign key as “ON DELETE RESTRICT” which wouldn’t let you delete the movie, or “ON DELETE CASCADE” which would delete the preference row in addition to deleting the movie. “NO ACTION” (which is the default) will simply raise an error when you delete the movie (in MySQL, “NO ACTION” and “RESTRICT” are actually the same, not sure about PostgreSQL).

Example of NO ACTION:

testpg=# DELETE FROM movie WHERE id=1;

ERROR: update or delete on table “movie” violates foreign key constraint “prefs_movie_id_fkey” on table “prefs”

DETAIL: Key (id)=(1) is still referenced from table “prefs”.

========================================================================================

Transaction isolation

There are three phenomena that could be caused by concurrent transactions (and the Wikipedia article is GREAT for this; keep in mind that PostgreSQL uses Multiversion Concurrency Control):

Dirty read: a transaction reads data written by a concurrent uncommitted transaction.
- This is never possible in PostgreSQL, as the Read-Uncommitted isolation level actually maps to Read-Committed, which guarantees that you won’t get dirty reads.
Nonrepeatable read: a transaction re-reads data it has previously read and finds that data has been modified by another transaction (that committed since the initial read).
- With a little help from StackOverflow (reference), here’s how this can happen:
  - BEGIN; — this is the same as BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED, because that’s what PostgreSQL defaults to.
  - SELECT name FROM users WHERE id=1; — “Sam”
  - — … other stuff perhaps…
  - — At some point, another transaction commits a change to the ‘users’ table.
  - — We rerun the original query within this transaction:
  - SELECT name FROM users WHERE id=1; — “Samantha”. Despite being the same query, a row was modified! This is a nonrepeatable read.
  - COMMIT;
Phantom read: (note: going to paste a quote and the explanation after; the quote makes no sense when you first read it, but it is precise): “a transaction re-executes a query returning a set of rows that satisfy a search condition and finds that the set of rows satisfying the condition has changed due to another recently-committed transaction.” An explanation as to how this differs from non-repeatable reads: the rows you’re getting back will have the same values as before, but now different rows are being selected (because the concurrent transaction deleted/inserted rows). Here’s how this can happen:
- BEGIN;
- SELECT year_of_earthquake FROM nat_disaster WHERE country=‘US’; — Returns 1970, 1975, 1990
- — At this point, another transaction adds in a new row.
- — We rerun the original query within this transaction:
- SELECT year_of_earthquake FROM nat_disaster WHERE country=‘US’; — Returns 1970, 1975, 1990, 2015. We got the same values as before but with an addition! This is a phantom read.

Here’s a useless conclusion that I drew from the above: nonrepeatable reads and phantom reads can appear to be functionally identical in this case:

Nonrepeatable read: “Sam” becomes “Samantha”
Phantom read: delete “Sam”, add in “Samantha” with the same ID.
- (you would never do this in a real database)

Here are the “four” isolation levels for PostgreSQL (“read uncommitted” is internally the same as “read committed”, so there are really only three):

[TABLE]

Remember that Read Committed is on by default, which means that the same query performed multiple times in Transaction X can see changes, insertions, and deletions from concurrent transactions committed while Transaction X is running.

Note: you can test the below “problems” by just running two instances of PSQL on your machine. You don’t even need to be on two different user accounts. Let’s say you run Transaction 1 below and then run Transaction 2 when the comment says to. T2 won’t actually finish until T1 is committed. If you go into pgAdmin —> your database —> catalogs —> pg_catalog —> views —> pg_locks, you’ll see that T1 acquires a RowExclusiveLock and two ExclusiveLocks. Then, T2 requests ExclusiveLock, ShareLock, and AccessExclusiveLock, and it obviously can’t get them because T1 has at least one of them.

— Transaction 1

BEGIN;

UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 12345;

UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 7534; — Transaction 2 is now run.

COMMIT;

— Transaction 2

BEGIN;

DELETE FROM accounts WHERE balance = 100;

COMMIT;

With Read-Committed isolation (which is the default for transactions (reference)), Transaction 1 can be run however many times you want concurrently and you’ll never run into cases where you create/destroy “money” as you would in typical threading scenarios. This is because each Read-Committed transaction only sees data committed before the “BEGIN;” statement; it never sees uncommitted data, and it never sees data committed by concurrent transactions.

========================================================================================

Good database practices

Every table should have a primary key (reference).

Use NOT NULL wherever you can (reference). Primary keys don’t need to be marked as NOT NULL because that’s implied. Foreign keys, however, can have NULL, which means it’s up to you to decide if you want to make it NOT NULL.

You should supply a name with constraints so that if they end up being violated, you can know which constraint it was (reference).

CREATE TABLE products (

product_no integer,

name text,

price numeric CONSTRAINT positive_price CHECK (price > 0)

);

OIDs should not be assumed to be unique. In large databases, they can wrap.

Always index columns that you plan on searching on (reference). Indexes take up disk space (which means updating a row in the table will update the index), which usually means it costs more real-life money to index a table if you’re on a cloud platform like AWS or Azure. So don’t add indexes willy-nilly; identify where you need them and use them.

To make JOINs faster, make sure the columns that you’re joining on are indexed.

If you are expecting a unique row (e.g. get a user with id==571), then put “LIMIT 1” in your query to speed it up (reference).

SELECT id,name FROM users WHERE id = 1 LIMIT 1;

If you’re NOT getting a unique row, then this should not be done without “ORDER BY”, otherwise the results you get will be unpredictable.

Use the smallest data types possible for each column. In paticular, with PostgreSQL:

Character vs. varchar vs. text: there is no real performance difference among these three types in PostgreSQL (reference). “In most situations text or varchar should be used instead [of character]“. Varchar(80) will NOT use 80 bytes (plus some overhead) when storing the string “Adam”.

Performance tips:

When doing lots of commits (e.g. to populate a database for the first time), issue a BEGIN, then do the INSERT commands, then issue a COMMIT (reference). This is better than just issues tons of INSERT commands. Even better is to just use COPY to read from a file to populate a database.
Creating an index of a pre-existing table is faster than updating the index each time a row is added, so if you’re populating a table for the first time, remove the indexes first, then add them.
- Likewise, do this for foreign keys too.

========================================================================================

PSQL

”psql” is the shell for PostgreSQL. You can specify a database as you start it up: “psql testpg”.

The prompt itself is very important. You’ll usually see something like “dbname=#“. The ”#” means you’re a superuser (reference).

A ”=” means that there is currently no context in the shell; you are free to type any command.
A ”-” means that you’re typing in a command and haven’t entered the semicolon yet. E.g. if you type “select * from users”, you will see the prompt change to a hyphen.
A ”(” means that you have one or more open parentheses somewhere. This symbol will never change until you close all parentheses.

If you don’t see capitalized text when you finish an operation (e.g. “CREATE TABLE” when creating a table), then you should check your prompt! Odds are it shows a ”-” because you haven’t put a semicolon, or maybe you have a ”(” there because you didn’t close your parens.

Semicolons are not implicit like in JavaScript! If a command says to type “SELECT version();” then you’d better add the semicolon! This is because you may want a “CREATE TABLE” command or something to span multiple rows to make it easier to type/read.

Key words are case-insensitive, although you should always capitalize them so that they stand out from other identifiers.

”l” will list all databases.

”dt” will list all tables in the current database. Remember: something became the current database because you connected to it, probably via “psql dbname” when you started psql. You can switch databases with “connect dbname” (reference).

========================================================================================

Install (reference)

This includes pgAdmin III, which you run just like any other program on Windows (Windows key —> type “pgadmin”).

I may not have HAD to do this, but I added “C:Program FilesPostgreSQL9.4bin” (or usr/local/pgsql/bin) to my PATH so that I could call “createdb” from the command line.

========================================================================================

Node interface

Connection pool example:

var conString = “postgres://postgres:fakepassword @localhost/postgres”;

//this initializes a connection pool

//it will keep idle connections open for a (configurable) 30 seconds

//and set a limit of 20 (also configurable)

pg.connect(conString, function(err, client, done) {

if(err) {

return console.error(‘error fetching client from pool’, err);

}

client.query(‘SELECT $1::int AS number’, [‘1’], function(err, result) {

//call done() to release the client back to the pool

done();

if(err) {

return console.error(‘error running query’, err);

}

console.log(result.rows[0].number);

//output: 1

});

Explanation:

“pg.connect” - if there is a ready client in the pool, it will use that. If not and the pool is not full, it will create a new connection. If the pool is full, it will wait for a Client to be available. Your callback will only be called when the Client is ready to be used (or if an error is encountered).

Rationale:

I looked through the code and documentation. Node-postgres is backed by generic-pool, and its acquire function says this (typos are theirs): “The callback will be called, when a new client will be availabe, passing the client to it.”

Here’s why this is important: this lets you have many different queries running at once without knowing whether they’re actually connected to the server. For example, suppose you wanted to query the User table 1000 times. If you just made a new pg.Client and went at it, it would do them all sequentially. With a pool, you can process up to 20 (see pg.defaults) at a time. The default max_connections to a PostgreSQL server is 100, so you definitely want to pool connections on the client.

As shown in the connection-pool example, you can submit a query with parameterization (reference):

query(‘select name from emp where emp_id=$1’, [123]);

This protects against SQL injections, but it’s slower than a plain query. You can get some of the speed back with prepared queries:

query({

name: ‘emp_name saved query’,

text: ‘select name from emp where emp_id=$1’,

values: [123]

});

Instead of cramming everything into the callback for query(), you can instead use events for ‘row’, ‘error’ and ‘end’ (reference):

var query = client.query(‘SELECT name, age as user_age FROM users’);

query.on(‘row’, function(row, result) {

result.addRow(row);

});

query.on(‘error’, function(error) {

//handle the error

});

query.on(‘end’, function(result) {

//fired once and only once, after the last row has been returned and after all ‘row’ events are emitted

//in this example, the ‘rows’ array now contains an ordered set of all the rows which we received from postgres

console.log(result.rowCount + ’ rows were received’);

})

Those are the most important details of the Node interface.k

========================================================================================

pgAdmin III

To change your password, go to Server Groups —> Servers —> (your server) -=-> Login Roles —> click a name —> File —> Change Password…

To view tables, you really have to drill down. Server Groups —> Servers —> (your server) —> Databases —> (your database) —> Schemas —> public —> Tables —> (your table) —> click the button in the top bar that looks like a table.

There’s also a table-with-filter icon that you can press, then type any filter like “temp_lo < 45”. You can validate this filter before executing it.

========================================================================================

Comparison to MySQL

EveryJuan: Adam13531, PostgreSQL has a single node system with a async replication, which would have a ~5sec lag, so not good for real time, MySQL can have sync replica nodes, or better yet, master-master systems through stuff like percona and galera, we use percona in production and love it, they have async alter statements, don’t have to wait for migrations in production

Community edition of MySQL is free, as is Percona. Same license as PostgreSQL.