Postgres and the tale of the changing defaults
It is not often i have an issue with Postgres, Sure i don't like some of the naming or how it goes about things certain thinks. But the one thing i do love about it is release to release you are generally not going to have problems. Imagine my surprise when a "tested as working" setup in staging failed when deployed to production.
We recently installed Postgres 9.6 side by side Postgres 9.2 on a testing machine to allow us to ensure the codebase worked on newer versions. This is relatively easy to set up with Postgres and Debian and required little more than configuring Postgres to use a diffrent port. THis worked out well as if there ended up being an issue, we where no worse off than a simple uninstall.
Things worked out nicely, The data copied, The code tested and confirmed as working and no real reported issues. With everyone happy we proceeded to upgrade the production stack to the newer Postgres. The only real difference was that we would upgrade production to 9.6 rather than have it run side by side.
This was a straight forward migration. A dump was performed, The database package upgraded and backups restored. People were asked to check and indicated that all seemed fine. After a nice nights sleep and a casual thought to the great uneventful work completed the previous day, We returned to work to discover messages in our inbox with big scare inducing words such as "emergency" and "database" in them. While a suitable tuned spidy sense could have picked up that there could be an problem, The regulation issued sysadmin one sufficed.
As one might imagine, Production was not checked as thoroughly as one would like. In fact all interactive checking had been against pages that would have never hit the database (eg the front login page). While i always wish people would have a comprehensive plan to check the site is working, In practice i have only ever encountered such a thing a handful of times.
After checking the error logs (It mystifies me how even these days, developers are not aware that their error logging goes somewhere that can be retrieved) I discovered numerous authentication issues of the nature:
$FOO could not log into $BAR incorrect peer credentials
one intresting thing about this setup is that the usernames to log in don't
match the system usernames. What i was expecting was
web$FOO could not log
into $BAR. Postgres, Like mysql can be accessed via a unix socket. When using
this mode Postgres can "see" the username of the person connecting via socket
calls to the kernel. This is great as they are unforgable but don't work as
well when your system and db usernames don't line up. In
pg_hba.conf this is
normally referenced as "peer" authentication (hopefully some of the parts are
falling into place now).
Resetting passwords on the off chance they where not migrated correctly and attempting to replicate the problem via the psql cmdline did not help in any way. everything checked out with psql when manually setting up the credentials.
At this point after reviewing
pg_hba.conf it became apparent that the Unix
socket was being used, Not the MD5 passwords over localhost like i would
expect. As the client code should not have changed (an assumption on my part
but one i could easily verify as i went) i could only assume that something
was not being explicitly set and a default had changed. A very quick search of
the Postgres 9.5 and 9.6 change logs did not seem to indicate anything
In an attempt to sort the issue out a quick grep of the src code was done to
look for a "localhost" string to locate the DB settings. This returned zero
results and started to confirm my suspicions that this was a default behaviour
change. After further refining of the search i managed to locate some oddly
named settings that configured how the database was set up. It appears that
everything was specified including the port. The only omission was that the
host= field to set the path or address of the remote server was not present.
Much to my dismay it appears that the code had no config variable to set the hostname of the remote server in most of the code bases. Luckily someone had hacked that functionality into one of the affected sites and i was able to set the host and confirm this fixed the issue. A simple cut and paste, Then a search replace of Port to Host and the site was back in business and no longer throwing authentication issues.
After the site was up, People rung and emails sent (this part taking significantly longer than diagnosing and fixing the issues), I had a chance to poke the Postgres 9.6 changelog with greater insight as to which entry i may have been affected by. One obvious choice stuck out:
* Improve libpq's PQhost() function to return useful data for default Unix-socket connections (Tom Lane) Previously it would return NULL if no explicit host specification had been given; now it returns the default socket directory path.
TL;DR: Postgres's client library changed how it handles the default connection string, Causing our code to switch from connecting via IPv4 to connecting via a Unix socket and peer auth. As the user/database are different to the system user, Connection attempts by the code base received an authentication error. The fix was to specify "localhost" explicitly.
After reviewing the change notes it became obvious what had happened, However looking at the wording of the changelog entry it is no surprise that this was missed at my end. I don't have any advice to offer here other than this is an intresting tale from the trenches that others may find entertaining. Note that i was not responsible for the migration, Only the fixing.
If you want to see more of these then let me know via twitter @Blitz_Works
The above is a roughly remembered summary of the situation, Fixing everything when it went bad took approximately 15 minutes form start to finish. Some details omitted to protect the innocent. If pain persists for more than 4 hours, Please see a senior sysadmin.