PG Casts

Working With Case Insensitive Data

Episode #9 hosted by Josh Branchaud

In this episode, we're going to be looking at considerations we need to make when working with case-insensitive data.

Postgres Version 9.5

Transcript

A good example of data that we may want to treat as case-insensitive are user emails. If our user is trying to login to our application with their email address, it shouldn't matter if they use all caps or all lowercase when typing their email. We should recognize them either way.

We already have a user table with a bunch of records.

create table users (
  id serial primary key,
  email varchar not null unique
);
insert into users (email)
select 'person' || num || '@example.com'
from generate_series(1,10000) as num;

Let's look at the description of the users table:

\d users

Let's see that we have some users in our table:

select count(*) from users;
select * from users limit 10;

So back to the scenario of looking up the record for a user that is trying to login.

Let's say they happened to capitalize the first letter of their email, so we aren't going to find their record.

select * from users where email = 'Person5000@example.com';

Let's wrap everything in the lower() function to make sure that we always get a consistent comparison of the emails:

select * from users where lower(email) = lower('Person5000@example.com');

Seems fine, but we've introduced a bit of a catch. We are no longer taking advantage of the index on our email column. We can see that by doing an explain analyze on the previous query:

explain analyze select * from users where lower(email) = lower('person5000@example.com');

Postgres is doing a full sequential scan now. The reason for this is because our one index on the email column is just for email on its own. We need a different index if we are going to be querying the email column frequently in conjunction with the lower() function.

Let's add one:

create index lower_email_idx on users (lower(email));

We can see the index by looking at the table description:

\d users

Let's try that select query from earlier to see that we are now getting the benefits of an index again:

explain analyze select * from users where lower(email) = lower('person5000@example.com');

As you can see, instead of a sequential scan, the query is able to do an index scan. This will be considerably faster. Sure, this will make writing to this table a tad slower, but the benefits here definitely outweigh the costs.

The lesson learned here is to always be cognizant of how you most often access your data to ensure that you aren't missing out on the benefits of an index.

That's it for this episode. Thanks for watching.