Hello! My name is Vidal Ekechukwu, and in this episode of PG Casts we’ll be using Lateral Joins to more efficiently aggregate columns. Ready? Let’s go.
Postgres Version 9.5
Let’s say we have a table of developers:
Each developer has an
id as its primary key and a username.
We also have a table of developer activities.
Looking at its structure here and at some dummy data, we’ll see that each activity references both a developer and an event type.
All you devs out there will quickly recognize that this is a short list of the actions you can perform on Github.
You can find details about how I set up the data in the show notes below, but for now, we’re going to dive right into dissecting them.
Let’s say for each developer, you want an array of his activities’ event types with a size no greater than five.
As you might expect, a simple join, coupled with a group by, and an
will get us mostly there. Let’s have a look.
select d.id, array_agg(da.event_type) activities from developers d join ( select event_type, developer_id from activities ) da on d.id=da.developer_id group by d.id order by d.id;
This is a good start, but as we said, we want at most five of each developer’s activities in the array. So how do we do that?
Now, you might think that you could just throw a
limit 5 clause in the join
subquery, call it a day, and go pour yourself a glass of lemonade. Oh but you’d
be soo WRONG!
Let’s see what happens if you try that.
select d.id, array_agg(da.event_type) activities from developers d join ( select event_type, developer_id from activities limit 5 ) da on d.id=da.developer_id group by d.id order by d.id;
Notice something fishy? Are your Postgres senses tingling? Well they should be. If you look closely, you’ll notice a couple things. First, we’re not getting back all the developers who returned an activities array in our first query. Secondly, of the developers that are coming back, we’re not getting anywhere near the desired activities count.
The reason this happens is because of the limit clause. As Postgres iterates
over each developer, it only looks at the first five of all activities and
returns a row only if the developer’s
id happens to match one of those five
What we really want Postgres to do is to apply the limit not to all activities,
but only activities whose
developer_id matches the current developer’s
instance, if we only cared about one developer, we could do something as simple
as adding a where clause to the join subquery like this:
select d.id, array_agg(da.event_type) activities from developers d join ( select event_type, developer_id from activities where developer_id=1 limit 5 ) da on d.id=da.developer_id group by d.id order by d.id;
This is great, but now we only get one row… for one developer. How do we do this for every developer?
Maybe we can change the where equivalence statement from
developer_id=d.id. Let’s see what happens:
select d.id, array_agg(da.event_type) activities from developers d join ( select event_type, developer_id from activities where developer_id=d.id limit 5 ) da on d.id=da.developer_id group by d.id order by d.id;
Egggghhhhhhhh. That didn’t work. Postgres yells at us for not having access to
d alias inside the join subquery.
This is where the
lateral keyword comes into play. Watch what happens when we
put it right before the join subquery.
select d.id, to_json(array_agg(da.event_type)) activities from developers d join lateral ( select event_type, developer_id from activities where developer_id=d.id limit 5 ) da on d.id=da.developer_id group by d.id order by d.id;
Tada!! We get what we want. We see a list of all the developers who have activities, and we see the ones that have up to five! But wait, what happened?
As it turns out, the
lateral keyword gives us access to the columns provided
by the preceding
from item, which, in this case, is the current developer. We
take that developer’s
id and use it to limit the activities before trying to
join the two tables.
There are some other clever ways to achieve this without using the
keyword. They may employ
common table expressions or
window functions, but
we won’t cover those in this episode. If you’re able to use these tools to
achieve the same result, hit us up, and we may include it in a future episode.
Until then, thanks for watching, and please leave feedback in the comments below.
-- Create developers table create table developers ( id integer primary key, username text not null ); -- Generate developers insert into developers (id, username) select dev_id, 'dev' || dev_id from generate_series(1, 10) as dev_id; -- Create event_types table create table event_types ( name text primary key ); -- Generate event types insert into event_types (name) values ('push'), ('pull'), ('fork'); -- Create developer activities table create table activities ( id serial primary key, developer_id integer not null references developers(id), event_type text not null references event_types(name) ); -- Generate developer activities with recursive random_activities (row_num, developer_id, event_type) as ( ( select 1, d.id, et.name from developers d cross join event_types et order by random() limit 1 ) union ( select row_num+1, d.id, et.name from random_activities cross join developers d cross join event_types et where row_num < 50 order by random() limit 1 ) ) insert into activities (developer_id, event_type) select random_activities.developer_id, random_activities.event_type from random_activities;