PostgreSQL UPDATE Gotcha

Background

I recently came across an interesting PostgreSQL “gotcha” at work while writing a script to resolve a data discrepancy in a distributed system. To illustrate how I stumbled upon it, I’ll pose a hypothetical system and scenario that mirror my own experience.

System Design Overview

Scenario

Suppose I work on an E-Commerce Checkout system, and that I recently built a feature that lets an online shopper merge all their duplicate shipping addresses into a single preferred shipping address. From a technical standpoint, suppose I’m working in a distributed system where the shipping address data lives in one service, and the data for merging possible duplicate shipping addresses lives in another.

Address Service

This service acts as the source of truth for all address-related data. Clients query its API or consume events emitted by this service to learn things like a customer’s billing address or shipping address options. For the purposes of this post, I focus solely on shipping address data.

DB

addressdb> create table shipping_addresses(
    id       varchar primary key,
    location varchar -- For the sake of simplicity; e.g. '17 Irving Pl, New York, NY 10003'
);

Entity Merge Service

This service deals exclusively with deduplicating entities like customer profiles, addresses, and items across the system.

entitymergedb> create table duplicate_shipping_addresses(
    id  varchar primary key,
    shipping_address_ids jsonb
);

The Problem

Given the following example data set:

addressdb> select * from shipping_addresses;
  id  |               location
------+---------------------------------------
 sa-1 | 2362 Steinway St, Astoria, NY 11105
 sa-2 | 23-62 Steinway St, Astoria, NY 11105
 sa-3 | 23-62 Steinway St, Queens, NY 11105
 sa-4 | 23-62 Steinway St, New York City, NY 11105
 sa-5 | 2362 Steinway St, New York City, NY 11105
(5 rows)
--
entitymergedb> select * from duplicate_shipping_addresses;
  id   |           shipping_address_ids
-------+------------------------------------------
 dsa-1 | ["sa-1", "sa-2", "sa-3", "sa-4", "sa-5"]
(1 row)
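
For anyone who wants to follow along locally, the example data set above can be seeded with a couple of INSERTs (a minimal sketch against the two schemas defined earlier):

addressdb> insert into shipping_addresses(id, location) values
    ('sa-1', '2362 Steinway St, Astoria, NY 11105'),
    ('sa-2', '23-62 Steinway St, Astoria, NY 11105'),
    ('sa-3', '23-62 Steinway St, Queens, NY 11105'),
    ('sa-4', '23-62 Steinway St, New York City, NY 11105'),
    ('sa-5', '2362 Steinway St, New York City, NY 11105');
--
entitymergedb> insert into duplicate_shipping_addresses(id, shipping_address_ids) values
    ('dsa-1', '["sa-1", "sa-2", "sa-3", "sa-4", "sa-5"]'::jsonb);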

Let’s say Shipping Addresses with IDs sa-1, sa-3, and sa-5 were deleted in the Address Service, but for some reason the deletions didn’t properly propagate to the Entity Merge Service. My task was to resolve the discrepancy between these two services. I began by writing a one-off script to remove the orphaned IDs from the duplicate_shipping_addresses table.

First Solution

My initial solution to this problem was an iterative approach: for every deleted ID, if it exists in duplicate_shipping_addresses.shipping_address_ids, remove it from that record’s array, one at a time. In SQL, this looks like:

with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
)
update duplicate_shipping_addresses dap
set shipping_address_ids = dap.shipping_address_ids - dsa.id
from deleted_shipping_addresses dsa
where dap.shipping_address_ids ? dsa.id;

This approach would make some sense if it were written in a language that has procedural capabilities, like Python:

deleted_shipping_address_ids = ["sa-1", "sa-3", "sa-5"]
duplicate_shipping_addresses = read_all()

# Pair every deleted ID with every record, i.e. the same cross join the SQL produces.
cartesian_product = [(deleted_id, duplicate_shipping_address_record)
    for deleted_id in deleted_shipping_address_ids
    for duplicate_shipping_address_record in duplicate_shipping_addresses
]

for (deleted_id, duplicate_shipping_address_record) in cartesian_product:
    if deleted_id in duplicate_shipping_address_record.shipping_address_ids:
        duplicate_shipping_address_record.shipping_address_ids.remove(deleted_id)

update_all(duplicate_shipping_addresses)

However, since SQL is generally a declarative language, the way updates are orchestrated by the query planner was different from what I had expected.

💡 Before reading on, pause for a second to guess what happens when the above SQL query is run on the example data set.

Answer

For this example, the join on record dsa-1 produces three output rows, which can be imagined as:

entitymergedb> with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
)
select dap.shipping_address_ids, dsa.id deleted_id, dap.shipping_address_ids - dsa.id resultant
from deleted_shipping_addresses dsa, duplicate_shipping_addresses dap
where dap.shipping_address_ids ? dsa.id;

           shipping_address_ids           | deleted_id |            resultant
------------------------------------------+------------+----------------------------------
 ["sa-1", "sa-2", "sa-3", "sa-4", "sa-5"] | "sa-1"     | ["sa-2", "sa-3", "sa-4", "sa-5"]
 ["sa-1", "sa-2", "sa-3", "sa-4", "sa-5"] | "sa-3"     | ["sa-1", "sa-2", "sa-4", "sa-5"]
 ["sa-1", "sa-2", "sa-3", "sa-4", "sa-5"] | "sa-5"     | ["sa-1", "sa-2", "sa-3", "sa-4"]
(3 rows)

You might expect each output row to remove an ID from shipping_address_ids. In practice, only a single output row is applied by the update, and it is non-deterministic which output row the planner chooses.

Watch what happens if I run the script multiple times in succession:

-- ####################
-- ### First update ###
-- ####################
entitymergedb> select * from duplicate_shipping_addresses;
  id   |           shipping_address_ids
-------+------------------------------------------
 dsa-1 | ["sa-1", "sa-2", "sa-3", "sa-4", "sa-5"]
(1 row)

entitymergedb> with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
)
update duplicate_shipping_addresses dap
set shipping_address_ids = dap.shipping_address_ids - dsa.id
from deleted_shipping_addresses dsa
where dap.shipping_address_ids ? dsa.id;

entitymergedb> select * from duplicate_shipping_addresses;
  id   |       shipping_address_ids
-------+----------------------------------
 dsa-1 | ["sa-2", "sa-3", "sa-4", "sa-5"]
(1 row)

-- #####################
-- ### Second update ###
-- #####################
entitymergedb> with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
)
update duplicate_shipping_addresses dap
set shipping_address_ids = dap.shipping_address_ids - dsa.id
from deleted_shipping_addresses dsa
where dap.shipping_address_ids ? dsa.id;

entitymergedb> select * from duplicate_shipping_addresses;
  id   |   shipping_address_ids
-------+--------------------------
 dsa-1 | ["sa-2", "sa-4", "sa-5"]
(1 row)

-- #####################
-- ### Third update ####
-- #####################
entitymergedb> with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
)
update duplicate_shipping_addresses dap
set shipping_address_ids = dap.shipping_address_ids - dsa.id
from deleted_shipping_addresses dsa
where dap.shipping_address_ids ? dsa.id;

entitymergedb> select * from duplicate_shipping_addresses;
  id   | shipping_address_ids
-------+----------------------
 dsa-1 | ["sa-2", "sa-4"]
(1 row)

-- ##############################
-- ### Fourth update (no-op) ####
-- ##############################
entitymergedb> with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
)
update duplicate_shipping_addresses dap
set shipping_address_ids = dap.shipping_address_ids - dsa.id
from deleted_shipping_addresses dsa
where dap.shipping_address_ids ? dsa.id;

entitymergedb> select * from duplicate_shipping_addresses;
  id   | shipping_address_ids
-------+----------------------
 dsa-1 | ["sa-2", "sa-4"]
(1 row)

The resulting behavior is that only one output row is applied per invocation. Why does this happen? While the Postgres documentation doesn’t give a reason, the UPDATE documentation does specify that this behavior is intentional:

📄 When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_item list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn’t join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.

Even though this solution eventually solves the problem, it isn’t practical to run the script an indeterminate number of times until all orphaned IDs have been removed.
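
For completeness: the iterative semantics I had in mind can be recovered in Postgres itself with a procedural PL/pgSQL block that issues one UPDATE per deleted ID, so each target row is matched at most once per statement. Here’s a minimal sketch (not the route I take below, since a single set-based statement is cleaner):

do $$
declare
    deleted_id text;
begin
    -- One UPDATE per deleted ID; jsonb - text removes that element from the array.
    foreach deleted_id in array array['sa-1', 'sa-3', 'sa-5'] loop
        update duplicate_shipping_addresses
        set shipping_address_ids = shipping_address_ids - deleted_id
        where shipping_address_ids ? deleted_id;
    end loop;
end $$;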

Second Solution

Since my initial solution wasn’t viable, my next attempt separated the logic into two steps:

  1. Map each Duplicate Shipping Address ID to the array of deleted Shipping Address IDs that should be removed from that row.
  2. Update each row’s Shipping Address IDs in the Duplicate Shipping Addresses table to be the difference between what’s presently in that row and the mapping from the previous step.

In SQL, this looks like:

with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
), duplicate_id_to_address_ids as (
    select pairs.duplicate_shipping_address_id, array_agg(pairs.shipping_address_id) address_ids from (
        select id duplicate_shipping_address_id, jsonb_array_elements_text(shipping_address_ids) shipping_address_id from duplicate_shipping_addresses
    ) pairs
    join deleted_shipping_addresses d on d.id = pairs.shipping_address_id
    group by pairs.duplicate_shipping_address_id
)
update duplicate_shipping_addresses dsa
set shipping_address_ids = dsa.shipping_address_ids - dsai.address_ids
from duplicate_id_to_address_ids dsai
where dsa.id = dsai.duplicate_shipping_address_id;

Below is an example run of this query to demonstrate:

entitymergedb> select * from duplicate_shipping_addresses;
  id   |           shipping_address_ids
-------+------------------------------------------
 dsa-1 | ["sa-1", "sa-2", "sa-3", "sa-4", "sa-5"]
(1 row)

entitymergedb> with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
), duplicate_id_to_address_ids as (
    select pairs.duplicate_shipping_address_id, 
          array_agg(pairs.shipping_address_id) address_ids 
    from (
      select  id duplicate_shipping_address_id, 
              jsonb_array_elements_text(shipping_address_ids) shipping_address_id 
      from duplicate_shipping_addresses
    ) pairs
    join deleted_shipping_addresses d 
      on d.id = pairs.shipping_address_id
    group by pairs.duplicate_shipping_address_id
)
update duplicate_shipping_addresses dsa
set shipping_address_ids = dsa.shipping_address_ids - dsai.address_ids
from duplicate_id_to_address_ids dsai
where dsa.id = dsai.duplicate_shipping_address_id;
UPDATE 1

entitymergedb> select * from duplicate_shipping_addresses;
  id   | shipping_address_ids
-------+----------------------
 dsa-1 | ["sa-2", "sa-4"]
(1 row)

Additionally, unlike the first solution, this script is idempotent: in a vacuum, once the IDs have been removed, additional runs of this script will not alter the state of the table.
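
One way to verify both the cleanup and the idempotency claim is a SELECT that hunts for any remaining orphaned IDs, reusing the same CTE (a sketch; it should return zero rows after the first run):

-- Sanity check: returns a row only if an orphaned ID is still present.
with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
)
select dsa.id
from duplicate_shipping_addresses dsa
join deleted_shipping_addresses d on dsa.shipping_address_ids ? d.id;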

Closing Thoughts

Some takeaways from this experience are:

  1. Always write a SELECT statement representation of your UPDATE statement to verify that the correct rows are returned.
  2. Identify whether a script should be idempotent, and if so, ensure that only the first invocation of the script results in changes to the DB.
  3. Write tests to verify that the script has actually left the DB in a state that you expected.
  4. And lastly, ensure that an UPDATE isn’t attempting to update the same row multiple times. You can check this by following step 1 and asserting that the join returns at most one row per target row; see the sketch below.
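
To make that last point concrete, here’s a sketch of such a check against the example schema: run the join from your UPDATE as a grouped SELECT, and treat any returned row as a signal that the UPDATE would silently discard some of its own work:

-- Flags any target row that would join to more than one source row.
with deleted_shipping_addresses as (
    select id from (values('sa-1'),('sa-3'),('sa-5')) a(id)
)
select dap.id, count(*) as join_matches
from duplicate_shipping_addresses dap
join deleted_shipping_addresses dsa on dap.shipping_address_ids ? dsa.id
group by dap.id
having count(*) > 1;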