Tech stories

Efficient Data Cleanup in Partitioned PostgreSQL Tables using Common Table Expressions

By Dwarka Rao, Database Engineer & Aakash Agarwal, Java Software Engineer.

March 21, 2024 · 10 minutes

Effective data management is critical in PostgreSQL, especially when dealing with sizable datasets. This is true for all RDBMS engines. Partitioned tables offer a structured approach to organising data. Still, the challenge arises when it comes to efficiently cleaning up data within these partitions. In this blog post, we'll explore a method that utilises Common Table Expressions (CTEs) to streamline scanning and deleting relevant data in partitioned PostgreSQL tables.

Using Common Table Expressions (CTEs) as a crucial cleanup tool

Efficiently cleaning up data within partitioned tables in PostgreSQL introduces several intricate challenges that demand careful consideration. Identifying specific data for deletion requires a sophisticated approach, and Common Table Expressions (CTEs) emerge as a crucial tool in this context; they provide a structured mechanism to traverse the partition hierarchy and pinpoint relevant data. As the cleanup operation unfolds, the potential impact on performance looms large, given the distributed nature of data within partitioned tables. Traditional DELETE operations can be resource-intensive, but one can mitigate the impact on overall performance by incorporating optimization techniques such as batch processing and leveraging indexes in WHERE conditions. Also, maintaining consistency across partitions requires a meticulous balance, emphasising the importance of atomic transactions and well-designed CTEs to uphold database integrity.

Our Story

In a recent endeavour, we encountered a challenge while aiming to remove a substantial volume of data from numerous partitioned tables. The task posed several complexities: since each partitioned table approached a terabyte in size, opting for an immediate data deletion strategy within the partitions would have been careless. Potential restoration complications, especially during validation errors, were also daunting. Additionally, devising a precise logic to pinpoint the data eligible for deletion introduced an extra layer of complexity. Balancing the need for a thorough cleanup against the risk of irreversible actions became critical in tackling this challenge.

In the following sections, we’ll share details of the particular use case of this challenge, as well as our approach and implementation.

Use Case

We structured our table with weekly partitions and housed Entity event overviews in an append-only format. These overviews functioned as consolidated snapshots of past events for a specific entity, each version distinguished by the stateid column. Here is the parent table’s structure for your reference:
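A simplified sketch of such a parent table follows; apart from eventid and stateid, the column names and types are assumptions for illustration:

```sql
CREATE TABLE entity_event_overview (
    eventid    BIGINT      NOT NULL,  -- identifies the entity/event the overview belongs to
    stateid    BIGINT      NOT NULL,  -- distinguishes each version of the overview
    overview   JSONB       NOT NULL,  -- the consolidated snapshot itself (assumed payload type)
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
```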

However, Postgres’ limitations made things more complex when we attempted to update the same event overview: updates triggered VACUUM and ANALYZE activity, which presented complications for our sizable, high-throughput table. As a result, we decided against modifying existing overviews. Instead, we read from partitions to retrieve the latest event overview using the eventId and stateId indexes. During write operations, we fetched the last overview, calculated the final version, and stored it in the latest weekly partition by inheritance. Below is what the child tables look like:
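Sketched under the same assumptions, a weekly child partition attached via inheritance, with an index on eventid and stateid for look-ups, could look like this:

```sql
-- Hypothetical weekly child partition; the boundary dates and index are illustrative.
CREATE TABLE entity_event_overview_2024_w12 (
    CHECK (created_at >= DATE '2024-03-18' AND created_at < DATE '2024-03-25')
) INHERITS (entity_event_overview);

CREATE INDEX ON entity_event_overview_2024_w12 (eventid, stateid);
```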

This decision was strategic; it ensured that all prior iterations would be obsolete and never used again once we generated a new version. When we recognised that the sheer volume of data in those partitions had surpassed practical limits, we had to switch our focus to efficiency. This focus prompted us to streamline the database by eliminating redundant and outdated records.

Approach

The new challenge was cleaning up data across nearly a hundred partitions. So, we devised an automated approach whose logic we could trust to identify deletable data, and which would also be reversible if validation raised red flags. Split into phases, the task covered the concepts and processes below, in no particular order:

  • Identifying deletable data

  • Performance considerations

  • Maintaining database consistency

  • Testing and Validation

Let’s expound on each of them:

Identifying deletable data

All the tables we wanted to clean up had dependencies on other tables that helped us identify deletable data, so using CTEs made absolute sense. Common Table Expressions (CTEs) provide a powerful tool for creating temporary result sets within a query. By leveraging CTEs, we can construct a recursive query that traverses the partitions and identifies the data requiring deletion. CTEs also play a pivotal role in streamlining the identification and filtering of deletable data when dealing with complex relationships across multiple partitioned tables: these temporary result sets provide a concise and organised way to articulate the logic required for identifying data for deletion.

Consider this example below involving two CTEs in an ecommerce platform with tables for customers, orders, and products. The goal is to retrieve a report showing each customer’s top-selling products based on the total amount spent:
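A sketch of such a query, under assumed table layouts (orders carrying customer_id, product_id, and quantity; products carrying id, name, and price; customers carrying id and name):

```sql
-- First CTE: total spend per customer and product.
WITH customer_product_totals AS (
    SELECT o.customer_id,
           o.product_id,
           SUM(o.quantity * p.price) AS total_spent
    FROM   orders o
    JOIN   products p ON p.id = o.product_id
    GROUP  BY o.customer_id, o.product_id
),
-- Second CTE: rank each customer's products by spend.
ranked_products AS (
    SELECT cpt.*,
           RANK() OVER (PARTITION BY cpt.customer_id
                        ORDER BY cpt.total_spent DESC) AS rnk
    FROM   customer_product_totals cpt
)
SELECT c.name AS customer,
       p.name AS top_product,
       rp.total_spent
FROM   ranked_products rp
JOIN   customers c ON c.id = rp.customer_id
JOIN   products  p ON p.id = rp.product_id
WHERE  rp.rnk = 1
ORDER  BY c.name;
```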

Performance Considerations

One must carefully consider performance when delving into the intricate task of cleaning up data within partitioned tables. The sheer size of partitioned tables, often approaching terabytes, can pose a challenge regarding the performance impact of deletion operations. Optimization techniques are crucial in addressing this; they ensure the cleanup process is both effective and resource-efficient.

One key optimization strategy is the implementation of batch processing. Rather than executing a single, resource-intensive DELETE operation on the entire dataset, breaking the operation into smaller, manageable batches can significantly mitigate the impact on system resources. This process prevents potential bottlenecks and allows for better monitoring and control over the deletion process.

Furthermore, leveraging the appropriate columns in WHERE clauses optimises performance. Making the most of Primary Keys and Indexes ensures that the database engine can quickly locate and process the relevant records for deletion.
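A minimal sketch combining both ideas, assuming PostgreSQL 11+ (for transaction control in DO blocks, run outside an explicit transaction) and placeholder table and column names:

```sql
DO $$
DECLARE
    rows_deleted BIGINT;
BEGIN
    LOOP
        -- Delete at most 10,000 rows per iteration, located via an indexed predicate.
        DELETE FROM entity_event_overview_2023_w01
        WHERE  ctid IN (
            SELECT ctid
            FROM   entity_event_overview_2023_w01
            WHERE  stateid < 100          -- placeholder predicate on an indexed column
            LIMIT  10000
        );
        GET DIAGNOSTICS rows_deleted = ROW_COUNT;
        EXIT WHEN rows_deleted = 0;
        COMMIT;                           -- keep each batch's transaction small
    END LOOP;
END $$;
```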

Maintaining Database Consistency

Ensuring database consistency during the cleanup process is critical to avoiding potential pitfalls and maintaining data integrity. 

Using transactions plays a pivotal role in achieving this goal. By encapsulating the data cleanup operations within a transaction block, administrators can guarantee that the changes are treated as a single atomic unit of work. This means that all cleanup operations are successfully executed, or none are. In the event of an error or interruption, the database can be rolled back to its original state, preventing partial or inconsistent modifications and safeguarding the overall consistency of the dataset.
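As a minimal sketch (the partition name and predicate are placeholders), the shape is simply:

```sql
BEGIN;

-- Placeholder cleanup statement; the real logic uses the CTEs described above.
DELETE FROM entity_event_overview_2023_w01
WHERE  stateid < 100;

-- Validate the result here; if anything looks wrong, issue ROLLBACK; instead of:
COMMIT;
```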

Testing and Validation

Before applying cleanup operations to a production database, it's essential to conduct extensive testing in a controlled environment. This includes testing different scenarios, edge cases, and the impact on database performance to identify and address potential issues.

Implementation

So, we started with the above-mentioned approach and spent some time finding the most efficient way to identify deletable data using CTEs.

We tested the CTEs thoroughly and moved on to writing a Stored Procedure that we could use for multiple partitions. The idea was for the Stored Procedure to perform the following tasks (steps):

  • open a transaction

  • create an empty partition (new)

  • move all the relevant data to this new partition in a customizable batch size

  • adjust the pointer to move to the next batch

  • iterate till it moves all relevant data.

Once the Stored Procedure completes these steps, it:

  • starts a loop to get a lock with the lock_timeout parameter set on the session level (this lock was sufficient in our use case because our application does not write to historical partitions)

  • removes the old partition from inheritance, attaches the new partition

  • renames the old partition with a suffix

  • renames the new partition as the old one

  • analyses the newly created partition and ends the transaction.

During all these steps, it should print the required metrics wherever necessary. Here is what the Stored Procedure looks like:
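The version below is a condensed sketch rather than the production code: the procedure name, the id pointer column, the "_obsolete" suffix, and the keep-logic are placeholders, and it assumes inheritance-based partitioning and PostgreSQL 11+ (so the procedure can commit between batches):

```sql
CREATE OR REPLACE PROCEDURE cleanup_partition(
    parent_table  TEXT,
    old_partition TEXT,
    batch_size    INT DEFAULT 10000
)
LANGUAGE plpgsql
AS $$
DECLARE
    new_partition TEXT    := old_partition || '_new';
    last_id       BIGINT  := 0;
    rows_moved    BIGINT;
    total_moved   BIGINT  := 0;
    lock_acquired BOOLEAN := FALSE;
BEGIN
    -- Step 1: create an empty partition with the same structure as the old one.
    EXECUTE format('CREATE TABLE %I (LIKE %I INCLUDING ALL)', new_partition, old_partition);

    -- Step 2: move the data we want to keep, batch by batch, advancing the pointer.
    LOOP
        EXECUTE format(
            'WITH batch AS (
                 SELECT * FROM %I
                 WHERE id > $1            -- placeholder: the real keep-logic is a CTE
                 ORDER BY id
                 LIMIT $2
             ), moved AS (
                 INSERT INTO %I SELECT * FROM batch RETURNING id
             )
             SELECT count(*), coalesce(max(id), $1) FROM moved',
            old_partition, new_partition)
        INTO rows_moved, last_id
        USING last_id, batch_size;

        total_moved := total_moved + rows_moved;
        RAISE NOTICE '% rows moved so far (pointer at %)', total_moved, last_id;
        EXIT WHEN rows_moved = 0;
        COMMIT;                           -- each batch runs in its own transaction
    END LOOP;

    -- Step 3: acquire an exclusive lock on the old partition, retrying on timeout.
    PERFORM set_config('lock_timeout', '2s', false);   -- session-level lock_timeout
    WHILE NOT lock_acquired LOOP
        BEGIN
            EXECUTE format('LOCK TABLE %I IN ACCESS EXCLUSIVE MODE', old_partition);
            lock_acquired := TRUE;
        EXCEPTION WHEN lock_not_available THEN
            RAISE NOTICE 'lock attempt on % timed out, retrying', old_partition;
        END;
    END LOOP;

    -- Step 4: swap inheritance and names: detach old, attach new, rename both.
    EXECUTE format('ALTER TABLE %I NO INHERIT %I', old_partition, parent_table);
    EXECUTE format('ALTER TABLE %I INHERIT %I', new_partition, parent_table);
    EXECUTE format('ALTER TABLE %I RENAME TO %I', old_partition, old_partition || '_obsolete');
    EXECUTE format('ALTER TABLE %I RENAME TO %I', new_partition, old_partition);

    -- Step 5: refresh planner statistics; the transaction ends when the procedure returns.
    EXECUTE format('ANALYZE %I', old_partition);
    RAISE NOTICE 'partition % rebuilt with % rows', old_partition, total_moved;
END;
$$;
```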

Once the Stored Procedure was ready, we could call it for all the required partitions, increasing or decreasing the batch size based on our database's available resources.

We execute the script below to call the procedure:
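Under the same placeholder names as the sketch above, the call script can be as simple as:

```sql
-- Hypothetical invocation; partition names and batch size are placeholders.
CALL cleanup_partition('entity_event_overview', 'entity_event_overview_2023_w01', 50000);
CALL cleanup_partition('entity_event_overview', 'entity_event_overview_2023_w02', 50000);
```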

With this approach, the rollback plan is ready in case the validation fails. After all the required validations, if we revert, we only need to remove the new partition from inheritance and add the old one back to inheritance. If the validation is all green, we drop the old partition.
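Sketched with the same naming assumptions, the revert path looks roughly like this:

```sql
-- Revert sketch: detach the cleaned partition and restore the original under its old name.
BEGIN;
ALTER TABLE entity_event_overview_2023_w01 NO INHERIT entity_event_overview;
ALTER TABLE entity_event_overview_2023_w01 RENAME TO entity_event_overview_2023_w01_rejected;
ALTER TABLE entity_event_overview_2023_w01_obsolete RENAME TO entity_event_overview_2023_w01;
ALTER TABLE entity_event_overview_2023_w01 INHERIT entity_event_overview;
COMMIT;

-- If validation is all green instead, the old data is simply dropped:
-- DROP TABLE entity_event_overview_2023_w01_obsolete;
```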

In conclusion, navigating the complexities of cleaning up data within partitioned PostgreSQL tables demands a strategic blend of careful planning and the right tools. Leveraging Common Table Expressions (CTEs) proves instrumental in seamlessly traversing partition hierarchies and precisely identifying deletable data.

