Using Recursive Queries to explore a file system structure stored in PostgreSQL

Unraveling Hierarchical Data with PostgreSQL's Recursive CTEs

July 22, 2024

By Calson Chiatiah

Recursive queries are a powerful feature in SQL that make it easy to work with tree-like or data that is organised in a hierarchy. These types of queries make it possible to navigate through these complex hierarchical data which often consist of layers within layers. In this blog post, we explore how to store and query files in a file system whose structure is stored in PostgreSQL database. Ever come across the WITH RECURSIVE syntax in PostgreSQL?

Introduction

To understand recursive queries in PostgreSQL, it is essential to start by understanding the concept of the Common Table Expressions (CTEs). CTEs can be thought of as defining temporary tables that exist just for one query. Each auxiliary statement in a WITH clause can be a SELECT, INSERT, UPDATE, or DELETE; and the WITH clause itself is attached to a primary statement that can be a SELECT, INSERT, UPDATE, DELETE, or MERGE. Their role often is to break down complex queries into simpler parts that are easy to manage and reason about. In simpler terms, it is a temporary result set which can be referenced in another SQL query.

CTEs in PostgreSQL can be both recursive and non-recursive. In this blog post, we focus on recursive CTEs which make it possible to write recursive queries. If you are interested in learning more about non-recursive CTE queries, you can checkout their docs;

A recursive query is basically a query that refers to itself. Such queries execute repeatedly referring back to its own result set until some condition is met. Recursive CTEs are usually made of two queries: the base query and the recursive query. The recursive query operates on the initial result set gotten from the base query. This process continues until the recursive condition is fulfilled.

Recursive queries are ideal when querying relational data such as family trees, organisational charts, files and directories in a file system, etc.

In this blog post, we focus on using recursive queries to retrieve the hierarchical structure of a file system consisting of files and directories. A basic recursive query in PostgreSQL looks something like this:

WITH RECURSIVE file_system (directory_id, directory_name, ...) AS (
  -- Non-recursive-query
  SELECT ...

  UNION ALL

  -- Recursive-query
  SELECT ... FROM file_system WHERE ...
)
SELECT * FROM file_system;

Let’s explore the different parts of the file_system CTE:

file_system: This is the name of the CTE which can be referenced in subsequent parts of the query.
directory_id, directory_name, ...: These are the columns that will be present in the CTE. They form the structure of the CTE. They can be omitted in the query.
Non-recursive-query: This is the non-recursive query which generates the base result set for the CTE.
UNION/UNION ALL: Responsible for joining the results from the base query and the recursive query.
Recursive query: This represents the recursive query and refers to the CTE. It is responsible for generating additional rows if the WHERE condition is met.

Let’s now construct the query to get all the files and subdirectories in a directory. In this example, both files and directories are placed in the same table with a flag to differentiate them. The directory whose subdirectories and files we want to find is at the root, meaning it is not itself in another directory. So its parent_directory_id, like every other root directory or file, is set to NULL to indicate this.

| id    | filename       | is_directory | parent_directory_id |
| ----- | -------------- | ------------ | ------------------- |
| 1     | directory1     | true         | 5                   |
| 2     | directory2     | true         | 1                   |
| 3     | directory3     | true         | 2                   |
| 4     | file4          | false        | NULL                |
| **5** | **directory5** | **true**     | **NULL**            |
| 6     | directory6     | true         | NULL                |
| 7     | file7          | true         | 6                   |
| 8     | directory8     | true         | NULL                |
| 9     | file9          | true         | 3                   |
| 10    | file10         | false        | 9                   |

Let’s consider the table above which is the representation of a file system. We are interested in getting all files and directories in a directory called directory5. The resulting table from the query should include all files and directories in directory5 and its subdirectories. For instance, we expect the resulting table from the query to include file10 because it is found in a subdirectory of directory5 but not include file7 because file7 is not found directory5 and neither is it in a any of its subdirectories.

Run the following SQL query to create and populate a Postgres table with the above data:

CREATE TABLE directory_structure (
    id SERIAL PRIMARY KEY,
    filename VARCHAR(255),
    is_directory BOOLEAN DEFAULT FALSE,
    parent_directory_id INTEGER REFERENCES directory_structure(id)
);

INSERT INTO directory_structure (filename, is_directory, parent_directory_id)
VALUES
    ('directory1', true, 5),
    ('directory2', true, 1),
    ('directory3', true, 2),
    ('file4', false, NULL),
    ('directory5', true, NULL),
    ('directory6', true, NULL),
    ('file7', true, 6),
    ('directory8', true, NULL),
    ('file9', true, 3),
    ('file10', false, 9);

The following CTE recursive query can be used to get all the files and directories in the directory5 directory.

WITH RECURSIVE file_system AS (
    SELECT f.*
    FROM directory_structure f
    WHERE id = 5

    UNION

    SELECT t.*
    FROM directory_structure t
    JOIN file_system ON t.parent_directory_id = file_system.id
)
SELECT * FROM file_system

When this query is run, it should return the following result set:

| id  | filename   | is_directory | parent_directory_id |
| --- | ---------- | ------------ | ------------------- |
| 5   | directory5 | true         | NULL                |
| 1   | file1      | false        | 5                   |
| 2   | directory2 | true         | 1                   |
| 3   | directory3 | true         | 2                   |
| 9   | directory9 | true         | 3                   |
| 10  | file10     | false        | 9                   |

The recursive CTE file_system defines the non-recursive and a recursive member. The non-recursive member returns the base result set which is the record with the id of 5.

| id  | filename   | is_directory | parent_directory_id |
| --- | ---------- | ------------ | ------------------- |
| 5   | directory5 | true         | NULL                |

The first iteration of the recursive term returns the immediate directories/files in the directory5 directory.

| id  | filename   | is_directory | parent_directory_id |
| --- | ---------- | ------------ | ------------------- |
| 2   | directory2 | true         | 1                   |

This becomes the input for subsequent iterations of the recursive term. Postgres then repeatedly executes the recursive term and in each case, it returns the files/sub-directories at that level.

The final result set is the union of all the result sets. The final result set contains all the files and directories in the directory5 directory and its subdirectories.

Conclusion

In this blog post, we explored the concept of recursive queries in PostgreSQL and demonstrated how to use them to retrieve the hierarchical structure of a directory. This can be useful in several scenarios, such as building a file system explorer.

Don’t let database headaches slow you down. Let experts at Dremonkey design a solution that scales with your business. Contact us now!

In a subsequent blog post, we will build a file system explorer in VueJs which renders the hierarchical structure of a directory queried with a recursive PostgreSQL query.

References: