Archive S3 files and save them back in S3 in a stream

Learn how to stream files from S3, create an archive from the streamed files, and upload it back to S3 in one go

Oftentimes, we want to create an archive from S3 files and save this archive back in S3. Consider a scenario where you have several S3 files which you regularly serve to your users as a zip, and this zip is usually several gigabytes in size. If you need to serve this zip to many users, it quickly becomes inefficient to do the zipping on the fly every time a user needs it.

Instead, a more convenient approach could be to zip the files beforehand. You could zip them during the first request for the zip and, for subsequent requests, simply serve a presigned GET URL for the zip. Though this greatly reduces download time for your users, it consumes additional S3 storage and duplicates data. You could consider deleting the original S3 files after archiving them, or set up other mitigations depending on your needs and requirements. This tutorial will guide you on how to zip several S3 files in an S3 bucket and save the zip back in S3 in one go using streams.
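
As a side note, serving a presigned GET URL for an already generated zip could look roughly like the sketch below. It assumes the @aws-sdk/s3-request-presigner package (not used elsewhere in this tutorial), and getZipDownloadUrl is a hypothetical helper with placeholder region, bucket and key names:

import { GetObjectCommand, S3Client } from '@aws-sdk/client-s3';
import { getSignedUrl } from '@aws-sdk/s3-request-presigner';

const s3client = new S3Client({ region: 'my-aws-region' });

// Hypothetical helper: generate a presigned GET URL for a previously created zip, valid for one hour.
async function getZipDownloadUrl(bucket: string, key: string): Promise<string> {
  return getSignedUrl(
    s3client,
    new GetObjectCommand({ Bucket: bucket, Key: key }),
    { expiresIn: 3600 }
  );
}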

This tutorial assumes you are familiar with creating buckets in S3. We will use the S3 v3 SDK client. We also do not explain the details of every S3 command used, except where necessary for comprehension. Even if you have not used S3 to read files before, it should be fairly easy to follow along. You can read more about the S3 SDK client here.

Installing the necessary packages

For this tutorial, we need to install the S3 v3 client and the archiver packages. Archiver is a streaming interface for archive generation. We use pnpm here to install the packages; any other Node package manager should work fine.

pnpm add @aws-sdk/client-s3
pnpm add @aws-sdk/lib-storage
pnpm add archiver
# types for archiver because we are using TypeScript
pnpm add -D @types/archiver
Implementation

It is worth noting that in this tutorial we do not download each entire file and then archive it before uploading the archive to S3. Instead, we stream the files from S3, generate an archive from the streams, and continuously push this archived stream back to S3 until we are done, all in one go. The GetObjectCommand of @aws-sdk/client-s3 is used to read the streams, archiver is used to create an archived stream, and Upload of @aws-sdk/lib-storage uploads the archived stream to S3. In this tutorial, the format of the produced archive is a zip, but we could use any other format supported by archiver.

We will wrap the entire functionality in a class with a public API that does the job. Before defining a constructor for this class, let’s define an interface for an S3 file to be zipped.

// You could use the same filename from the key as the name of the zip.
// Here we will explicitly set the name.
interface FileToZip {
  // S3 file key
  key: string;
  // filename when zipped
  // The file extension in the key should match the file extension of filename.
  // For simplicity, we ignore these extra details.
  filename: string;
}
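
If you do want to enforce that the extensions match, a minimal sketch of such a check could look like this (validateFileToZip is a hypothetical helper, not part of the class we build below):

import { extname } from 'node:path';

// Hypothetical guard: ensure the S3 key and the target filename share the same extension.
function validateFileToZip({ key, filename }: FileToZip): void {
  if (extname(key).toLowerCase() !== extname(filename).toLowerCase()) {
    throw new Error(`Extension mismatch between key "${key}" and filename "${filename}"`);
  }
}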

Even though we could use the PutObjectCommand of @aws-sdk/client-s3 to upload the stream, we will use its counterpart, Upload, from @aws-sdk/lib-storage. Upload is ideal for efficiently uploading large blobs, buffers or streams. Unlike PutObjectCommand, you don’t need to specify the content length when using Upload. Moreover, Upload is recommended for uploading large files.

import { PassThrough, Readable } from 'node:stream';
import { GetObjectCommand, S3Client } from '@aws-sdk/client-s3';
import { Upload } from '@aws-sdk/lib-storage';
import archiver from 'archiver';

class S3Archiver {
  private readonly s3client: S3Client;
  private readonly bucketName: string;

  constructor() {
    // These values should be securely stored OUTSIDE versioned code.
    // Usually they are stored as environment variables,
    // we reference them here as plain strings for simplicity.
    this.bucketName = 'my-bucket-name';
    this.s3client = new S3Client({
      region: 'my-aws-region',
      credentials: {
        accessKeyId: 'my-aws-access-key-id',
        secretAccessKey: 'my-aws-secret-access-key',
      },
    });
  }

  // ... methods of this class are defined below
}
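
As hinted in the constructor comments, you would normally read these values from the environment instead of hard-coding them. A minimal sketch of such a constructor, assuming the variable names AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and S3_BUCKET_NAME (use whatever names your deployment provides):

  constructor() {
    // Fail fast if the configuration is missing.
    const { AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET_NAME } = process.env;
    if (!AWS_REGION || !AWS_ACCESS_KEY_ID || !AWS_SECRET_ACCESS_KEY || !S3_BUCKET_NAME) {
      throw new Error('Missing S3 configuration in environment variables');
    }

    this.bucketName = S3_BUCKET_NAME;
    this.s3client = new S3Client({
      region: AWS_REGION,
      credentials: {
        accessKeyId: AWS_ACCESS_KEY_ID,
        secretAccessKey: AWS_SECRET_ACCESS_KEY,
      },
    });
  }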

Below, we define the first class method, which gets the readable streams from S3. getStreamFromS3Files returns a promise that resolves to an array of objects, each pairing a filename with the readable stream of the corresponding S3 file.

class S3Archiver {
  // ... constructor is defined above

  async #getStreamFromS3Files(files: FileToZip[]) {
    return Promise.all(
      files.map(async ({ key, filename }) => {
        const { Body } = await this.s3client.send(
          new GetObjectCommand({
            Bucket: this.bucketName,
            Key: key,
          })
        );

        // The AWS v3 SDK returns a mixed type for the Body property which doesn't account for the execution environment:
        // in Node it will be a Readable, while in the browser it will be a Blob or ReadableStream.
        // We have to manually check the type of the Body property since helpers of many backend frameworks (e.g. NestJS)
        // don't support ReadableStream or Blob and require a Readable.
        // Information on this is fragmented, but many pieces can be found here: https://github.com/aws/aws-sdk-js-v3/issues/1877
        if (!(Body instanceof Readable)) {
          throw new Error("File not found or Body isn't a readable stream");
        }

        return {
          filename,
          stream: Body,
        };
      })
    );
  }

  // ... other methods of this class are defined below
}

The next method, uploadStream, uploads a stream to S3. In our case, it will upload the archived stream to S3.

class S3Archiver {
  // ... constructor and other methods are defined above

  async #uploadStream(destination: string, bodyStream: PassThrough) {
    const uploadStream = new Upload({
      client: this.s3client,
      params: {
        Bucket: this.bucketName,
        Key: destination,
        Body: bodyStream,
        ContentType: 'application/zip',
      },
    });

    return uploadStream.done();
  }

  // ... other methods of this class are defined below
}
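
If you need more control over the multipart upload, Upload also accepts tuning options besides params, namely queueSize, partSize and leavePartsOnError. A possible variant of uploadStream, with illustrative (not prescriptive) values:

  async #uploadStream(destination: string, bodyStream: PassThrough) {
    const uploadStream = new Upload({
      client: this.s3client,
      params: {
        Bucket: this.bucketName,
        Key: destination,
        Body: bodyStream,
        ContentType: 'application/zip',
      },
      queueSize: 4, // how many parts are uploaded concurrently
      partSize: 5 * 1024 * 1024, // 5 MiB per part, the minimum S3 allows
      leavePartsOnError: false, // clean up already-uploaded parts if the upload fails
    });

    return uploadStream.done();
  }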

Finally, let’s write the last method, saveZipToS3, which uses getStreamFromS3Files and uploadStream to stream files from S3 and save the zipped stream back in S3. The whole process involves obtaining the streams of the files in S3 and generating a zipped version of these streams as they arrive. The zipped output is piped into a PassThrough stream, and the PassThrough stream is pushed to S3. This continues until all the files have been completely streamed from S3, zipped and uploaded in a single pass.

You can read more about PassThrough here, but in simple terms, it passes its input bytes straight through to its output. Through the rest of this tutorial, we will refer to the stream from a PassThrough instance as a “PassThrough stream”.
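
As a quick standalone illustration of that behaviour (unrelated to S3):

import { PassThrough } from 'node:stream';

const passThrough = new PassThrough();

// Everything written to the PassThrough comes out the other end unchanged.
passThrough.on('data', (chunk) => console.log(chunk.toString())); // logs "hello"
passThrough.write('hello');
passThrough.end();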

class S3Archiver {
  // ... constructor and other methods are defined above

  async saveZipToS3(filesToZip: FileToZip[], destination: string) {
    const s3FileStreams = await this.#getStreamFromS3Files(filesToZip);
    const stream = new PassThrough();

    return new Promise((resolve, reject) => {
      this.#uploadStream(`${destination}.zip`, stream)
        // it will resolve to true if all goes well
        .then(() => resolve(true))
        .catch((err) => reject(err));

      // archiver could be used to generate TARs too.
      // We are using a compression level of 5 here.
      // Check their docs (https://www.archiverjs.com/docs/archive-formats) for other options.
      const archive = archiver('zip', { zlib: { level: 5 } });

      archive.on('error', (err) => reject(err));
      archive.on('warning', (err) => console.error(err));

      try {
        archive.pipe(stream);
        for (const downStream of s3FileStreams) {
          archive.append(downStream.stream, {
            name: downStream.filename,
          });
        }
        archive.finalize();
      } catch (error) {
        reject(new Error('Error occurred while archiving a stream', { cause: error }));
      }
    });
  }
}

We start off by reading the streams from S3 by calling getStreamFromS3Files and then creating an instance of PassThrough.
Let’s move on to the Promise in saveZipToS3. The line

this.#uploadStream(`${destination}.zip`, stream)

initiates the upload of a PassThrough stream, which is handled by uploadStream. Note that the streams from getStreamFromS3Files are going to be piped into the PassThrough stream by archiver, and Upload of @aws-sdk/lib-storage will upload the PassThrough stream to S3. Next, we create an instance of archiver with the necessary error handling. The line

archive.pipe(stream);

of the try block sets the PassThrough stream as the output to which the stream from archiver will be sent. The loop after this line goes through the streams from getStreamFromS3Files and appends each one to archive. This in turn pipes the archived stream into the PassThrough stream, and uploadStream uploads the archived data to S3 as it comes in. Once archiving is done, archive.finalize() is called to indicate that streaming has ended. When the last byte has been uploaded, uploadStream.done() resolves, and resolve of the outer Promise is called. All this happens in one go; we do not stop at any point in the process.

To use the S3Archiver, we create an instance of it and then call its saveZipToS3 method.

const s3Archiver = new S3Archiver();

// If you want to use the same file name as the key, you have to update the code to handle this properly.
// For simplicity, we have ignored this.
s3Archiver.saveZipToS3(
  [
    {
      key: 'file-1.png',
      filename: 'new-file-1-name.png',
    },
    {
      key: 'file-2.jpg',
      filename: 'new-file-2-name.jpg',
    },
  ],
  // do not add the extension here
  'bucket-destination-to-save-file'
);
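
Since saveZipToS3 returns a promise, in real code you would typically await it and handle failures. A minimal sketch, assuming you are inside an async function (or a module with top-level await):

try {
  await s3Archiver.saveZipToS3(
    [{ key: 'file-1.png', filename: 'new-file-1-name.png' }],
    'bucket-destination-to-save-file'
  );
  console.log('Archive uploaded successfully');
} catch (error) {
  console.error('Failed to archive and upload the files', error);
}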

If you are using the S3 v2 SDK client, the process will be very similar, except that getObject of v2 does not directly return a readable stream. You will have to call createReadStream() on the request returned by getObject instead of chaining promise(), which would buffer the whole object in memory.
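
For reference, a minimal sketch of reading a file as a stream with the v2 SDK, assuming the aws-sdk v2 package and the same placeholder bucket and key names used earlier:

import AWS from 'aws-sdk';

const s3 = new AWS.S3({ region: 'my-aws-region' });

// In v2, the request object returned by getObject exposes createReadStream().
const fileStream = s3
  .getObject({ Bucket: 'my-bucket-name', Key: 'file-1.png' })
  .createReadStream();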

References