This guide describes how to import CSV or JSON data stored in S3 into DynamoDB using the AWS CLI.
tl;dr
Interesting facts about importing data from S3 into DynamoDB:
- Gzipping your data does not lead to faster import times compared to raw files.
- Import time is the same for both CSV and JSON data.
- Importing from S3 creates a new DynamoDB table. There is no option to append data to an existing table.
- Importing from S3 is less expensive in terms of money compared to batch writing to DynamoDB.
- Cost of importing is calculated after data is decompressed. Gzipped or raw data will always cost the same.
0. Motivation
This guide is not about the fastest way to import data into DynamoDB. There are articles on the internet that describe how to import data much faster, but all of them require a lot of setup, orchestrating infrastructure and writing code. This guide is about the convenience of using nothing but the import-table command from the AWS CLI to import data.
In a nutshell, importing data is as simple as preparing it as CSV or JSON files and running the import-table command. This is especially useful when testing against DynamoDB with an arbitrary amount of test data.
1. Describing DynamoDB table
The following JSON describes the DynamoDB table that we want to import data into. This guide follows the single-table design principle in DynamoDB.
table.json
{
  "TableName": "dynamodb-import-from-s3",
  "AttributeDefinitions": [
    {
      "AttributeName": "pk",
      "AttributeType": "S"
    },
    {
      "AttributeName": "sk",
      "AttributeType": "S"
    }
  ],
  "KeySchema": [
    {
      "AttributeName": "pk",
      "KeyType": "HASH"
    },
    {
      "AttributeName": "sk",
      "KeyType": "RANGE"
    }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}
For reference, a table with this definition could be created directly via the AWS CLI:
aws dynamodb create-table --cli-input-json file://table.json
For the import itself this is not necessary: importing from S3 always creates a new table, so table.json is only passed to the import-table command in step 5 and the table must not exist beforehand.
2. Preparing data to upload
Importing to DynamoDB supports three different file types: CSV, JSON and ION. This guide only explores how to upload CSV and JSON files.
Structure of CSV data
The structure of a CSV file is straightforward. The header line has to contain the key attributes defined in the KeySchema of table.json; any additional columns are imported as regular attributes.
data.csv
pk,sk,title
USER#1,POST#2023-01-06T08:40:25.266Z,"This is a post title"
USER#1,POST#2023-01-07T08:40:25.266Z,"This is another post title"
Structure of JSON data
An entry in the target table might look something like this:
entry.json
{
  "Item": {
    "pk": {
      "S": "USER#1"
    },
    "sk": {
      "S": "POST#2023-01-06T08:40:25.266Z"
    },
    "title": {
      "S": "This is a post title"
    }
  }
}
When importing from JSON files every line represents an entry:
data.json
{"Item":{"pk":{"S":"USER#1"},"sk":{"S":"POST#2023-01-06T08:40:25.266Z"},"title":{"S":"This is a post title"}}}
{"Item":{"pk":{"S":"USER#1"},"sk":{"S":"POST#2023-01-07T08:40:25.266Z"},"title":{"S":"This is another post title"}}}
Difference between uploading CSV and JSON files
Which file format to use depends on the complexity of the data. CSV is very limited in structuring data, whereas DynamoDB JSON allows for every attribute type that DynamoDB supports.
There is no difference in import performance between CSV files and JSON files, even though JSON files are generally larger due to their verbose structure.
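As an illustration, here is a hypothetical item (not part of the sample data) with a number, a string set and a nested map. All of these are easy to express as a DynamoDB JSON line, while CSV has no native way to represent them:
{"Item":{"pk":{"S":"USER#1"},"sk":{"S":"PROFILE"},"age":{"N":"42"},"tags":{"SS":["aws","dynamodb"]},"settings":{"M":{"newsletter":{"BOOL":true}}}}}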
Script to create import data
The following script creates a data.json file and populates it with 10,000 entries of fake data using fakerjs.
index.js
const fs = require("fs");
const { faker } = require("@faker-js/faker");

const out = fs.createWriteStream("data.json");

// All generated items share a single partition key; the sort key makes each item unique.
// Note: faker.random.alpha is the @faker-js/faker v7 API; newer versions use faker.string.alpha.
const user = `USER#${faker.random.alpha(10)}`;

// Builds one line of DynamoDB JSON matching the structure of data.json shown above.
const createLine = () =>
  `${JSON.stringify({
    Item: {
      pk: {
        S: user,
      },
      sk: {
        // toISOString() produces the POST#<ISO timestamp> sort key format used in this guide
        S: `POST#${faker.date.birthdate().toISOString()}`,
      },
      title: {
        S: faker.lorem.sentence(10),
      },
    },
  })}\n`;

for (let i = 0; i < 10000; i++) {
  out.write(createLine());
}

out.end();
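Assuming Node.js and npm are available, the script can be run after installing the faker dependency used in the require above:
npm install @faker-js/faker
node index.js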
3. Creating S3 bucket to import data from
There is no need to create a S3 bucket for importing data. Data can be imported from any S3 bucket. This guides uses its own S3 bucket by running following command:
aws s3api create-bucket --bucket dynamodb-import-001
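Depending on the configured AWS region, the bucket location may have to be specified explicitly; outside of us-east-1 the call typically requires a location constraint, for example (eu-central-1 is just an arbitrary example region):
aws s3api create-bucket --bucket dynamodb-import-001 --create-bucket-configuration LocationConstraint=eu-central-1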
4. Uploading data to S3
Even though this guide uses its own dedicated bucket for importing to DynamoDB, it is recommended to always prefix import data. The following command gzips the source data and uploads it to the target bucket dynamodb-import-001 from the previous step under the prefix 001.
gzip -c data.json | aws s3 cp - s3://dynamodb-import-001/001/data.json.gz
Uploading raw data versus gzipped data
There is no import performance gain or cost optimization when data is gzipped before being uploaded compared to uploading raw data. Importing raw data and gzipped data takes roughly the same time, and costs are calculated after the data is decompressed.
Gzipped files are generally much smaller than the original raw data, so gzipping still has two obvious advantages:
- Uploads to S3 are faster
- Storage costs on S3 are less
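For comparison, uploading the raw file is a plain copy. To keep raw and gzipped data out of the same import, it is uploaded here under a separate (hypothetical) prefix 002/, and --input-compression-type would be set to NONE in step 5:
aws s3 cp data.json s3://dynamodb-import-001/002/data.json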
5. Creating the DynamoDB table and importing data
In this last step everything comes together in a single command that looks like this:
aws dynamodb import-table --s3-bucket-source S3Bucket=dynamodb-import-001,S3KeyPrefix=001/ --input-format DYNAMODB_JSON --table-creation-parameters file://table.json --input-compression-type GZIP
There are a few things happening here:
- --s3-bucket-source S3Bucket=dynamodb-import-001,S3KeyPrefix=001/ points to the S3 bucket created in step 3 and specifies the prefix.
- --input-format DYNAMODB_JSON specifies that the import data is in DynamoDB JSON format. Other values are CSV and ION.
- --table-creation-parameters file://table.json points to the table.json file created in step 1.
- --input-compression-type GZIP tells AWS that the uploaded data is gzipped.
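For completeness, importing the CSV file from step 2 would look roughly like this. It is assumed here that data.csv is gzipped and uploaded under a separate prefix (003/), that its first line serves as the header, and that the TableName in table.json has been changed, since every import creates a new table:
gzip -c data.csv | aws s3 cp - s3://dynamodb-import-001/003/data.csv.gz
aws dynamodb import-table --s3-bucket-source S3Bucket=dynamodb-import-001,S3KeyPrefix=003/ --input-format CSV --table-creation-parameters file://table.json --input-compression-type GZIP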
Time to import data
The time it takes to import data from S3 to DynamoDB depends on the amount of imported data: the more items are imported, the longer it takes. It turns out, however, that import times do not grow linearly; the time per item improves the larger the files get.
Here is a table with the different sample sizes of data used in this guide:
| Items | Time to import |
| --- | --- |
| 1 | 2m 30s |
| 10 | 2m 35s |
| 100 | 2m 40s |
| 1,000 | 2m 45s |
| 10,000 | 3m 00s |
| 100,000 | 5m 00s |
| 1,000,000 | 20m 30s |
| 10,000,000 | 98m 40s |
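The import itself runs asynchronously: import-table immediately returns a description that contains an ImportArn. Progress and the final status of an import can be checked with the CLI, for example (the ARN placeholder has to be replaced with the value returned by import-table):
aws dynamodb list-imports
aws dynamodb describe-import --import-arn <import-arn>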