
How to import data as CSV from S3 to DynamoDB using CLI

Posted on: January 16, 2023 at 09:00 AM

This guide describes how to import CSV or JSON data stored in S3 into DynamoDB using the AWS CLI.

tl;dr

Interesting facts around importing data from S3 into DynamoDB:

  - The import-table command creates the target table itself; no extra infrastructure or custom code is needed.
  - There is no performance difference between importing CSV files and importing DynamoDB JSON files.
  - Gzipping the source data does not change import time or cost, but it speeds up the upload and reduces S3 storage.
  - Import time grows sub-linearly: 10,000 items take about 3 minutes, 10,000,000 items about 99 minutes.

Table of contents

  0. Motivation
  1. Describing DynamoDB table
  2. Preparing data to upload
  3. Creating S3 bucket to import data from
  4. Uploading data to S3
  5. Creating the DynamoDB table and importing data

0. Motivation

This guide is not about the fastest way to import data into DynamoDB. There are articles on the internet that describe how to import data much faster, but all of them require a lot of setup, orchestrating infrastructure, and writing code. This guide is about the convenience of using only the import-table command from the AWS CLI to import data.

In a nutshell, importing data is as simple as preparing CSV or JSON files and running the import-table command. This is especially useful when testing against DynamoDB with an arbitrary amount of test data.

1. Describing DynamoDB table

The following JSON describes the DynamoDB table that we want to import data into. This guide follows the single-table design principle in DynamoDB.

table.json

{
    "TableName": "dynamodb-import-from-s3",
    "AttributeDefinitions": [
        {
            "AttributeName": "pk",
            "AttributeType": "S"
        },
        {
            "AttributeName": "sk",
            "AttributeType": "S"
        }
    ],
    "KeySchema": [
        {
            "AttributeName": "pk",
            "KeyType": "HASH"
        },
        {
            "AttributeName": "sk",
            "KeyType": "RANGE"
        }
    ],
    "BillingMode": "PAY_PER_REQUEST"
}

For reference, the table could be created directly via the AWS CLI. Skip this if you want import-table in step 5 to create the table, since the import requires that the table does not exist yet:

aws dynamodb create-table --cli-input-json file://table.json
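
If you do create the table at this point, its status can be checked with the CLI before moving on. A small verification sketch using the standard describe-table and wait commands:

# wait until the table exists, then print its status (should be ACTIVE)
aws dynamodb wait table-exists --table-name dynamodb-import-from-s3
aws dynamodb describe-table \
    --table-name dynamodb-import-from-s3 \
    --query "Table.TableStatus"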

2. Preparing data to upload

Importing into DynamoDB supports three different file formats: CSV, DynamoDB JSON, and Amazon Ion. This guide only explores how to upload CSV and JSON files.

Structure of CSV data

The structure of a CSV file is straightforward. The header line has to include the key attributes defined in the KeySchema of table.json; any additional columns become regular item attributes.

data.csv

pk,sk,title
USER#1,POST#2023-01-06T08:40:25.266Z,"This is a post title"
USER#1,POST#2023-01-07T08:40:25.266Z,"This is another post title"

Structure of JSON data

An entry in the target table might look something like this:

entry.json

{
    "Item": {
        "pk": {
            "S": "USER#1"
        },
        "sk": {
            "S": "POST#2023-01-06T08:40:25.266Z"
        },
        "title": {
            "S": "This is a post title"
        }
    }
}

When importing from JSON files, every line represents one item:

data.json

{"Item":{"pk":{"S":"USER#1"},"sk":{"S":"POST#2023-01-06T08:40:25.266Z"},"title":{"S":"This is a post title"}}}
{"Item":{"pk":{"S":"USER#1"},"sk":{"S":"POST#2023-01-07T08:40:25.266Z"},"title":{"S":"This is another post title"}}}

Difference between uploading CSV and JSON files

Which file format to use depends on the complexity of the data. CSV is very limited in how data can be structured, whereas DynamoDB JSON allows any attribute type that DynamoDB supports.

There is no difference in performance between importing CSV files and importing JSON files, even though JSON files are generally larger due to their more verbose structure.
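
For completeness, here is a sketch of what the import command from step 5 might look like for CSV input. It assumes a gzipped data.csv.gz was uploaded under the same 001/ prefix; the --input-format-options block can be left out when the defaults (comma delimiter, header taken from the first line) are fine. Note that with CSV input, non-key attributes are imported as DynamoDB strings.

aws dynamodb import-table \
    --s3-bucket-source S3Bucket=dynamodb-import-001,S3KeyPrefix=001/ \
    --input-format CSV \
    --input-format-options '{"Csv": {"Delimiter": ","}}' \
    --table-creation-parameters file://table.json \
    --input-compression-type GZIP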

Script to create import data

The following script creates a data.json file and populates it with 10,000 entries of fake data using fakerjs.

index.js

const fs = require("fs");
const { faker } = require("@faker-js/faker");

const out = fs.createWriteStream("data.json");
// one user for all generated posts
// note: newer faker versions expose this as faker.string.alpha(10)
const user = `USER#${faker.random.alpha(10)}`;

// each call returns one line of DynamoDB JSON, matching the structure of data.json above
const createLine = () => `${JSON.stringify({
    Item: {
        pk: {
            S: user
        },
        sk: {
            // toISOString() keeps the sk in the same ISO 8601 format as the examples above
            S: `POST#${faker.date.birthdate().toISOString()}`
        },
        title: {
            S: faker.lorem.sentence(10)
        }
    }
})}\n`;

for (let i = 0; i < 10000; i++) {
    out.write(createLine());
}
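
To generate the file, install the dependency and run the script; the checks afterwards are just a quick sanity check of the output:

npm install @faker-js/faker
node index.js
head -n 1 data.json   # one DynamoDB JSON item per line
wc -l data.json       # should print 10000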

3. Creating S3 bucket to import data from

There is no need to create a dedicated S3 bucket for importing data; data can be imported from any S3 bucket. This guide uses its own S3 bucket, created by running the following command:

aws s3api create-bucket --bucket dynamodb-import-001
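
One caveat: bucket names are globally unique, so dynamodb-import-001 may already be taken, and outside of us-east-1 the call needs a location constraint. A sketch assuming the eu-central-1 region:

aws s3api create-bucket \
    --bucket dynamodb-import-001 \
    --create-bucket-configuration LocationConstraint=eu-central-1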

4. Uploading data to S3

Even though this guide uses its own dedicated bucket for importing to DynamoDB, it is recommended to always prefix import data. The following command gzips the source data, uploads it to the target bucket dynamodb-import-001 from the previous step, and prefixes it with 001/.

gzip -c data.json | aws s3 cp - s3://dynamodb-import-001/001/data.json.gz
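
A quick way to confirm the object landed under the expected prefix:

aws s3 ls s3://dynamodb-import-001/001/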

Uploading raw data versus gzipped data

There is no performance gain or cost reduction from gzipping the data before uploading it compared to uploading it raw. Importing raw data and gzipped data takes roughly the same time, and import costs are calculated on the uncompressed size of the data.

Gzipped files are generally much smaller than the raw data, so gzipping still has two obvious advantages (see the quick size comparison below):

  1. Uploads to S3 are faster
  2. Storage costs on S3 are lower
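
To get a feel for the difference, the raw and compressed sizes can be compared directly on the generated file:

wc -c < data.json          # raw size in bytes
gzip -c data.json | wc -c  # gzipped size in bytes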

5. Creating the DynamoDB table and importing data

In the last step everything comes together in a single command that looks something like this:

aws dynamodb import-table \
    --s3-bucket-source S3Bucket=dynamodb-import-001,S3KeyPrefix=001/ \
    --input-format DYNAMODB_JSON \
    --table-creation-parameters file://table.json \
    --input-compression-type GZIP

There are a few things happening here:

  - --s3-bucket-source points to the bucket dynamodb-import-001 and the 001/ prefix the data was uploaded to.
  - --input-format DYNAMODB_JSON tells the import that every line of the source file is one item in DynamoDB JSON format.
  - --table-creation-parameters reuses table.json from step 1; the import creates the table with exactly these parameters.
  - --input-compression-type GZIP is required because the uploaded file was gzipped.
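
The import runs asynchronously: the command returns immediately with an import ARN. Progress can then be followed with the CLI; the ARN below is a placeholder taken from the import-table output:

# list all imports of the account, including their status
aws dynamodb list-imports

# follow a single import until its ImportStatus switches to COMPLETED
aws dynamodb describe-import --import-arn <ImportArn from the import-table output>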

Time to import data

The time it takes to import data from S3 to DynamoDB depends on the amount of imported data: the more items are imported, the longer it takes. It turns out, however, that import times do not grow linearly; the time per item improves the larger the files get.

Here is a table with different sample sizes of data used in this guide.

Items         Time to import
1             2m 30s
10            2m 35s
100           2m 40s
1,000         2m 45s
10,000        3m 00s
100,000       5m 00s
1,000,000     20m 30s
10,000,000    98m 40s