This guide describes how to import CSV or JSON data stored in S3 into DynamoDB using the AWS CLI.
tl;dr
Interesting facts about importing data from S3 into DynamoDB:
- Gzipping your data does not lead to faster import times compared to raw files.
- Import time is the same for both CSV and JSON data.
- Importing from S3 creates a new DynamoDB table. There is no option to append data to an existing table.
- Importing from S3 is less expensive in terms of money compared to batch writing to DynamoDB.
- Cost of importing is calculated after data is decompressed. Gzipped or raw data will always cost the same.
0. Motivation
This guide is not about the fastest way to import data into DynamoDB. There are articles on the internet that describe how to import data much faster, but all of them require a lot of setup, orchestrating infrastructure and writing code. This guide is about the convenience of using nothing but the import-table command from the AWS CLI to import data.
In a nutshell, importing data is as simple as preparing it as CSV or JSON files and running the import-table command. This is especially useful when testing against DynamoDB with an arbitrary amount of test data.
1. Describing DynamoDB table
The following JSON describes the DynamoDB table that we want to import data into. This guide follows the single-table design principle in DynamoDB.
table.json
{
  "TableName": "dynamodb-import-from-s3",
  "AttributeDefinitions": [
    {
      "AttributeName": "pk",
      "AttributeType": "S"
    },
    {
      "AttributeName": "sk",
      "AttributeType": "S"
    }
  ],
  "KeySchema": [
    {
      "AttributeName": "pk",
      "KeyType": "HASH"
    },
    {
      "AttributeName": "sk",
      "KeyType": "RANGE"
    }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}
For reference, a table with this definition could be created directly via the AWS CLI:
aws dynamodb create-table --cli-input-json file://table.json
For the import itself this is not necessary: importing from S3 always creates a new table, so table.json is only passed to the import-table command in step 5 and the table must not exist beforehand.
2. Preparing data to upload
Importing to DynamoDB supports three different file types: CSV, JSON and ION. This guide only explores how to upload CSV and JSON files.
Structure of CSV data
The structure of a CSV file is straightforward. The header line has to contain the key attributes defined in the KeySchema of table.json; any additional columns are imported as regular attributes.
data.csv
pk,sk,title
USER#1,POST#2023-01-06T08:40:25.266Z,"This is a post title"
USER#1,POST#2023-01-07T08:40:25.266Z,"This is another post title"
Structure of JSON data
An entry in the target table might look something like this:
entry.json
{
  "Item": {
    "pk": {
      "S": "USER#1"
    },
    "sk": {
      "S": "POST#2023-01-06T08:40:25.266Z"
    },
    "title": {
      "S": "This is a post title"
    }
  }
}
When importing from JSON files every line represents an entry:
data.json
{"Item":{"pk":{"S":"USER#1"},"sk":{"S":"POST#2023-01-06T08:40:25.266Z"},"title":{"S":"This is a post title"}}}
{"Item":{"pk":{"S":"USER#1"},"sk":{"S":"POST#2023-01-07T08:40:25.266Z"},"title":{"S":"This is another post title"}}}
Difference between uploading CSV and JSON files
Which file format to use depends on the complexity of the data. CSV is very limited in structuring data, whereas DynamoDB JSON allows for every attribute type that DynamoDB supports.
There is no difference in import performance between CSV files and JSON files, even though JSON files are generally larger due to their verbose structure.
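As an illustration, here is a hypothetical item (not part of the sample data) with a number, a string set and a nested map. All of these are easy to express as a DynamoDB JSON line, while CSV has no native way to represent them:
{"Item":{"pk":{"S":"USER#1"},"sk":{"S":"PROFILE"},"age":{"N":"42"},"tags":{"SS":["aws","dynamodb"]},"settings":{"M":{"newsletter":{"BOOL":true}}}}}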
Script to create import data
The following script creates a data.json file and populates it with 10,000 entries of fake data using fakerjs.
index.js
const fs = require("fs");
const { faker } = require("@faker-js/faker");

const out = fs.createWriteStream("data.json");

// All generated items share a single partition key; the sort key makes each item unique.
// Note: faker.random.alpha is the @faker-js/faker v7 API; newer versions use faker.string.alpha.
const user = `USER#${faker.random.alpha(10)}`;

// Builds one line of DynamoDB JSON matching the structure of data.json shown above.
const createLine = () =>
  `${JSON.stringify({
    Item: {
      pk: {
        S: user,
      },
      sk: {
        // toISOString() produces the POST#<ISO timestamp> sort key format used in this guide
        S: `POST#${faker.date.birthdate().toISOString()}`,
      },
      title: {
        S: faker.lorem.sentence(10),
      },
    },
  })}\n`;

for (let i = 0; i < 10000; i++) {
  out.write(createLine());
}

out.end();
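Assuming Node.js and npm are available, the script can be run after installing the faker dependency used in the require above:
npm install @faker-js/faker
node index.js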
3. Creating S3 bucket to import data from
There is no need to create a S3 bucket for importing data. Data can be imported from any S3 bucket. This guides uses its own S3 bucket by running following command:
aws s3api create-bucket --bucket dynamodb-import-001
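Depending on the configured AWS region, the bucket location may have to be specified explicitly; outside of us-east-1 the call typically requires a location constraint, for example (eu-central-1 is just an arbitrary example region):
aws s3api create-bucket --bucket dynamodb-import-001 --create-bucket-configuration LocationConstraint=eu-central-1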
4. Uploading data to S3
Even though this guide uses its own dedicated bucket for importing to DynamoDB, it is recommended to always prefix import data. The following command gzips the source data and uploads it to the target bucket dynamodb-import-001 from the previous step under the prefix 001.
gzip -c data.json | aws s3 cp - s3://dynamodb-import-001/001/data.json.gz
Uploading raw data versus gzipped data
There is no import performance gain or cost optimization when data is gzipped before being uploaded compared to uploading raw data. Importing raw data and gzipped data takes roughly the same time, and costs are calculated after the data is decompressed.
Gzipped files are generally much smaller than the original raw data, so gzipping still has two obvious advantages:
- Uploads to S3 are faster
- Storage costs on S3 are less
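For comparison, uploading the raw file is a plain copy. To keep raw and gzipped data out of the same import, it is uploaded here under a separate (hypothetical) prefix 002/, and --input-compression-type would be set to NONE in step 5:
aws s3 cp data.json s3://dynamodb-import-001/002/data.json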
5. Creating the DynamoDB table and importing data
In this last step everything comes together in a single command that looks like this:
aws dynamodb import-table --s3-bucket-source S3Bucket=dynamodb-import-001,S3KeyPrefix=001/ --input-format DYNAMODB_JSON --table-creation-parameters file://table.json --input-compression-type GZIP
There are a few things happening here:
- --s3-bucket-source S3Bucket=dynamodb-import-001,S3KeyPrefix=001/ points to the S3 bucket created in step 3 and specifies the prefix.
- --input-format DYNAMODB_JSON specifies that the import data is in DynamoDB JSON format. Other values are CSV and ION.
- --table-creation-parameters file://table.json points to the table.json file created in step 1.
- --input-compression-type GZIP tells AWS that the uploaded data is gzipped.
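For completeness, importing the CSV file from step 2 would look roughly like this. It is assumed here that data.csv is gzipped and uploaded under a separate prefix (003/), that its first line serves as the header, and that the TableName in table.json has been changed, since every import creates a new table:
gzip -c data.csv | aws s3 cp - s3://dynamodb-import-001/003/data.csv.gz
aws dynamodb import-table --s3-bucket-source S3Bucket=dynamodb-import-001,S3KeyPrefix=003/ --input-format CSV --table-creation-parameters file://table.json --input-compression-type GZIP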
Time to import data
The time it takes to import data from S3 to DynamoDB depends on the amount of imported data: the more items are imported, the longer it takes. It turns out, however, that import times do not grow linearly; the time per item improves the larger the files get.
Here is a table with the different sample sizes of data used in this guide:
| Items | Time to import |
| --- | --- |
| 1 | 2m 30s |
| 10 | 2m 35s |
| 100 | 2m 40s |
| 1,000 | 2m 45s |
| 10,000 | 3m 00s |
| 100,000 | 5m 00s |
| 1,000,000 | 20m 30s |
| 10,000,000 | 98m 40s |
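The import itself runs asynchronously: import-table immediately returns a description that contains an ImportArn. Progress and the final status of an import can be checked with the CLI, for example (the ARN placeholder has to be replaced with the value returned by import-table):
aws dynamodb list-imports
aws dynamodb describe-import --import-arn <import-arn>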