Using out-of-the-box dependencies, you will have to write the JSON-to-Parquet conversion result to a local file first. Then you can stream-read that file and upload it to S3.
AWS Lambda includes a 512 MB temporary file system (/tmp) for your code, and writing to it does not cause any performance hit. Depending on the size of your payload you may need to increase it; the ephemeral storage can be configured up to 10 GB.
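If the default 512 MB is not enough, the ephemeral storage size is a per-function setting. You would normally change it in the console, the CLI, or your infrastructure template, but as a rough sketch with the AWS SDK for JavaScript v2 (the function name json-to-parquet is just a placeholder):

const AWS = require("aws-sdk");
const lambda = new AWS.Lambda();

// Raise /tmp from the default 512 MB to 2 GB (valid range is 512-10240 MB).
lambda
  .updateFunctionConfiguration({
    FunctionName: "json-to-parquet", // placeholder function name
    EphemeralStorage: { Size: 2048 }, // size in MB
  })
  .promise()
  .then((res) => console.log(res.EphemeralStorage))
  .catch((err) => console.error(err));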
Pseudo-code (1):
const fs = require("fs");
const parquet = require("parquetjs"); // or parquetjs-lite, whichever you use
const AWS = require("aws-sdk");

const s3 = new AWS.S3();

exports.handler = async (event) => {
  const bodyRequest = {
    id: 1,
    payload: [
      { payloadid: 1, name: "name-1", value: "value-1" },
      { payloadid: 2, name: "name-2", value: "value-2" },
    ],
  };

  // Every column is declared as UTF8, so the numeric ids are stringified below.
  const schema = new parquet.ParquetSchema({
    id: { type: "UTF8" },
    payload: {
      repeated: true,
      fields: {
        payloadid: { type: "UTF8" },
        name: { type: "UTF8" },
        value: { type: "UTF8" },
      },
    },
  });

  // Write the Parquet file to Lambda's ephemeral storage.
  const writer = await parquet.ParquetWriter.openFile(
    schema,
    "/tmp/example.parquet"
  );
  await writer.appendRow({
    id: String(bodyRequest.id),
    payload: bodyRequest.payload.map((p) => ({
      payloadid: String(p.payloadid),
      name: p.name,
      value: p.value,
    })),
  });
  await writer.close();

  // Stream the file from /tmp into S3, then clean up.
  const fileStream = fs.createReadStream("/tmp/example.parquet");
  const s3Key = "2022/07/07/example.parquet";
  try {
    const params = { Bucket: "bucket", Key: s3Key, Body: fileStream };
    const result = await s3.putObject(params).promise();
    console.log(result);
  } catch (e) {
    console.error(e);
  } finally {
    // Remove the temp file so /tmp doesn't fill up across warm invocations.
    await fs.promises.unlink("/tmp/example.parquet").catch((err) => console.error(err));
    console.log("File has been deleted");
  }
};
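To sanity-check the output locally (or in a unit test), you can read the file back with the same library. A minimal sketch, assuming the file written above:

const parquet = require("parquetjs");

async function dumpParquet(path) {
  const reader = await parquet.ParquetReader.openFile(path);
  const cursor = reader.getCursor();
  // Iterate the rows and print them; payload comes back as an array of objects.
  let row = null;
  while ((row = await cursor.next())) {
    console.log(row);
  }
  await reader.close();
}

dumpParquet("/tmp/example.parquet").catch(console.error);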
Depending on the request throughput, you may need an SQS queue between services to perform the transformation in batches (a sketch of that transformation Lambda follows the flow). For example:
Request -> Lambda -> S3/json -> S3 Notification -> SQS (batch of ~50 messages) -> Lambda transformation -> S3/parquet
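A rough sketch of that batch-transformation Lambda, assuming each SQS record wraps an S3 event notification pointing at one JSON object with the same shape as bodyRequest above (bucket names and keys are illustrative):

const fs = require("fs");
const parquet = require("parquetjs");
const AWS = require("aws-sdk");

const s3 = new AWS.S3();

const schema = new parquet.ParquetSchema({
  id: { type: "UTF8" },
  payload: {
    repeated: true,
    fields: {
      payloadid: { type: "UTF8" },
      name: { type: "UTF8" },
      value: { type: "UTF8" },
    },
  },
});

exports.handler = async (event) => {
  const outPath = "/tmp/batch.parquet";
  const writer = await parquet.ParquetWriter.openFile(schema, outPath);

  // Each SQS record body is an S3 event notification for one uploaded JSON object.
  for (const record of event.Records) {
    const s3Event = JSON.parse(record.body);
    for (const rec of s3Event.Records || []) {
      const bucket = rec.s3.bucket.name;
      const key = decodeURIComponent(rec.s3.object.key.replace(/\+/g, " "));
      const obj = await s3.getObject({ Bucket: bucket, Key: key }).promise();
      const body = JSON.parse(obj.Body.toString("utf8"));
      await writer.appendRow({
        id: String(body.id),
        payload: (body.payload || []).map((p) => ({
          payloadid: String(p.payloadid),
          name: p.name,
          value: p.value,
        })),
      });
    }
  }
  await writer.close();

  // Upload the combined batch as a single Parquet object (key is illustrative).
  await s3
    .putObject({
      Bucket: "bucket",
      Key: `parquet/${Date.now()}.parquet`,
      Body: fs.createReadStream(outPath),
    })
    .promise();

  await fs.promises.unlink(outPath).catch(console.error);
};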
Another solution would be to use AWS Glue to transform the S3 objects from JSON to Parquet: https://hkdemircan.medium.com/how-can-we-json-css-files-transform-to-parquet-through-aws-glue-465773b43dad
The flow would be: Request -> Lambda -> S3/json, and then S3/json <- Glue Crawler -> S3/parquet. You can run that on a schedule (every X minutes) or trigger it via S3 events (a sketch follows).
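For the event-driven variant, the S3-triggered Lambda only has to kick off the crawler; the actual conversion is handled by the Glue setup described in the linked article. A minimal sketch with the AWS SDK for JavaScript, where the crawler name json-to-parquet-crawler is just a placeholder:

const AWS = require("aws-sdk");
const glue = new AWS.Glue();

exports.handler = async () => {
  try {
    // Start the crawler that picks up new objects under S3/json.
    await glue.startCrawler({ Name: "json-to-parquet-crawler" }).promise();
  } catch (e) {
    // CrawlerRunningException just means a previous run is still in progress.
    if (e.code !== "CrawlerRunningException") {
      throw e;
    }
  }
};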