Context:
I'm currently working on a data pipeline project where the aim is to sync MongoDB documents into SQL Server. However, the schema and the data live in separate documents, of the form shown below:
Schema Document:
{
  "schema_id": "someHex",
  "key_map": [
    {
      "key": "a",
      "column_name": "colName1",
      "type": "Int"
    },
    {
      "key": "b",
      "column_name": "colName2",
      "type": "String"
    },
    {
      "key": "c",
      "column_name": "colName3",
      "type": "Boolean"
    }
  ]
}
Data Document:
{
  "data_id": "someHex",
  "schema_id": "someHex",
  "data": [
    {
      "key": "a",
      "value": 252
    },
    {
      "key": "b",
      "value": "some_string"
    },
    {
      "key": "c",
      "value": true
    }
  ]
}
The destination table should look like this:
| schema_id | colName1 | colName2    | colName3 |
|-----------|----------|-------------|----------|
| someHex   | 252      | some_string | 1        |
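To make the intended mapping concrete, here is a minimal sketch of the pivot from the key/value documents above into the flat row shown in the table. The documents are hand-written stand-ins for the Mongo records, and the `normalize` function name and the Boolean-to-BIT coercion are my assumptions, not part of any existing tool:

```python
# Hypothetical in-memory stand-ins mirroring the two document shapes above.
schema_doc = {
    "schema_id": "someHex",
    "key_map": [
        {"key": "a", "column_name": "colName1", "type": "Int"},
        {"key": "b", "column_name": "colName2", "type": "String"},
        {"key": "c", "column_name": "colName3", "type": "Boolean"},
    ],
}

data_doc = {
    "data_id": "someHex",
    "schema_id": "someHex",
    "data": [
        {"key": "a", "value": 252},
        {"key": "b", "value": "some_string"},
        {"key": "c", "value": True},
    ],
}

def normalize(schema_doc, data_doc):
    """Pivot a key/value data document into one flat row keyed by column name."""
    # Map each logical key to its destination column and declared type.
    columns = {e["key"]: (e["column_name"], e["type"]) for e in schema_doc["key_map"]}
    row = {"schema_id": data_doc["schema_id"]}
    for entry in data_doc["data"]:
        column_name, sql_type = columns[entry["key"]]
        value = entry["value"]
        # SQL Server has no native boolean type; store booleans as BIT (0/1),
        # which is why the table shows 1 rather than True.
        if sql_type == "Boolean":
            value = int(bool(value))
        row[column_name] = value
    return row

print(normalize(schema_doc, data_doc))
# → {'schema_id': 'someHex', 'colName1': 252, 'colName2': 'some_string', 'colName3': 1}
```

The lookup is driven entirely by `key_map`, so new schema documents produce new column layouts without code changes.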
Question:
The issue lies in finding a convenient, maintainable way to sync such data using tools like Fivetran or Airbyte, which would also handle error management and the like. I've researched various open-source tools and have found none that offer this capability, so any help would be much appreciated.
I have currently implemented the sync like this:
MongoDB -> Airbyte -> S3 -> Lambda -> SQL Server
Airbyte syncs the MongoDB data to S3; an AWS Lambda function then transforms/normalizes the JSON files into columnar form and pushes the rows to SQL Server.
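For the load step, a sketch of how the Lambda might turn a normalized row dict into a parameterized INSERT. This is an assumption about the implementation, not the actual Lambda code; I use an in-memory SQLite database purely as a stand-in for the SQL Server target (a real Lambda would use a driver such as pyodbc, and the `synced` table name is hypothetical):

```python
import sqlite3

def build_insert(table, row):
    """Build a parameterized INSERT statement from a normalized row dict."""
    cols = list(row)
    placeholders = ", ".join("?" for _ in cols)
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    return sql, [row[c] for c in cols]

# Demo against in-memory SQLite as a stand-in for the real SQL Server table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE synced (schema_id TEXT, colName1 INTEGER, "
    "colName2 TEXT, colName3 INTEGER)"
)
row = {"schema_id": "someHex", "colName1": 252,
       "colName2": "some_string", "colName3": 1}
sql, params = build_insert("synced", row)
conn.execute(sql, params)
print(conn.execute("SELECT * FROM synced").fetchone())
# → ('someHex', 252, 'some_string', 1)
```

Keeping the statement parameterized (rather than interpolating values) lets the driver handle quoting and types, which matters when the schema documents drive which columns exist.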