How to load a large dataset from S3 to OpenSearch

Question

I am following the guidance at https://docs.aws.amazon.com/opensearch-service/latest/developerguide/integrations.html, but when the data has a lot of rows, the Lambda function times out at around 15 minutes. Is there another way to quickly load a dataset from S3 into OpenSearch using AWS services? A Python example, if there is one, would be appreciated. Thanks in advance.

Did you follow the guide only for the Lambda option, or did you consider the other options too, for example Kinesis Data Firehose? Also, if you want to use Lambda with S3, you just have to make sure you split your data into chunks that the Lambda can process in less than 15 minutes. - petrch, Feb 16 '22 at 23:19
Hey, yeah, I am using chunks right now. My dataset is not a stream; it is a large dataset of several GB that is refreshed every morning, and I need to load it into OpenSearch each morning. - Jochen, Feb 17 '22 at 14:19
Another thing I noticed: when I trigger around 200 Lambda runs, around 50 fail. I guess it is because I am using a smaller instance for OpenSearch. If anyone has experience with this, please share. Thanks. - Jochen, Feb 17 '22 at 16:29
I have seen elsewhere (https://stackoverflow.com/questions/36826352/aws-lambda-toomanyrequestsexception-rate-exceeded) that the basic Lambda limit is 100 parallel executions. Ask for more if you need more. - petrch, Feb 17 '22 at 19:53
Yeah, I guess it is also affected by the OpenSearch instance size, so I limit concurrency to 40. - Jochen, Feb 17 '22 at 21:13
I am trying to find a way to upload/refresh the large dataset (several GB). It does not have to use Lambda only; an example using a batch job or any other AWS service would be appreciated. - Jochen, Feb 17 '22 at 21:14
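
The chunking approach discussed in the comments could be sketched roughly as follows. This is only an illustration, not the code used in the thread: the function name `bulk_bodies`, the index name, and the 5 MiB default size cap are assumptions. It groups documents into newline-delimited `_bulk` request bodies so that each worker sends a bounded amount of data per request.

```python
import json
from typing import Iterable, Iterator


def bulk_bodies(docs: Iterable[dict], index: str,
                max_bytes: int = 5 * 1024 * 1024) -> Iterator[str]:
    """Yield newline-delimited _bulk bodies, each at most max_bytes when possible.

    Hypothetical helper: the name and the default size cap are illustrative.
    A single document larger than max_bytes is still emitted on its own.
    """
    pairs = []  # action+source line pairs collected for the current body
    size = 0    # UTF-8 size of the current body in bytes
    for doc in docs:
        action = json.dumps({"index": {"_index": index}})
        source = json.dumps(doc)
        pair = f"{action}\n{source}\n"
        pair_size = len(pair.encode("utf-8"))
        # Start a new body if adding this pair would exceed the cap.
        if pairs and size + pair_size > max_bytes:
            yield "".join(pairs)
            pairs, size = [], 0
        pairs.append(pair)
        size += pair_size
    if pairs:
        yield "".join(pairs)
```

Each yielded string can be sent as the body of one `_bulk` request. Sending fewer, larger requests from a limited number of workers may also reduce the failures seen when many Lambda invocations hit a small domain at once, though that depends on the domain's capacity.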

score 0 · Answer 1 · answered Jul 18 '23 at 18:51

AWS offers Amazon OpenSearch Ingestion pipelines, which take a YAML file that specifies an S3 source and a sink; the sink can be either an OpenSearch Service domain or an OpenSearch Serverless collection. Since the S3 source is pull-based (the pipeline reads the objects from the bucket), the pipeline's IAM role requires permissions to access the data in S3.
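
Since the question asks for a Python example, here is a minimal sketch of what creating such a pipeline might look like. Everything specific in it is an assumption: the bucket, prefix, role ARN, endpoint, index, pipeline name, and capacity units are placeholders, and the YAML keys follow the S3 scan source layout as I understand it from the OpenSearch Ingestion documentation, so check them against the current docs before use.

```python
# Placeholder pipeline definition: an S3 scan source feeding an OpenSearch sink.
# The "newline" codec treats each line of an object as one event; pick the
# codec that matches the actual file format.
PIPELINE_TEMPLATE = """\
version: "2"
s3-batch-pipeline:
  source:
    s3:
      codec:
        newline:
      compression: "none"
      aws:
        region: "{region}"
        sts_role_arn: "{role_arn}"
      scan:
        buckets:
          - bucket:
              name: "{bucket}"
              filter:
                include_prefix:
                  - "{prefix}"
  sink:
    - opensearch:
        hosts: ["{endpoint}"]
        index: "{index}"
        aws:
          region: "{region}"
          sts_role_arn: "{role_arn}"
"""


def build_pipeline_config(bucket, prefix, region, role_arn, endpoint, index):
    """Fill the YAML template with caller-supplied (hypothetical) values."""
    return PIPELINE_TEMPLATE.format(
        bucket=bucket, prefix=prefix, region=region,
        role_arn=role_arn, endpoint=endpoint, index=index,
    )


def create_pipeline(name, config_body, min_units=1, max_units=4):
    """Create the pipeline with boto3's OpenSearch Ingestion ("osis") client.

    Requires boto3 and AWS credentials; the unit counts are illustrative.
    """
    import boto3  # imported here so building the config works without boto3

    client = boto3.client("osis")
    return client.create_pipeline(
        PipelineName=name,
        MinUnits=min_units,
        MaxUnits=max_units,
        PipelineConfigurationBody=config_body,
    )
```

With this approach the pipeline does the reading and indexing itself, so there is no Lambda timeout to work around; the role referenced by `sts_role_arn` is the one that needs read access to the bucket.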
