
I am looking for a way to process a ~4 GB file that is the result of an Athena query, and I am trying to find out:

  • Is there some way to split Athena's query result file into small pieces? As I understand it, this is not possible from the Athena side. It also looks like it is not possible to split it with Lambda: the file is too large, and it seems s3.open(input_file, 'r') does not work in Lambda :(

  • Is there some other AWS service that can solve this issue? I want to split this CSV file into small pieces (about 3-4 MB each) to send them to an external source via POST requests.

prosto.vint

3 Answers


You can use CTAS with Athena and its built-in partitioning capabilities.

A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.

It is usually advised to store the newly created table in a compressed format such as Parquet; however, you can also define it as CSV ('TEXTFILE').

Lastly, it is advised to partition a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which charges by data scanned. The meaningful partitioning depends on your use case and the way you want to split your data. The most common approach is to use time-based partitions, such as yearly, monthly, weekly, or daily. Use the logic by which you would like to split your files as the partition key of the newly created table.

CREATE TABLE random_table_name 
WITH (
     format = 'TEXTFILE',  
     external_location = 's3://bucket/folder/', 
     partitioned_by = ARRAY['year','month']) 
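     -- note: the partition columns ('year', 'month') must be the last columns in the SELECT list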
AS SELECT ...

When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.

Note that you might get files of different sizes depending on the amount of data in each partition. If this is a problem, or you don't have any meaningful partition logic, you can add a pseudo-random column to the data and partition by it:

substr(to_base64(sha256(to_utf8(some_column_in_your_data))), 1, 1) AS partition_char

Or you can use bucketing and specify how many buckets you want:

CREATE TABLE random_table_name 
WITH (
   format = 'TEXTFILE',  
   external_location = 's3://bucket/folder/', 
   bucketed_by = ARRAY['column_with_high_cardinality'],
   bucket_count = 100
) 
AS SELECT ...
Guy

You won't be able to do this with Lambda, as its memory is capped at around 3 GB and its file system storage at 512 MB.

Have you tried just running the split command on the file system (if you are using a Unix-based OS)?

If this job recurs and needs to be automated, and you still want to stay "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.

As for the specifics of how to use split, this other Stack Overflow question may help: How to split CSV files as per number of rows specified?
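For illustration, a rough sketch (in Python) of what such a splitting script could do once the file is on a local file system; the bucket, key, output paths, and size threshold below are placeholders, and the per-row size estimate is only approximate:

import csv

import boto3

# Placeholders - adjust to your bucket/key and desired piece size.
BUCKET = "my-athena-results-bucket"
KEY = "path/to/query-result.csv"
LOCAL_FILE = "/tmp/query-result.csv"
TARGET_BYTES = 4 * 1024 * 1024  # aim for roughly 4 MB per piece

s3 = boto3.client("s3")
s3.download_file(BUCKET, KEY, LOCAL_FILE)

with open(LOCAL_FILE, newline="") as src:
    reader = csv.reader(src)
    header = next(reader)
    piece, piece_index, piece_size = [header], 0, 0
    for row in reader:
        piece.append(row)
        piece_size += sum(len(field) + 1 for field in row)  # rough size estimate
        if piece_size >= TARGET_BYTES:
            with open(f"/tmp/piece-{piece_index:05d}.csv", "w", newline="") as out:
                csv.writer(out).writerows(piece)
            piece, piece_index, piece_size = [header], piece_index + 1, 0
    if len(piece) > 1:  # flush the remaining rows
        with open(f"/tmp/piece-{piece_index:05d}.csv", "w", newline="") as out:
            csv.writer(out).writerows(piece)

Each piece keeps the header row, so every file sent to the external service is self-describing.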

JD D
  • It would make sense before uploading the file to S3, but in my case this file is the result of an Athena query, which is stored in S3. For your proposal I see two variants: try to use the file system in Lambda (not sure about this, need to investigate), or add additional steps like downloading the CSV, splitting it on EC2 or some other instance, and uploading it again – prosto.vint Aug 20 '20 at 11:02
  • The file system in Lambda can only hold 512 MB; I meant your local file system or that of an EC2/ECS instance – JD D Aug 20 '20 at 11:11

You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-999 to get the first 1000 bytes.
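With boto3 that could look like this (the bucket and key are placeholders):

import boto3

s3 = boto3.client("s3")

# Fetch only the first 1000 bytes of the object (the Range header is inclusive).
response = s3.get_object(
    Bucket="my-athena-results-bucket",  # placeholder
    Key="path/to/query-result.csv",     # placeholder
    Range="bytes=0-999",
)
chunk = response["Body"].read()
print(len(chunk))  # 1000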

If you want to process the whole file in the same Lambda invocation, you can request a range that is about what you think you can fit in memory, process it, and then request the next. Process each chunk only up to its last line break, and prepend the remaining partial line to the next chunk. As long as you make sure that the previous chunk gets garbage collected and you don't accumulate a huge data structure, you should be fine.
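A minimal sketch of that sequential approach (the bucket, key, chunk size, and the process_lines helper are placeholders):

import boto3

s3 = boto3.client("s3")
BUCKET = "my-athena-results-bucket"  # placeholder
KEY = "path/to/query-result.csv"     # placeholder
CHUNK_SIZE = 64 * 1024 * 1024        # whatever comfortably fits in memory

def process_lines(lines):
    # placeholder for the real work, e.g. batching rows into 3-4 MB POST payloads
    pass

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
offset, leftover = 0, b""
while offset < size:
    end = min(offset + CHUNK_SIZE, size) - 1
    body = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{end}")["Body"].read()
    data = leftover + body
    cut = data.rfind(b"\n")
    if cut == -1:
        leftover = data               # no complete line yet, keep accumulating
    else:
        process_lines(data[:cut].splitlines())
        leftover = data[cut + 1:]     # partial last line, prepend to the next chunk
    offset = end + 1
if leftover:
    process_lines([leftover])         # final line without a trailing newline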

You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
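A sketch of such a coordinator (the split-worker function name and the event fields are invented for this example):

import json

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

BUCKET = "my-athena-results-bucket"  # placeholder
KEY = "path/to/query-result.csv"     # placeholder
CHUNK_SIZE = 256 * 1024 * 1024       # size each worker invocation can handle

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
for start in range(0, size, CHUNK_SIZE):
    end = min(start + CHUNK_SIZE, size) - 1
    lambda_client.invoke(
        FunctionName="split-worker",           # hypothetical worker function
        InvocationType="Event",                # asynchronous fan-out
        Payload=json.dumps({"bucket": BUCKET, "key": KEY, "size": size,
                            "start": start, "end": end}).encode("utf-8"),
    )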

Just splitting the file into equal parts won't work, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line, you can pad each chunk by that amount (both at the beginning and the end). When you read a chunk, you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding – with special handling of the first and last chunks, obviously.
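One concrete way to make the boundaries line up, continuing the fan-out sketch above: give each worker the convention that it owns every line whose terminating line break falls inside its byte range, and pad the fetched range on both sides by the maximum line size. MAX_LINE_BYTES is an assumed bound and the event fields match the hypothetical coordinator above:

import boto3

s3 = boto3.client("s3")
MAX_LINE_BYTES = 64 * 1024  # assumed upper bound on the size of a single CSV line

def handler(event, context):
    bucket, key = event["bucket"], event["key"]
    start, end, size = event["start"], event["end"], event["size"]

    # Pad the requested range on both sides; S3 clamps the end to the object size.
    fetch_start = max(start - MAX_LINE_BYTES, 0)
    body = s3.get_object(
        Bucket=bucket, Key=key,
        Range=f"bytes={fetch_start}-{end + MAX_LINE_BYTES}",
    )["Body"].read()

    # This worker owns every line whose terminating '\n' lies in [start, end],
    # plus an unterminated final line if it covers the object's last byte.
    # Start right after the last line break before the nominal start offset.
    offset = body.rfind(b"\n", 0, start - fetch_start) + 1 if start > 0 else 0

    lines = []
    while True:
        newline = body.find(b"\n", offset)
        if newline == -1:
            if end >= size - 1 and offset < len(body):
                lines.append(body[offset:])    # final line without a trailing newline
            break
        if fetch_start + newline > end:
            break                              # that line belongs to the next worker
        lines.append(body[offset:newline])
        offset = newline + 1

    process_lines(lines)                       # placeholder for the real work

def process_lines(lines):
    pass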

Theo