I'm trying to use the GCS compose() method to combine multiple CSVs into a single CSV.
The goal is to use Python's multiprocessing module to consume large CSV files without ever storing all of the data in one place until it reaches GCS, keeping memory usage as low as possible while operating on these files. For example: the file gets broken into 4 parts, one per child process; work is done on the data; and each child process uploads its quarter of the file to GCS, where the compose() method combines the four parts into one CSV (a sketch of this flow follows below).
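Here is a minimal sketch of that split-and-upload flow. The bucket name, part naming scheme, and input path are hypothetical placeholders, and the real per-row work is elided to a comment:

import csv
import io
from itertools import islice
from multiprocessing import Pool
from google.cloud import storage

def upload_part(task):
    part_index, start, n_rows = task
    # Each worker reads and serializes only its own slice, so no single
    # process ever holds the whole file in memory.
    client = storage.Client()  # create the client per process
    blob = client.bucket("my-bucket").blob(f"parts/part-{part_index}.csv")  # hypothetical names
    buf = io.StringIO()
    writer = csv.writer(buf)
    with open("big_input.csv", newline="") as f:  # hypothetical input path
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in islice(reader, start, start + n_rows):
            writer.writerow(row)  # the real per-row work would happen here
    blob.upload_from_string(buf.getvalue(), content_type="text/csv")
    return blob.name

def count_rows(path):
    with open(path, newline="") as f:
        return sum(1 for _ in f) - 1  # minus the header row

if __name__ == "__main__":
    total = count_rows("big_input.csv")
    per_part = -(-total // 4)  # ceiling division: rows per child process
    tasks = [(i, i * per_part, per_part) for i in range(4)]
    with Pool(4) as pool:
        part_names = pool.map(upload_part, tasks)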
Pandas won't work because I am trying to avoid having all of the data in one place (it consumes too much memory).
Below is the issue I run into when trying to combine the CSVs at the end.
CSV 1:
col1,col2,col3
1,2,3
1,2,3
1,2,3
CSV 2:
col4,col5,col6
4,5,6
4,5,6
4,5,6
When I use the compose() method from the GCS API, I get this as my result in the destination file:
col1,col2,col3
1,2,3
1,2,3
1,2,3
col4,col5,col6
4,5,6
4,5,6
4,5,6
But what I am looking for is this:
col1,col2,col3,col4,col5,col6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
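In other words, I want the files merged column-wise rather than stacked: row i of the second file appended to row i of the first. Locally, the merge I'm after would look roughly like this streaming zip (file names are hypothetical):

import csv

# Hypothetical local illustration of the column-wise merge I'm after:
# stream both part files in parallel and join row i with row i.
with open("csv1.csv", newline="") as f1, open("csv2.csv", newline="") as f2, \
        open("merged.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for left, right in zip(csv.reader(f1), csv.reader(f2)):
        writer.writerow(left + right)  # col1..col3 followed by col4..col6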
The code that produces the result above:

from google.cloud import storage

STORAGE_CLIENT = storage.Client()

bucket = STORAGE_CLIENT.bucket(bucket_name)
destination = bucket.blob(destination_blob_name)
destination.content_type = "text/csv"
destination.compose(sources)  # concatenates the source blobs byte for byte
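For reference, sources here is the list of part blobs the child processes uploaded, built roughly like this (the parts/ prefix is a placeholder from my example above):

# Hypothetical: collect the four uploaded parts, in order, for compose().
sources = [bucket.blob(f"parts/part-{i}.csv") for i in range(4)]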
Does anyone have suggestions on how I can merge the CSVs the way I want?