4

I'm trying to use the GCS compose() method to combine multiple CSV's into one single CSV.

The goal is to use Python's multiprocessing module to consume large CSV files without ever storing all of the data in one place until it gets to GCS. The intention is to use as little memory as possible while operating upon these large files. Ex: file gets broken into 4 parts, one per child process, work is done to the data, and each child process uploads 1/4 of the file to GCS where the compose method will combine each of the four files to create one CSV.

Panda's won't work because I am trying to avoid having the data all in one place (too much memory is consumed).

Below is the issue I am running into when trying to combine the CSV's at the end.

CSV 1:

col1,col2,col3
1,2,3
1,2,3
1,2,3

CSV 2:

col4,col5,col6
4,5,6
4,5,6
4,5,6

When I use the compose() method from the GCS API, I get this as my result in the destination file:

col1,col2,col3
1,2,3
1,2,3
1,2,3
col4,col5,col6
4,5,6
4,5,6
4,5,6

But what I am looking for is this:

col1,col2,col3,col4,col5,col6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6

The code to produce:

bucket = STORAGE_CLIENT.bucket(bucket_name)
destination = bucket.blob(destination_blob_name)
destination.content_type = "text/csv"
destination.compose(sources)

Does anyone have any suggestions on how I can merge the CSV's the way I want to?

2 Answers2

1

Because Cloud Storage compose doesn't allow to achieve what you want (compose is stupid, it takes one file and appends the content of the following to the previous one!) and if you have a large amount of data, I recommend you to use BigQuery!

  • Create an federated table on each type of file that you have
  • Perform an extract data query
EXPORT DATA OPTIONS(
  uri='gs://my_bucket/*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter=',') AS
SELECT * FROM `my_federated_table` JOIN/UNION ....

You will have several files in output, but with the correct format. If you want only one file in output, set the header to false and use compose to merge the files.

guillaume blaquiere
  • 66,369
  • 2
  • 47
  • 76
0

Looking at the Google Cloud Client Libraries for Cloud Storage don't support this feature.

In such cases you may use pandas, s tool used for data manipulation. This way, you may concatenate the columns of the two .csv files in the format you wish.

Here is also a Stack Overflow post with a code example on how to do the above.

Hope you find the above helpful.

tzovourn
  • 1,293
  • 8
  • 18