
I need to transfer data from an S3 bucket to a GCP bucket. I read the S3 file into a pandas DataFrame, write it back out as a Parquet file, and upload that to the GCP bucket, but this does not work. The last line of code is the one that fails, with this error: pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 3

import boto3
import io
from google.cloud import storage
import pandas as pd

buffer = io.BytesIO()
s3 = boto3.resource('s3',
                    aws_access_key_id='MyKey',
                    aws_secret_access_key='MySecretKey')
s3_object = s3.Object('my_bucket_s3', '2022/test.parquet')  # avoid shadowing the builtin `object`
s3_object.download_fileobj(buffer)
buffer.seek(0)  # rewind the buffer before reading it
df = pd.read_parquet(buffer)

client = storage.Client()
bucket = client.get_bucket('my_bucket_gcp')

bucket.blob('TEST/test.parquet').upload_from_string(df.to_parquet(), 'application/octet-stream')
  • I would suggest not going this route; you should upload just the Parquet file. A DataFrame is not a file format, so the error is expected. End users who wish to use this Parquet file can use pandas/pyspark to construct a DataFrame from it. I hope that helps. https://stackoverflow.com/questions/37003862/how-to-upload-a-file-to-google-cloud-storage-on-python-3 – teedak8s Jun 29 '22 at 18:40
  • How can I download and upload the file, or transfer it from S3 to GCP? – Rodrigo Maurin Lopez Jun 29 '22 at 19:10
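Following up on the comment above: since the object is already a Parquet file, there is no need to round-trip it through a DataFrame — the bytes can be copied from S3 to GCS as-is. A minimal sketch, reusing the bucket and key names from the question (credentials are assumed to come from the environment; all names are placeholders):

```python
def transfer_s3_to_gcs(s3_bucket, s3_key, gcs_bucket_name, gcs_key):
    """Copy an object from S3 to GCS byte-for-byte, without pandas."""
    # Imports are deferred so the sketch only requires boto3 and
    # google-cloud-storage when it is actually called.
    import io
    import boto3
    from google.cloud import storage

    # Download the S3 object into memory.
    buffer = io.BytesIO()
    s3 = boto3.resource('s3')  # AWS credentials from the environment
    s3.Object(s3_bucket, s3_key).download_fileobj(buffer)
    buffer.seek(0)  # rewind before uploading

    # Upload the same bytes to GCS.
    client = storage.Client()  # GCP credentials from the environment
    bucket = client.get_bucket(gcs_bucket_name)
    bucket.blob(gcs_key).upload_from_file(buffer)

# transfer_s3_to_gcs('my_bucket_s3', '2022/test.parquet',
#                    'my_bucket_gcp', 'TEST/test.parquet')
```

For objects too large to hold in memory, the same idea works with a temporary file (`download_file` / `upload_from_filename`) instead of a `BytesIO` buffer.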

0 Answers