
I am trying to transfer large files from S3 to GCP using Airflow and its S3ToGoogleCloudStorageOperator. I have been able to transfer files of 400 MB, but when I try a larger one (2 GB) I get the following error:

[2018-09-19 12:30:43,907] {models.py:1736} ERROR - [Errno 28] No space left on device
Traceback (most recent call last):
  File "/home/jma/airflow/env/lib/python3.5/site-packages/airflow/models.py", line 1633, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/home/jma/airflow/env/lib/python3.5/site-packages/airflow/contrib/operators/s3_to_gcs_operator.py", line 156, in execute
    file_object.download_fileobj(f)
  File "/home/jma/airflow/env/lib/python3.5/site-packages/boto3/s3/inject.py", line 760, in object_download_fileobj
    ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
  File "/home/jma/airflow/env/lib/python3.5/site-packages/boto3/s3/inject.py", line 678, in download_fileobj
    return future.result()
  File "/home/jma/airflow/env/lib/python3.5/site-packages/s3transfer/futures.py", line 73, in result
    return self._coordinator.result()
  File "/home/jma/airflow/env/lib/python3.5/site-packages/s3transfer/futures.py", line 233, in result
    raise self._exception
  File "/home/jma/airflow/env/lib/python3.5/site-packages/s3transfer/tasks.py", line 126, in __call__
    return self._execute_main(kwargs)
  File "/home/jma/airflow/env/lib/python3.5/site-packages/s3transfer/tasks.py", line 150, in _execute_main
    return_value = self._main(**kwargs)
  File "/home/jma/airflow/env/lib/python3.5/site-packages/s3transfer/download.py", line 583, in _main
    fileobj.write(data)
  File "/home/jma/airflow/env/lib/python3.5/tempfile.py", line 622, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 28] No space left on device

The full code of the DAG can be found in this other SO question.

The file does not go directly from S3 to GCS but is first downloaded to the machine where Airflow is running. Looking at the traceback, boto seems to be responsible, but I still can't figure out how to fix the issue, that is, how to assign a folder for the file to be copied to temporarily.
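One possible workaround, sketched here purely as an illustration: the traceback ends in Python's tempfile module, so the intermediate copy lands in the interpreter's default temporary directory (TMPDIR, /tmp, ...). Assuming a larger disk is mounted somewhere, at a hypothetical /mnt/disks/scratch in this sketch, that default can be redirected before the task runs:

import os
import tempfile

# Hypothetical mount point with enough free space; adjust to your environment.
SCRATCH_DIR = '/mnt/disks/scratch'
os.makedirs(SCRATCH_DIR, exist_ok=True)

# Point both the environment and the current interpreter at the bigger disk.
os.environ['TMPDIR'] = SCRATCH_DIR   # picked up by child processes
tempfile.tempdir = SCRATCH_DIR       # forces it for this interpreter

# Any NamedTemporaryFile created afterwards (such as the one the S3 download
# is written into) will now live under SCRATCH_DIR instead of /tmp.

Of course this only helps if such a disk exists; the answer below avoids the local copy altogether.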

I would like to move very large files, so how should I set this up so that no such limitation is imposed?

I am running Airflow 1.10 from Google Cloud Shell in GCP, where I have 4 GB of free space in the home directory (the file being moved is 2 GB).

Picarus
  • Try solutions mentioned in: https://stackoverflow.com/questions/6998083/python-causing-ioerror-errno-28-no-space-left-on-device-results-32766-h – kaxil Sep 19 '18 at 08:43
  • How much physical memory does the instance have? – cwurtz Sep 19 '18 at 13:29
  • @cwurtz, the Google Cloud Shell runs on a g1-small instance that has 1.7 GB, so having a 2 GB file could be an issue, as suggested in the link contributed by kaxil – Picarus Sep 20 '18 at 00:20

1 Answer


I think the best option is to use the Google Cloud Storage Transfer Service, which lets you easily move data from S3 to GCP [1]. I don't think the volume of data is a problem; however, keep in mind the limits on the number of requests [2].

[1] https://cloud.google.com/storage-transfer/docs/
[2] https://cloud.google.com/storage-transfer/quotas
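As a minimal sketch of that suggestion (not taken from the answer itself; the project ID, bucket names and AWS keys are placeholders), a one-off transfer job can be created through the Storage Transfer Service API with google-api-python-client:

import datetime
from googleapiclient import discovery

# Placeholders; substitute your own values.
PROJECT_ID = 'my-gcp-project'
S3_BUCKET = 'my-s3-bucket'
GCS_BUCKET = 'my-gcs-bucket'
AWS_ACCESS_KEY_ID = 'AKIA...'
AWS_SECRET_ACCESS_KEY = 'secret'

# Uses Application Default Credentials for the GCP side.
client = discovery.build('storagetransfer', 'v1')

today = datetime.date.today()
start = {'year': today.year, 'month': today.month, 'day': today.day}

transfer_job = {
    'description': 'One-off S3 -> GCS transfer',
    'status': 'ENABLED',
    'projectId': PROJECT_ID,
    # Same start and end date: the job runs once, starting right away.
    'schedule': {'scheduleStartDate': start, 'scheduleEndDate': start},
    'transferSpec': {
        'awsS3DataSource': {
            'bucketName': S3_BUCKET,
            'awsAccessKey': {
                'accessKeyId': AWS_ACCESS_KEY_ID,
                'secretAccessKey': AWS_SECRET_ACCESS_KEY,
            },
        },
        'gcsDataSink': {'bucketName': GCS_BUCKET},
    },
}

job = client.transferJobs().create(body=transfer_job).execute()
print('Created transfer job:', job['name'])

Scheduling the job for the current day should make it run once immediately, which matches the behaviour Picarus describes in the comments below.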

  • yes, we had seen that option and I am implementing a test now. The issue I foresee (call me pessimistic) is that we will have a scheduler inside another scheduler, as we will need to keep using Airflow for the overall process. Do you know of any projects doing something similar, so that we avoid falling into the same traps? – Picarus Sep 20 '18 at 00:17
  • Unfortunately, I don't know of another project/tool to accomplish this. What if you perform these big transfers in chunks? The "No space left on device" error is surely due to the fact that when a file is uploaded to GCS the data is cached in temporary folders, which need space. Another option could be running this on a machine with more resources. – ETDeveloper Sep 20 '18 at 16:51
  • thanks for your comments, @ETDeveloper. The trick with using Google Transfer is not to specify a time, which (to the extent my limited test shows) launches the transfer immediately, so you can basically ignore the scheduler on Google Transfer; you then need to keep track of the transfer status to know when the operation has completed (a status-polling sketch follows these comments). – Picarus Sep 20 '18 at 23:55
  • Yes, that's an important limitation; now I understand why you use Airflow. Maybe you can create a feature request for the Google Transfer Service [1]. I know this is not an instant solution, but it will help the GCP engineering team to have this on their radar. [1] https://cloud.google.com/support/docs/issue-trackers#feature_requests – ETDeveloper Sep 26 '18 at 19:21
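Following up on the status-tracking point in the comments, a hedged sketch (again with a placeholder project ID and job name) that polls transferOperations until the run completes:

import json
import time
from googleapiclient import discovery

client = discovery.build('storagetransfer', 'v1')

# Placeholders: reuse the values returned by transferJobs().create().
PROJECT_ID = 'my-gcp-project'
JOB_NAME = 'transferJobs/1234567890'  # hypothetical job name

op_filter = json.dumps({'project_id': PROJECT_ID, 'job_names': [JOB_NAME]})

while True:
    response = client.transferOperations().list(
        name='transferOperations', filter=op_filter).execute()
    operations = response.get('operations', [])
    # Each operation reports done=True and a metadata status such as
    # SUCCESS, FAILED or ABORTED once it has finished.
    if operations and all(op.get('done') for op in operations):
        statuses = [op['metadata']['status'] for op in operations]
        print('Transfer finished with status(es):', statuses)
        break
    time.sleep(30)  # poll every 30 seconds

Wrapped in a PythonOperator or sensor, a loop like this would let the rest of the Airflow DAG wait for the Transfer Service run to finish.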