
I download a zip file from AWS S3 and unzip it. Upon unzipping, all files are saved in the /tmp/ folder.

import zipfile

import boto3

s3 = boto3.client('s3')

s3.download_file('testunzipping', 'DataPump_10000838.zip', '/tmp/DataPump_10000838.zip')

with zipfile.ZipFile('/tmp/DataPump_10000838.zip', 'r') as zip_ref:
    zip_ref.extractall('/tmp/')
    lstNEW = zip_ref.namelist()

The output of lstNEW is something like this:

['DataPump_10000838/', '__MACOSX/._DataPump_10000838', 'DataPump_10000838/DockBooking', '__MACOSX/DataPump_10000838/._DockBooking', 'DataPump_10000838/LoadEquipment', '__MACOSX/DataPump_10000838/._LoadEquipment', ....]

LoadEquipment and DockBooking are files, but the rest are not. Is it possible to unzip the archive without creating those extra entries? Or is it possible to filter out the real files? Later, I need to take the correct files and gzip them, named like this:

$item_$unixepochtimestamp.csv.gz

Do I use the compress function?

x89
    I'm not sure exactly what your question is, or how you're using these files once they're unzipped. BUT... perhaps you simply want to unzip to memory (vs. writing to /tmp): https://stackoverflow.com/a/10909016/421195. In any case, it sounds like you DEFINITELY don't want "extractall()". Look here for alternatives: https://docs.python.org/3/library/zipfile.html – paulsm4 Oct 18 '21 at 20:13
  • Once they are unzipped, I want to convert them into gzip and store them in another S3 bucket. How can I achieve this with read? I mean, how can I gzip all the unzipped files without downloading/extracting them? @paulsm4 Also, I was writing to /tmp because of this answer: https://stackoverflow.com/a/69586599/12304000 – x89 Oct 18 '21 at 22:28
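
A minimal sketch of the in-memory route discussed in the comments above, assuming a placeholder output bucket (my-output-bucket) and reading the `$item_$unixepochtimestamp.csv.gz` pattern as "member name plus Unix epoch timestamp":

import gzip
import time
import zipfile

import boto3

s3 = boto3.client('s3')

with zipfile.ZipFile('/tmp/DataPump_10000838.zip', 'r') as zip_ref:
    for info in zip_ref.infolist():
        # Skip directory entries and the macOS resource-fork entries
        if info.is_dir() or info.filename.startswith('__MACOSX/'):
            continue
        item = info.filename.rsplit('/', 1)[-1]     # e.g. "DockBooking"
        key = f"{item}_{int(time.time())}.csv.gz"   # hypothetical naming
        data = zip_ref.read(info)                   # bytes in memory, nothing written to /tmp
        s3.put_object(
            Bucket='my-output-bucket',              # placeholder bucket
            Key=key,
            Body=gzip.compress(data),
        )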

1 Answer


To only extract certain files, you can pass a list to extractall:

with zipfile.ZipFile('/tmp/DataPump_10000838.zip', 'r') as zip_ref:
    lstNEW = list(filter(lambda x: not x.startswith("__MACOSX/"), zip_ref.namelist()))
    zip_ref.extractall('/tmp/', members=lstNEW)

The `__MACOSX` entries are not temporary files, but rather macOS's way of representing resource forks in ZIP archives, which don't natively support them.

that other guy
  • if I run the lambda function in some automatic pipeline on AWS, would this still work? I mean will it always be macOS? – x89 Oct 19 '21 at 08:03
  • Obviously this is a non-standard macOS convention that Apple can change whenever they feel like it, but so far they have always represented resource forks under the special name `__MACOSX` – that other guy Oct 19 '21 at 17:26
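
For the other part of the question (keeping only the real files), note that the filtered lstNEW above still contains the directory entry 'DataPump_10000838/'. A short sketch, assuming the same archive and using ZipInfo.is_dir() to drop both directories and the `__MACOSX` entries:

import zipfile

with zipfile.ZipFile('/tmp/DataPump_10000838.zip', 'r') as zip_ref:
    # Keep only regular file members; skip directories and resource forks
    real_files = [info.filename for info in zip_ref.infolist()
                  if not info.is_dir()
                  and not info.filename.startswith('__MACOSX/')]
    zip_ref.extractall('/tmp/', members=real_files)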