
I have several sub-folders, each of which contains zipped Twitter files. I want Python to iterate through these sub-folders and turn the files into regular JSON files. I have more than 300 sub-folders, each containing about 1,000 or more of these zipped files. A sample file is named: 00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D

Thanks in advance

I have tried the code below, just to see if I can extract one of those files, but none of it worked.

import zipfile
zip_ref = zipfile.ZipFile('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0', 'r')
zip_ref.extractall('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D')
zip_ref.close()

I have also tried:

import tarfile
tar = tarfile.open('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D')
tar.extractall()
tar.close()

Here is my third try (and no luck):

import gzip
import json
with gzip.open('E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz%3FAWSAccessKeyId=AKIAJADH5KHBJMUZOPEA&Expires=1404665927&Signature=%2BdCn%252Ffn%2BFfRQhknWWcH%2BtnwlSfk%3D', 'rb') as f:
    d = json.loads(f.read().decode("utf-8"))

There is another very similar thread on Stack Overflow, but my question is different in that my zipped files are originally JSON. When I use this last method, I get this error: `json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)`

  • Possible duplicate of [How to loop through directories and unzip tar.gz files?](https://stackoverflow.com/questions/30293757/how-to-loop-through-directories-and-unzip-tar-gz-files) – Carlos Feb 16 '19 at 05:30
  • Thanks, but my problem now is how to turn gz files into JSON; that question is about tar files – Mike Sal Feb 16 '19 at 06:02
  • Your last attempt should have worked (and the previous two never) if the contents were actually JSON. What does the data you extracted look like, i.e. the output from `f.read()`? – tripleee Feb 16 '19 at 06:54
  • 1
    Possible duplicate of [Python 3, read/write compressed json objects from/to gzip file](https://stackoverflow.com/questions/39450065/python-3-read-write-compressed-json-objects-from-to-gzip-file) – tripleee Feb 16 '19 at 06:56
  • The files are JSON, and with Carlos's code I was able to see the files being decoded and printed. But I still don't know how to store the unzipped files. – Mike Sal Feb 16 '19 at 07:15

1 Answer


Here is a simple script that answers the question: it traverses the directory tree, checks whether each file (`fname`) is actually gzipped (via the magic number, because I'm cynical), and unzips it.

import json
import gzip
import binascii
import os


def is_gz_file(filepath):
    # Check the first two bytes against the gzip magic number (0x1f 0x8b).
    with open(filepath, 'rb') as test_f:
        return binascii.hexlify(test_f.read(2)) == b'1f8b'


rootDir = '.'
for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        filepath = os.path.join(dirName, fname)
        if is_gz_file(filepath):
            # Decompress and parse the JSON payload.
            f = gzip.open(filepath, 'rb')
            json_content = json.loads(f.read())
            print(json_content)

Tested and it works.
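
To also turn each archive into a regular .json file on disk (which is what the question asks for), here is a minimal sketch of that extra write step. It assumes the decompressed file should be written next to its source, and the URL-decoding of the name plus the stripped .gz suffix are assumptions based on the sample filename in the question; adjust rootDir to your actual folder.

import gzip
import os
import shutil
import urllib.parse

rootDir = 'E:/echoverse/Subdivided Tweets/Subdivided Tweets'  # assumed root folder

def is_gz_file(filepath):
    # Same gzip magic-number check as in the answer above.
    with open(filepath, 'rb') as test_f:
        return test_f.read(2) == b'\x1f\x8b'

for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        filepath = os.path.join(dirName, fname)
        if not is_gz_file(filepath):
            continue
        # Turn e.g. '00_activities.json.gz%3FAWSAccessKeyId=...' into '00_activities.json'.
        base = urllib.parse.unquote(fname).split('?')[0]
        if base.endswith('.gz'):
            base = base[:-3]
        out_path = os.path.join(dirName, base)
        # Stream the decompressed bytes straight into a regular .json file.
        with gzip.open(filepath, 'rb') as f_in, open(out_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)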

  • Thanks a lot. It almost completely answered my question. My last question is how to store the unzipped files in JSON format. – Mike Sal Feb 16 '19 at 07:02
  • actually, it gives the following error: Exception has occurred: json.decoder.JSONDecodeError Extra data: line 2 column 1 (char 1464) – Mike Sal Feb 16 '19 at 07:20
  • Updated code that stores the contents of the file as `json` objects. – Carlos Feb 16 '19 at 07:28
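
The "Extra data: line 2 column 1" error mentioned in the comments usually means each file holds one JSON object per line (JSON Lines, common for Twitter activity dumps) rather than a single object, so `json.loads` on the whole file fails as soon as the second line starts. A minimal sketch for that case, assuming the files really are JSON Lines; the input path and output filename below are placeholders:

import gzip
import json

# Placeholder path to one gzipped file; substitute a real one.
filepath = 'E:/echoverse/Subdivided Tweets/Subdivided Tweets/Tweets-0/00_activities.json.gz'

tweets = []
with gzip.open(filepath, 'rt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:                              # skip blank lines
            tweets.append(json.loads(line))   # one JSON object per line

# Store everything as a single regular JSON file (a list of objects).
with open('00_activities.json', 'w', encoding='utf-8') as out:
    json.dump(tweets, out)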