
In the script, I check the first two characters of each text file. If they are "[{", which means the file contains JSON, I run the rest of the code.

However, I have to open the file twice with `with open(f, 'r', encoding='utf-8', errors='ignore') as infile:`, which is duplicated. Is there a better way to write this code?

import glob
import json

result = []

for f in glob.glob("D:/xxxxx/*.txt"):
    print("file_name: ", f)
    with open(f, 'r', encoding='utf-8', errors='ignore') as infile:
        first_two_char = infile.read(2)
        print(first_two_char)
        if first_two_char == "[{":
            # the file is opened a second time here to read it from the beginning
            with open(f, 'r', encoding='utf-8', errors='ignore') as infile:
                json_file = json.load(infile, strict=False)
                print(len(json_file))
                result.append(json_file)  # append the parsed JSON content to the list

print(len(result))
rpanai
rui jiang
  • 2
    I suppose you could always use [seek](https://python-reference.readthedocs.io/en/latest/docs/file/seek.html) to reset the cursor rather than reopening the file. – Anthony Labarre Aug 17 '20 at 16:04
  • 2
    Your approach is wrong. Instead of making sure if it's JSON and reading, just **TRY** reading it as JSON and if it doesn't work, do nothing... – Tomerikoo Aug 17 '20 at 16:10
  • @Tomerikoo Thanks a lot! Yes, you are right. I have changed my code accordingly. It looks better and works well. Thanks again. – rui jiang Aug 17 '20 at 16:59
  • @AnthonyLabarre Thank you! You really answered my question. Next time when I come across with this issue, I will try `seek`. – rui jiang Aug 17 '20 at 17:01

1 Answer

You could call `seek(0)` to move the file pointer back to the start. Generally, seeking to an arbitrary position doesn't work with files opened in text mode, because there is an intermediate buffer for bytes-to-string decoding. But `seek(0)` and seeking to the end of the file do work.

import glob
import json

result = []

for f in glob.glob("D:/xxxxx/*.txt"):
    print("file_name: ", f)
    with open(f, 'r', encoding='utf-8', errors='ignore') as infile:
        first_two_char = infile.read(2)
        print(first_two_char)
        if first_two_char == "[{":
            infile.seek(0)  # rewind instead of reopening the file
            json_file = json.load(infile, strict=False)
            print(len(json_file))
            result.append(json_file)  # append the parsed JSON content to the list

print(len(result))
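
To try the rewind approach without touching real data, here is a minimal sketch that wraps the peek-then-`seek(0)` logic in a function and runs it on two throwaway files. The function name `load_if_json_array` and the sample file names are made up for illustration; they are not part of the original script.

```python
import json
import os
import tempfile


def load_if_json_array(path):
    """Return the parsed JSON if the file starts with "[{", else None.

    Hypothetical helper: peeks at two characters, then rewinds with seek(0).
    """
    with open(path, 'r', encoding='utf-8', errors='ignore') as infile:
        if infile.read(2) != "[{":
            return None
        infile.seek(0)  # offset 0 is always a valid target in text mode
        return json.load(infile, strict=False)


# Build two sample files in a temporary directory (illustrative data only).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.txt")
    bad = os.path.join(tmp, "bad.txt")
    with open(good, "w", encoding="utf-8") as fh:
        fh.write('[{"a": 1}]')
    with open(bad, "w", encoding="utf-8") as fh:
        fh.write("plain text")

    good_result = load_if_json_array(good)
    bad_result = load_if_json_array(bad)

print(good_result)
print(bad_result)
```

Note that a file which starts with "[{" but is not valid JSON would still raise `json.JSONDecodeError` here, since the two-character check is only a heuristic.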

But really, just attempting the conversion and catching the error is a better way to go. What if the first two characters look like JSON only by coincidence and the rest of the file is not valid?

import glob
import json

result = []

for f in glob.glob("D:/xxxxx/*.txt"):
    print("file_name: ", f)
    with open(f, 'r', encoding='utf-8', errors='ignore') as infile:
        try:
            result.append(json.load(infile))
        except json.JSONDecodeError:
            pass  # not valid JSON, so skip this file

print(len(result))
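
As a rough illustration of why catching the error is more robust, the sketch below runs the try/except approach over three sample files, one of which starts with "[{" but is truncated. The function `load_json_files`, the `skipped` list, and the sample file names are assumptions added for the demo, not part of the original code.

```python
import glob
import json
import os
import tempfile


def load_json_files(pattern):
    """Parse every file matching pattern; collect names of files that fail.

    Hypothetical helper built around the try/except idea above.
    """
    results = []
    skipped = []
    for path in sorted(glob.glob(pattern)):
        with open(path, 'r', encoding='utf-8', errors='ignore') as infile:
            try:
                results.append(json.load(infile, strict=False))
            except json.JSONDecodeError:
                skipped.append(os.path.basename(path))
    return results, skipped


# Illustrative sample data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "a_valid.txt": '[{"id": 1}, {"id": 2}]',
        "b_truncated.txt": '[{"id": 1}, {"id"',  # passes a "[{" check, but is broken
        "c_plain.txt": "just some notes",
    }
    for name, text in samples.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(text)

    results, skipped = load_json_files(os.path.join(tmp, "*.txt"))

print(len(results))
print(skipped)
```

One design consequence worth keeping in mind: this version accepts any valid JSON document, not only arrays of objects, so a file containing just a number or a quoted string would also be appended. If only "[{"-style content is wanted, an `isinstance(data, list)` check after parsing is one option.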
tdelaney