1

I have a Python script that reads all the .txt files in a directory and checks whether each one returns True or False for the conditions in my script. I have thousands of .txt files containing JSON-formatted text. However, I'm getting an error message saying the JSON is invalid, even though I have checked that my text files are valid JSON. I want the script to determine whether each .txt file matches any of the conditions in my code below, and then write the result to a CSV file. Your help is very much appreciated! I have included my error message and an example .txt file.

Example .txt file with JSON formatting

{
    "domain_siblings": [
        "try.wisebuygroup.com.au",
        "www.wisebuygroup.com.au"
    ],
    "resolutions": [
        {
            "ip_address": "34.238.73.135",
            "last_resolved": "2018-04-22 17:59:05"
        },
        {
            "ip_address": "52.0.100.49",
            "last_resolved": "2018-06-24 17:05:06"
        },
        {
            "ip_address": "52.204.226.220",
            "last_resolved": "2018-04-22 17:59:06"
        },
        {
            "ip_address": "52.22.224.230",
            "last_resolved": "2018-06-24 17:05:06"
        }
    ],
    "response_code": 1,
    "verbose_msg": "Domain found in dataset",
    "whois": null
}

Error message

line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Code

import os
import json
import csv

path=r'./output/'
csvpath='C:/Users/xxx/Documents/csvtest'
file_n = 'file.csv'

def vt_result_check(path):
    vt_result = False
    for filename in os.listdir(path):
        with open(path + filename, 'r') as vt_result_file:
            vt_data = json.load(vt_result_file)

        # Look for any positive detected referrer samples
        # Look for any positive detected communicating samples
        # Look for any positive detected downloaded samples
        # Look for any positive detected URLs
        sample_types = ('detected_referrer_samples', 'detected_communicating_samples',
                        'detected_downloaded_samples', 'detected_urls')
        vt_result |= any(sample['positives'] > 0 for sample_type in sample_types
                                                 for sample in vt_data.get(sample_type, []))

        # Look for a Dr. Web category of known infection source
        vt_result |= vt_data.get('Dr.Web category') == "known infection source"

        # Look for a Forcepoint ThreatSeeker category of elevated exposure
        # Look for a Forcepoint ThreatSeeker category of phishing and other frauds
        # Look for a Forcepoint ThreatSeeker category of suspicious content
        threats = ("elevated exposure", "phishing and other frauds", "suspicious content")
        vt_result |= vt_data.get('Forcepoint ThreatSeeker category') in threats

    return str(vt_result)


if __name__ == '__main__':
    with open(file_n, 'w') as output:
        for i in range(vt_result_file):
            output.write(vt_result_file, vt_result_check(path))
bedford
  • I got no error while loading the file. The trouble is probably caused by the checks you have implemented. Can you point out the line of code at which the error is thrown? – mad_ Jul 26 '18 at 12:59
  • You might want to restrict your script to only parsing files with the suffix `.txt` -- see https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python – i alarmed alien Jul 26 '18 at 13:04

3 Answers

1

You are trying to decode JSON from an empty file (size 0). Check your file path and the content of that file.

Note: the example you have provided in your question is valid JSON, so it should load without a problem.
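
As a minimal sketch of that diagnosis, the snippet below first reproduces the message from the question by parsing an empty string, then shows one way to skip zero-byte files before calling `json.load`. The helper name `load_json_or_none` and the temporary file names are hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile


def load_json_or_none(file_path):
    """Return parsed JSON, or None if the file is empty or not valid JSON.

    Hypothetical helper, not part of the original script.
    """
    # A zero-byte file can never be valid JSON, so skip it up front.
    if os.path.getsize(file_path) == 0:
        return None
    with open(file_path, "r") as handle:
        try:
            return json.load(handle)
        except json.JSONDecodeError:
            return None


# Parsing an empty string raises the same error as in the question.
try:
    json.loads("")
except json.JSONDecodeError as err:
    empty_error = str(err)

# Try the helper on a temporary empty file and a temporary valid one.
with tempfile.TemporaryDirectory() as tmp_dir:
    empty_path = os.path.join(tmp_dir, "empty.txt")
    valid_path = os.path.join(tmp_dir, "valid.txt")
    open(empty_path, "w").close()
    with open(valid_path, "w") as handle:
        handle.write('{"response_code": 1}')

    empty_result = load_json_or_none(empty_path)
    valid_result = load_json_or_none(valid_path)
```

Returning `None` for unreadable files is just one option; logging the file name or raising a clearer error may suit a batch of thousands of files better.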

Andrej Kesely
  • @Andrej_Kesely thank you for your help! That seemed to fix my original error, but now I'm getting another one. The error message is `output.write((vt_result_check(path)))` `TypeError: write() argument must be str, not int`. How do I convert vt_result_check(path) to an int? – bedford Jul 27 '18 at 18:05
  • @bedford You are returning an `int` from your function `vt_result_check(path)`. Check the return statements inside it. You need to return a string. – Andrej Kesely Jul 27 '18 at 18:08
  • @Andrej_Kesely Thank you again for your response! I'm also trying to write the filename to my csv, and I'm getting the following error when doing so (I've updated my code above): `for i in range(vt_result_file):` `NameError: name 'vt_result_file' is not defined` – bedford Jul 27 '18 at 19:15
  • @bedford I would like to help you, but comments aren't a good place for that. If you have new questions that are not related to this one, you need to create a new question here on Stack Overflow. – Andrej Kesely Jul 27 '18 at 19:18
  • @Andrej_Kesely Thank you! I posted as a new question. Here is the link: https://stackoverflow.com/questions/51564344/trying-to-write-filename-to-csv-in-python I would appreciate your help! – bedford Jul 27 '18 at 19:24
  • @bedford If the question you originally asked has been answered, don't forget to close it by accepting one of the answers here. – Andrej Kesely Jul 27 '18 at 19:28
0

You are not opening files...

for filename in os.listdir(path):
    with open(path + filename, 'r') as vt_result_file:
        vt_data = json.load(vt_result_file)

`os.listdir` lists all entries in the path, both directories and files.
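
Because `os.listdir` returns subdirectories alongside files, a loop that opens every entry can fail on the wrong kind of entry. The following is a rough sketch of one way to keep only regular `.txt` files; the function name `list_txt_files` and the temporary entries are hypothetical.

```python
import os
import tempfile


def list_txt_files(directory):
    """Return sorted full paths of regular files in `directory` ending in .txt."""
    paths = []
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        # os.listdir returns subdirectories too, so keep regular files only.
        if os.path.isfile(full_path) and name.endswith(".txt"):
            paths.append(full_path)
    return sorted(paths)


with tempfile.TemporaryDirectory() as tmp_dir:
    # A directory whose name happens to end in .txt must not be opened as a file.
    os.mkdir(os.path.join(tmp_dir, "subdir.txt"))
    for name in ("a.txt", "notes.csv"):
        open(os.path.join(tmp_dir, name), "w").close()
    found = [os.path.basename(p) for p in list_txt_files(tmp_dir)]
```

Building the path with `os.path.join` also avoids depending on `path` ending in a separator, which `path + filename` silently requires.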

Adelina
  • `os.listdir(path='.')`: Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order, and does not include the special entries '.' and '..' even if they are present in the directory. – i alarmed alien Jul 26 '18 at 13:06
  • @Nuts Thank you for your response! I tried your recommended change but unfortunately I still get the same error message. Do you have any other advice? – bedford Jul 26 '18 at 14:23
  • Find all files in the path using https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python and then read them. – Adelina Jul 26 '18 at 14:28
  • @Nuts: your statement that "listdir - lists all dirs in the path" is wrong. – i alarmed alien Jul 26 '18 at 16:03
  • @Nuts thank you for your help! That seemed to fix my original error, but now I'm getting another one. The error message is `output.write((vt_result_check(path)))` `TypeError: write() argument must be str, not int`. How do I convert vt_result_check(path) to an int? – bedford Jul 27 '18 at 18:05
0

I suggest (1) limiting your script to only parsing .txt files, and (2) adding some basic error checking in the form of a try/except statement to catch any JSON errors that do occur. Something like this:

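import json
import os

def vt_result_check(path):
    vt_result = False
    for file in os.listdir(path):
        if not file.endswith(".txt"):  # skip anything that doesn't end in .txt
            continue

        # os.path.join works whether or not path ends with a separator
        with open(os.path.join(path, file), 'r') as vt_result_file:
            try:
                vt_data = json.load(vt_result_file)
                # do whatever you want with the json data
            except json.JSONDecodeError:
                # empty or malformed files end up here instead of stopping the loop
                print("Could not parse JSON file " + file)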

You can fill in the rest of your code around this.
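
One possible way to fill in the rest is sketched below: it applies the conditions from the question to each file separately and writes one filename/result row per file with `csv.writer`. The names `is_flagged` and `write_results`, the header row, and the sample reports are hypothetical. The sketch also assumes each file holds a JSON object, and it uses `sample.get("positives", 0)` where the question indexes `sample['positives']` directly.

```python
import csv
import json
import os
import tempfile

SAMPLE_TYPES = ("detected_referrer_samples", "detected_communicating_samples",
                "detected_downloaded_samples", "detected_urls")
THREATS = ("elevated exposure", "phishing and other frauds", "suspicious content")


def is_flagged(vt_data):
    """Apply the question's conditions to one parsed report (a dict)."""
    if any(sample.get("positives", 0) > 0
           for sample_type in SAMPLE_TYPES
           for sample in vt_data.get(sample_type, [])):
        return True
    if vt_data.get("Dr.Web category") == "known infection source":
        return True
    return vt_data.get("Forcepoint ThreatSeeker category") in THREATS


def write_results(directory, csv_path):
    """Write one (filename, result) row per parseable .txt file."""
    # newline="" is what the csv module expects for files it writes to.
    with open(csv_path, "w", newline="") as output:
        writer = csv.writer(output)
        writer.writerow(["filename", "result"])
        for name in sorted(os.listdir(directory)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(directory, name), "r") as handle:
                try:
                    vt_data = json.load(handle)
                except json.JSONDecodeError:
                    print("Could not parse JSON file " + name)
                    continue
            writer.writerow([name, is_flagged(vt_data)])


# Small demonstration with made-up reports in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    reports = {
        "clean.txt": '{"response_code": 1, "whois": null}',
        "flagged.txt": '{"detected_urls": [{"positives": 3}]}',
        "empty.txt": "",
    }
    for name, text in reports.items():
        with open(os.path.join(tmp_dir, name), "w") as handle:
            handle.write(text)
    csv_path = os.path.join(tmp_dir, "file.csv")
    write_results(tmp_dir, csv_path)
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))
```

Evaluating each file on its own, rather than accumulating a single `vt_result` across the whole directory, is a design choice made here so that every row in the CSV reflects one file.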

i alarmed alien
  • @i_alarmed_alien thank you for your help! That seemed to fix my original error, but now I'm getting another one. The error message is `output.write((vt_result_check(path)))` `TypeError: write() argument must be str, not int`. How do I convert vt_result_check(path) to an int? – bedford Jul 27 '18 at 18:04
  • To convert something to a string, use `str()`, e.g. `str(vt_result_check(path))`. Check out the Python documentation, as all this information is there: https://docs.python.org/3/library/stdtypes.html#str – i alarmed alien Jul 27 '18 at 19:26
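
The `str()` suggestion can be tried in isolation with the small sketch below. It uses an in-memory `io.StringIO` as a stand-in for the output file, and `result = 1` is a hypothetical non-string return value, so the exact wording of the exception may differ from the one in the comments above.

```python
import io

result = 1  # hypothetical stand-in for a non-string return value

buffer = io.StringIO()  # in-memory text stream standing in for the output file
raised_type_error = False
try:
    buffer.write(result)  # text streams only accept str
except TypeError:
    raised_type_error = True

# Converting with str() first makes the write succeed.
buffer.write(str(result) + "\n")
written = buffer.getvalue()
```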