
I have a JSON file of the following form:

    {'query': {'tool': 'domainquery', 'query': 'example.org'},
     'response': {'result_count': '1',
                  'total_pages': '1',
                  'current_page': '1',
                  'matches': [{'domain': 'example2.org',
                               'created_date': '2015-07-25',
                               'registrar': 'registrar_10'}]}}

I have a list of the following form:

    removal_list = ["example2.org", "example3.org"...]

I am trying to loop through removal_list and remove every instance of each item from the JSON file. The problem is the running time: removal_list contains 110,000 items. I have tried to speed this up with set() and isdisjoint, but it does not seem to make any difference.

The code I currently have to do this is:

    removal_list= set(removal_list)
    for domain in removal_list:
        for i in range(len(JSON_file)):
            if int(JSON_file[i]['response']['result_count'])>0:  
                for j in range(len(JSON_file[i]['response']['matches'])):
                    for item in JSON_file[i]['response']['matches'][j]['domain']:
                        if not removal_list.isdisjoint(JSON_file[i]['response']['matches'][j]['domain']):
                            del(JSON_file[i]['response']['matches'][j]['domain'])
                        else: 
                            pass

Does anyone have any suggestions on how to speed this process up? Thanks in advance.

  • I suggest using a binary chop (binary search); it might help. https://stackoverflow.com/questions/9501337/binary-search-algorithm-in-python – 2wen May 23 '22 at 18:39
  • Try *hoisting* common sub-expressions; for example, save `JSON_file[i]['response']` in a variable, and use it wherever you use that expression. – Scott Hunter May 23 '22 at 18:39
  • @2wen: Doesn't that require sorted lists? – Scott Hunter May 23 '22 at 18:41
  • Are you saying that **any** value in the removal_list when observed as a value in a dictionary **anywhere** in the main dictionary, has to be removed? I think you could make this clearer by showing an input data structure and the expected output structure. As it stands, your code looks remarkably convoluted and may not need to be that complex – DarkKnight May 23 '22 at 18:54

1 Answer


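The loops in the question are 'inverted'. Instead of looping over removal_list, iterate over JSON_File (which is clearly a list of dictionaries) once and, for each dictionary in its 'matches' list, test whether the domain is in removal_list. With removal_list held as a set, each membership test takes constant time on average, so the total work grows with the number of matches rather than with the number of matches multiplied by the 110,000 removal entries.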

Let's have just two dictionaries in the JSON_File list and then show the code to process them.

    # A set, despite the name: membership tests are O(1) on average
    removal_list = {"example2.org", "example3.org"}

    d1 = {'query': {'tool': 'domainquery', 'query': 'example.org'},
          'response': {'result_count': '1',
                       'total_pages': '1',
                       'current_page': '1',
                       'matches': [{'domain': 'example2.org',
                                    'created_date': '2015-07-25',
                                    'registrar': 'registrar_10'}]}}
    d2 = {'query': {'tool': 'domainquery', 'query': 'example.org'},
          'response': {'result_count': '1',
                       'total_pages': '1',
                       'current_page': '1',
                       'matches': [{'domain': 'example3.org',
                                    'created_date': '2015-07-25',
                                    'registrar': 'registrar_10'}]}}

    JSON_File = [d1, d2]

    for j in JSON_File:
        # Skip entries with a missing or empty 'matches' list
        if matches := j['response'].get('matches'):
            for match in matches:
                if match.get('domain') in removal_list:
                    # Removes only the 'domain' key; the rest of the match stays
                    del match['domain']

    print(JSON_File)

Assumption:

If result_count is non-zero, there will be a non-empty 'matches' list, so there is no need to examine the 'result_count' value explicitly.
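If the goal is to drop each matching entry entirely rather than only its 'domain' key, a variant along the following lines might work. This is a minimal sketch, not part of the original answer: the function name `drop_matches`, the extra `example4.org` entry, and the idea of decrementing `result_count` by the number of dropped entries are all assumptions made for illustration.

```python
def drop_matches(records, removal_set):
    """Remove whole match entries whose domain is in removal_set (in place).

    Returns the total number of entries removed. Hypothetical helper.
    """
    removed_total = 0
    for record in records:
        response = record.get('response', {})
        matches = response.get('matches')
        if not matches:
            continue  # nothing to filter in this record
        kept = [m for m in matches if m.get('domain') not in removal_set]
        dropped = len(matches) - len(kept)
        if dropped:
            response['matches'] = kept
            # Assumption: result_count should shrink by the number dropped.
            # It is stored as a string in the sample data, so convert both ways.
            if 'result_count' in response:
                response['result_count'] = str(int(response['result_count']) - dropped)
            removed_total += dropped
    return removed_total


records = [
    {'query': {'tool': 'domainquery', 'query': 'example.org'},
     'response': {'result_count': '2',
                  'total_pages': '1',
                  'current_page': '1',
                  'matches': [{'domain': 'example2.org',
                               'created_date': '2015-07-25',
                               'registrar': 'registrar_10'},
                              {'domain': 'example4.org',  # made-up entry that is kept
                               'created_date': '2015-07-25',
                               'registrar': 'registrar_10'}]}},
]

removed = drop_matches(records, {"example2.org", "example3.org"})
print(removed, records[0]['response'])
```

Building a new list with a comprehension avoids deleting from a list while iterating over it, which is easy to get wrong with index-based loops.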

Note:

Requires Python 3.8+ (for the `:=` assignment expression used in the loop).
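On an older interpreter, the same loop can probably be written without the assignment expression. The following is a sketch under that assumption; the function name `strip_domains` and the second, match-less record are hypothetical and only there to show that records without a 'matches' list are skipped.

```python
def strip_domains(records, removal_set):
    """Delete the 'domain' key from every match whose domain is in removal_set."""
    for record in records:
        # Plain assignment followed by a truth test replaces the := expression
        matches = record['response'].get('matches')
        if matches:
            for match in matches:
                if match.get('domain') in removal_set:
                    del match['domain']


records = [
    {'response': {'result_count': '1',
                  'matches': [{'domain': 'example2.org',
                               'created_date': '2015-07-25',
                               'registrar': 'registrar_10'}]}},
    {'response': {'result_count': '0'}},  # hypothetical record with no 'matches'
]

strip_domains(records, {"example2.org", "example3.org"})
print(records)
```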

DarkKnight