
I need to create a Python script that replaces strings in a JSON file. This file contains patent information, for example:

{
  "US-8163793-B2": {
    "publication_date": "20120424",
    "priority_date": "20090420",
    "family_id": "42261969",
    "country_code": "US",
    "ipc_code": "C07D417/14",
    "cpc_code": "C07D471/04",
    "assignee_name": "Hoffman-La Roche Inc.",
    "title": "Proline derivatives",
    "abstract": "The invention relates to a compound of formula (I) wherein A, R 1 -R 6  are as defined in the description and in the claims. The compound of formula (I) can be used as a medicament."
  }

However, there are about 15,000 entries. To normalize this document before performing word embedding, I use software that tags the terms it finds. The output looks like this:

 "Row_1" : {
  "COMPANY": [
    {
      "hitCount": 1,
      "sourceTitle": "",
      "sourceID": "",
      "docTitle": "",
      "docID": "Row_1",
      "hitID": "COMP642",
      "name": "Roche",
      "frag_vector_array": [
        "16#Hoffman-La {!Roche!} Inc."
      ],
      "totnosyns": 1,
      "goodSynCount": 1,
      "nonambigsyns": 1,
      "score": 1,
      "hit_loc_vector": [
        16
      ],
      "word_pos_array": [
        2
      ],
      "exact_string": "16#90-95",
      "exact_array": [
        {
          "fls": [
            16,
            90,
            95
          ]
        }
      ],
      "entityType": "COMPANY",
      "realSynList": [
        "Roche"
      ],
      "dictSynList": [
        "roche"
      ],
      "kvp": {
        "entityType": "COMPANY"
      },
      "rejected": false,
      "entityMeta": {
        "_ext_name": "Wikipedia",
        "_ext_uri": "http://en.wikipedia.org/wiki/Roche",
        "_termite_id": "TCP000392"
      },
      "section_vector": [
        8
      ],
      "dependencyMet": true,
      "fuzzyMatches": 0,
      "sectionMeta": {
        "8": "assignee_name|"
      }
    }
  ]
}

This output is also a JSON file and will be used as a dictionary.

What I need is to replace the "name" terms, for example "Roche", with the corresponding "hitID", like "COMP642", every time the term appears in the patents file.
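To make the structure clearer, here is roughly what I mean by using the tagger output as a dictionary (a minimal sketch; the variable names are only illustrative):

# a trimmed-down version of the tagger output shown above
tags = {
    "Row_1": {
        "COMPANY": [
            {"hitID": "COMP642", "name": "Roche"}
        ]
    }
}

# collect every tagged name -> hitID pair into a plain lookup dictionary
name_to_hit_id = {}
for row in tags.values():          # the "Row_1" style objects
    for hits in row.values():      # the "COMPANY" style arrays
        for hit in hits:
            name_to_hit_id[hit["name"]] = hit["hitID"]

print(name_to_hit_id)              # {'Roche': 'COMP642'}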

I am very new to Python, so any help or reading recommendations would be greatly appreciated.

Thank you!

EDIT

What I have tried so far:

import json
import re

# the mesh_* lookup dicts, entity_type_encoder, normalize_abstract and
# abstract_to_words used below are defined elsewhere in my code

def parse_termite_tags(file):  # wrapper added so the trailing return is valid
    termite_dict_list = list()

    with open(file, "rb") as datafile:
        json_data = json.loads(datafile.read().decode("utf-8"))

        # note: this assumes json_data is a list of paper objects
        for paper in json_data:

            termite_dict = dict()
            termite_dict_all_per_pmid = list()
            pmid = int(paper["docID"])
            abstract = paper["abstract"]

            gene_list = list()
            indication_mesh_list = list()
            drug_list = list()
            mirna_list = list()
            company_list = list()
            bioproc_list = list()
            protype_list = list()

            if "termiteTags" in paper:
                for termite_tag in paper["termiteTags"]:
                    type_entry = termite_tag["entityType"]

                    termite_dict = dict()
                    name = termite_tag["name"]
                    exact_tag_locations = termite_tag["exact_string"].split(",")
                    relevant_tag_locations = list()
                    words_to_replace = list()

                    # process and store termite annotations
                    if type_entry == "GENE":
                        gene_list.append({"Gene": termite_tag["hitID"]})
                    elif type_entry == "INDICATION":
                        info = termite_tag["entityMeta"]
                        if "mesh_tree" in info:
                            for e in list(filter(None, termite_tag["entityMeta"]["mesh_tree"].split(";"))):
                                try:
                                    mesh_id = mesh_tree_nr_to_id_dict[e]
                                    mesh_name = mesh_id_to_name_dict[mesh_id]
                                    indication_mesh_list.append({"name": mesh_name, "id": mesh_id, "key": e})
                                except KeyError:
                                    continue
                        elif "_ext_uri" in info:
                            url = termite_tag["entityMeta"]["_ext_uri"]
                            try:
                                mesh_id = url.split("term=")[1]
                                mesh_name = mesh_id_to_name_dict[mesh_id]
                                mesh_tree_nr = name_to_mesh_id_dict[mesh_name]
                                indication_mesh_list.append({"name": mesh_name, "id": mesh_id, "key": mesh_tree_nr})
                            except KeyError:
                                print("Issue with Mesh key indication")
                    elif type_entry == "DRUG":
                        drug_list.append(termite_tag["name"])
                    elif type_entry == "MIRNA":
                        mirna_list.append(termite_tag["hitID"])
                    elif type_entry == "COMPANY":
                        company_list.append(termite_tag["name"])
                    elif type_entry == "BIOPROC":
                        bioproc_list.append(termite_tag["name"])
                    elif type_entry == "PROTYP":
                        protype_list.append(termite_tag["name"])

                    # store info for positions with words to normalize in abstract text
                    for hit_number, hit in enumerate(termite_tag["frag_vector_array"]):
                        hit = hit.replace("\n", " ")

                        try:
                            match = re.match(r"^.*{!(.*)!}.*$", hit)
                            match_word = match.group(1)
                        except AttributeError:
                            try:
                                match = re.match(r"^.*{\*(.*)\*\}.*$", hit)
                                match_word = match.group(1)
                            except AttributeError:
                                # neither {!...!} nor {*...*} tag pattern matched; skip this fragment
                                print(hit)
                                continue

                        if match_word.lower() != name.lower():
                            exact_locus = exact_tag_locations[hit_number]
                            if not exact_locus.startswith("-"):
                                # sentence 0 is paper title
                                if not exact_locus.startswith("0"):
                                    relevant_tag_locations.append(exact_tag_locations[hit_number])
                                    words_to_replace.append(match_word)
                                    termite_dict["norm"] = name
                                    termite_dict["replace"] = match_word
                                    fr, t = exact_locus.split("#")[1].split("-")
                                    termite_dict["from"] = int(fr)
                                    termite_dict["to"] = int(t)
                                    termite_dict["len"] = int(t) - int(fr)
                                    termite_dict["entityCode"] = entity_type_encoder[termite_tag["entityType"]]
                                    termite_dict_all_per_pmid.append(termite_dict)
                                    termite_dict = dict()

            # abstract normalization and bag of words calculations
            if len(termite_dict_all_per_pmid) > 0:
                sorted_termite_dict_all_per_pmid = sorted(termite_dict_all_per_pmid,
                                                          key=lambda k: (k['from'], -k["len"], k["entityCode"]))
                normalized_abstract = normalize_abstract(sorted_termite_dict_all_per_pmid, abstract)
                termite_dict["Norm_Abstract"] = normalized_abstract
                cleaned_abstract_text = abstract_to_words(normalized_abstract)
                termite_dict["bag_of_words"] = list(set(cleaned_abstract_text))

            termite_dict["docID"] = pmid

            if "keywords" in paper:
                keywords = [w.strip() for w in paper["keywords"].split(";")]
                mesh_list = list()

                for word in keywords:
                    if len(word.split(" ")) == 1 and len(word) > 0 and word[0].islower():
                        word = word.title()
                    if word in name_to_mesh_id_dict:
                        mesh_id = name_to_mesh_id_dict[word]
                        try:
                            mesh_list.append([word, mesh_id, mesh_id_to_tree_nr_dict[mesh_id]])
                        except KeyError:
                            mesh_list.append([word, mesh_id, ""])
                termite_dict["MeshHeadings"] = mesh_list

            if len(gene_list) > 0:
                termite_dict["Genes"] = gene_list
            if len(indication_mesh_list) > 0:
                termite_dict["Indications"] = indication_mesh_list
            if len(drug_list) > 0:
                termite_dict["Drug"] = drug_list
            if len(mirna_list) > 0:
                termite_dict["MIRNA"] = mirna_list
            if len(company_list) > 0:
                termite_dict["Company"] = company_list
            if len(bioproc_list) > 0:
                termite_dict["Bioproc"] = bioproc_list
            if len(protype_list) > 0:
                termite_dict["Protyp"] = protype_list

            # add meta list to be able to query for gene and indication co-occurrence
            meta_list = list()
            if "Indications" in termite_dict:
                meta_list.extend([indi["key"] for indi in termite_dict["Indications"]])
            if "Genes" in termite_dict:
                meta_list.extend([gene["Gene"] for gene in termite_dict["Genes"]])
            if len(meta_list) > 0:
                termite_dict["all_genes_indications"] = meta_list

            termite_dict_list.append(termite_dict)
    return termite_dict_list
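For completeness, this is roughly how the function above gets called (the file name is just a placeholder for my tagged input):

if __name__ == "__main__":
    results = parse_termite_tags("tagged_papers.json")  # placeholder file name
    print(len(results), "documents processed")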
  • What have you tried so far? – ScottMcC Aug 16 '18 at 10:10
  • @ScottMcC I have edited with what I "tried" – Bruna B Aug 16 '18 at 11:43
  • Not sure I am following your question. Are you just trying to replace certain values in your first json with some corresponding values from your second json? If so, is there a corresponding key (i.e. "assignee_name" from the first file corresponds to some specific key from the second file)? Or are you trying to replace all values that match a particular string (regardless of key)? – benvc Aug 16 '18 at 14:14
  • @benvc Hello! Yes, it is more the first option: replace some values in the first file based on the second file. In the example I gave it was "COMPANY", which will always be detected in "assignee_name", and I want to normalize that name so the company always gets the same name (example: BMW Group and BMW AG should both become BMW). But there are other tags, such as "GENE"; in that case the gene name can be found in the title or abstract, and no matter where it appears, the string has to be normalized. Is it clearer now? Sorry. – Bruna B Aug 17 '18 at 07:07
  • Dear @benvc, I'm having a problem loading my JSON file. In your code I replaced patents with `patents = json.load(open("assignee.json"))` and companies with `companies = json.load(open("termite_assignee.json"))`, but I'm getting this error: `company_id = company['COMPANY'][0]['hitID']. TypeError: list indices must be integers, not str` ... can you help me with that? Thank you! – Bruna B Aug 20 '18 at 13:58
  • It may be that your dataset does not match the example in my answer exactly. Notice in my answer that the `companies` dataset is a set of objects (i.e. `"Row_1"`) that each include a `"COMPANY"` array. Does your dataset possibly include something in the hierarchy outside the `"Row_1"` type keys? Maybe the `"Row_1"` type keys are organized in an array? – benvc Aug 20 '18 at 14:15
  • @benvc yes, it does; I have a hierarchy that looks like this: `{ "TERMITE_RESULT" : [{ "RESP_META" : {"JSON_PRODUCER": "EFFICIENT"...}, "RESP_WARNINGS" : null ,"RESP_PAYLOAD": {} ,"RESP_MULTIDOC_PAYLOAD": { "Row_20" : { "COMPANY": [ {...]` – Bruna B Aug 20 '18 at 14:26
  • So you will need to adjust the solution to your dataset. Are there multiple `"RESP_MULTIDOC_PAYLOAD"` objects in the dataset, each containing just one `"Row_#"` object with a `"COMPANY"` array or does a single `"RESP_MULTIDOC_PAYLOAD"` object contain all of the `"Row_#"` objects? – benvc Aug 20 '18 at 14:50
  • @benvc There's only one `"RESP_MULTIDOC_PAYLOAD"` object, which contains all the `"Row_#"` objects. Sorry for the delayed answer, I got kind of sick. – Bruna B Aug 21 '18 at 11:56
  • Then you just need to set the `companies` variable to the `"RESP_MULTIDOC_PAYLOAD"` dataset after you load your json. Just add this line right before the loop: `companies = companies['TERMITE_RESULT'][0]['RESP_MULTIDOC_PAYLOAD']`. Then you should be accessing the same dataset as shown in the example answer and the loop should work as currently constructed. – benvc Aug 21 '18 at 12:54
  • @benvc I came back to work today and tried it; it worked perfectly. Thank you so much, now I'll start expanding to cover the other tags in my data. You helped me a lot, thank you very much for your patience. – Bruna B Aug 27 '18 at 11:46
  • @benvc just one more consideration, for a context where only the matched word should be modified in the abstract. Should I give the "location", or would it work the same way? For example, I have the `DRUG` tag, where the `hitID` should replace the `name`, but that word sits inside a text. It is almost like `COMPANY`, except the word is in the abstract rather than directly in `assignee_name`. – Bruna B Aug 28 '18 at 08:10
  • For that, look at `str.replace()`, see [How to use str.replace()?](https://stackoverflow.com/questions/9452108/how-to-use-string-replace-in-python-3-x) – benvc Aug 28 '18 at 12:49
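Putting the adjustments from the comment thread together, a rough sketch of the adjusted loading and replacement (the input file names come from the comments above; the DRUG handling and the output file name are only illustrative):

import json

# file names taken from the comments above
with open("assignee.json") as f:
    patents = json.load(f)
with open("termite_assignee.json") as f:
    companies = json.load(f)

# unwrap the extra hierarchy described in the comments:
# {"TERMITE_RESULT": [{"RESP_MULTIDOC_PAYLOAD": {"Row_20": {...}, ...}}]}
companies = companies["TERMITE_RESULT"][0]["RESP_MULTIDOC_PAYLOAD"]

for row in companies.values():
    # COMPANY hits: the tagged name sits in "assignee_name"
    for hit in row.get("COMPANY", []):
        for patent in patents.values():
            if hit["name"] in patent["assignee_name"]:
                patent["assignee_name"] = hit["hitID"]

    # DRUG hits: the tagged word sits inside the abstract text,
    # so str.replace() swaps it for the hitID
    for hit in row.get("DRUG", []):
        for patent in patents.values():
            patent["abstract"] = patent["abstract"].replace(hit["name"], hit["hitID"])

# output file name is only illustrative
with open("assignee_normalized.json", "w") as f:
    json.dump(patents, f, indent=2)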

1 Answer


If I am following what you are after, I think you want to replace the "assignee_name" in your patents data with the corresponding "hitID" from your companies data, based on the company "name" appearing somewhere in the patent "assignee_name".

A couple of loops should do the trick (though I am sure there is a more elegant approach). Of course, if you need something more sophisticated to determine whether the "name" from the companies data is really a match for the "assignee_name" in the patents data, you could add some regex, etc. to this approach, but this should get you pointed in the right direction.

import json

patents = json.loads("""{
        "US-8163793-B2": {
            "publication_date": "20120424",
            "assignee_name": "Hoffman-La Roche Inc."
        },
        "US-1234567-A1": {
            "publication_date": "20010101",
            "assignee_name": "ABC Inc."
        }
    }""")

companies = json.loads("""{
        "Row_1": {
            "COMPANY": [
                {
                    "hitID": "COMP642",
                    "name": "Roche"
                }
            ]
        },
        "Row_2": {
            "COMPANY": [
                {
                    "hitID": "COMP123",
                    "name": "ABC"
                }
            ]
        }
    }""")

# loop through companies data
for company in companies.values():
    company_id = company['COMPANY'][0]['hitID']
    company_name = company['COMPANY'][0]['name']

    # update patents where company "name" included in "assignee_name"
    for patent in patents.values():
        if company_name in patent['assignee_name']:
            patent['assignee_name'] = company_id

print(patents)

# OUTPUT (use json.dump to write to file if needed)
#
# {
#     'US-1234567-A1': {'assignee_name': 'COMP123', 'publication_date': '20010101'},
#     'US-8163793-B2': {'assignee_name': 'COMP642', 'publication_date': '20120424'}
# }
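As noted above, if the bare `in` check turns out to be too loose, a word-boundary regex could replace it; a minimal sketch (the helper name is made up):

import re

def name_matches(company_name, assignee_name):
    # match the company name only as a whole word, case-insensitively,
    # instead of a plain substring test
    pattern = r"\b" + re.escape(company_name) + r"\b"
    return re.search(pattern, assignee_name, re.IGNORECASE) is not None

# usage inside the loop above:
#     if name_matches(company_name, patent['assignee_name']):
#         patent['assignee_name'] = company_id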
  • Thank you very much for your help, it helped me a lot to understand the substitutions and I will try to implement for the rest of the variables. And thanks for the tips on how to improve my way of asking around here =] – Bruna B Aug 20 '18 at 06:53