How to convert XML NER data from the CRAFT corpus to spaCy's JSON format?

Question

How can I build a named entity recognition (NER) model using spaCy for biomedical NER on the CRAFT corpus?

I am finding it difficult to pre-process the XML files in that corpus into any format used by spaCy; any help would be highly appreciated. I first converted the XML files to JSON, but spaCy did not accept that. What format of training data does spaCy expect? I also tried to build my own NER model, but was not able to pre-process the XML files as described in this article.

Here is an example of training an NER model using spaCy, including the expected format of training data (from spaCy's docs):

import random

import spacy


TRAIN_DATA = [
        ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
        ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]})]

nlp = spacy.blank("en")
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")

The XML file I am using is available online here. An example record looks like:

<passage>
<infon key="section_type">ABSTRACT</infon>
<infon key="type">abstract</infon>
<offset>141</offset>
<text>
Breast cancer is the most frequent tumor in women, and in nearly two-thirds of cases, the tumors express estrogen receptor alpha (ERalpha, encoded by ESR1). Here, we performed whole-exome sequencing of 16 breast cancer tissues classified according to ESR1 expression and 12 samples of whole blood, and detected 310 somatic mutations in cancer tissues with high levels of ESR1 expression. Of the somatic mutations validated by a different deep sequencer, a novel nonsense somatic mutation, c.2830 C>T; p.Gln944*, in transcriptional regulator switch-independent 3 family member A (SIN3A) was detected in breast cancer of a patient. Part of the mutant protein localized in the cytoplasm in contrast to the nuclear localization of ERalpha, and induced a significant increase in ESR1 mRNA. The SIN3A mutation obviously enhanced MCF7 cell proliferation. In tissue sections from the breast cancer patient with the SIN3A c.2830 C>T mutation, cytoplasmic SIN3A localization was detected within the tumor regions where nuclear enlargement was observed. The reduction in SIN3A mRNA correlates with the recurrence of ER-positive breast cancers on Kaplan-Meier plots. These observations reveal that the SIN3A mutation has lost its transcriptional repression function due to its cytoplasmic localization, and that this repression may contribute to the progression of breast cancer.
</text>
<annotation id="38">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="246" length="23"/>
<text>estrogen receptor alpha</text>
</annotation>
<annotation id="39">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="271" length="7"/>
<text>ERalpha</text>
</annotation>
<annotation id="40">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="291" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="41">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="392" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="42">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="512" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="43">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="720" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="44">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="868" length="7"/>
<text>ERalpha</text>
</annotation>
<annotation id="45">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="915" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="46">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="930" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="47">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1048" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="48">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1087" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="49">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1201" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="50">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1331" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="51">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="185" length="5"/>
<text>women</text>
</annotation>
<annotation id="52">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="762" length="7"/>
<text>patient</text>
</annotation>
<annotation id="53">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="1031" length="7"/>
<text>patient</text>
</annotation>
<annotation id="54">
<infon key="identifier">29278</infon>
<infon key="type">Species</infon>
<location offset="397" length="10"/>
<text>expression</text>
</annotation>
<annotation id="55">
<infon key="identifier">29278</infon>
<infon key="type">Species</infon>
<location offset="517" length="10"/>
<text>expression</text>
</annotation>
<annotation id="56">
<infon key="identifier">c.2830C>T</infon>
<infon key="type">DNAMutation</infon>
<location offset="1054" length="10"/>
<text>c.2830 C>T</text>
</annotation>
<annotation id="57">
<infon key="identifier">CVCL:0031</infon>
<infon key="type">CellLine</infon>
<location offset="964" length="4"/>
<text>MCF7</text>
</annotation>
<annotation id="58">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1494" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="59">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="346" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="60">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="743" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="61">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1017" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="62">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="477" length="6"/>
<text>cancer</text>
</annotation>
<annotation id="63">
<infon key="identifier">p.Q944*</infon>
<infon key="type">ProteinMutation</infon>
<location offset="642" length="9"/>
<text>p.Gln944*</text>
</annotation>
<annotation id="64">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="1130" length="5"/>
<text>tumor</text>
</annotation>
<annotation id="65">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="176" length="5"/>
<text>tumor</text>
</annotation>
<annotation id="66">
<infon key="identifier">c.2830C>T</infon>
<infon key="type">DNAMutation</infon>
<location offset="630" length="10"/>
<text>c.2830 C>T</text>
</annotation>
<annotation id="67">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1258" length="14"/>
<text>breast cancers</text>
</annotation>
<annotation id="68">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="231" length="6"/>
<text>tumors</text>
</annotation>
<annotation id="69">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="141" length="13"/>
<text>Breast cancer</text>
</annotation>
</passage>

Please add something like what things you have already tried at your end to show your efforts too. Also some more description would also help people here to understand what is the problem. — VPK, Dec 10 '19 at 06:40
@Angelina Relevant [how-does-spacy-use-word-embeddings-for-named-entity-recognition-ner](https://stackoverflow.com/questions/44492430/how-does-spacy-use-word-embeddings-for-named-entity-recognition-ner) and [parsing-html-in-python-lxml-or-beautifulsoup-which-of-these-is-better-for-wha](https://stackoverflow.com/questions/1922032/parsing-html-in-python-lxml-or-beautifulsoup-which-of-these-is-better-for-wha) — stovfl, Dec 10 '19 at 10:55
What does the XML data look like? What data format does spacy expect? I bet if you put these things in the question, you would get an answer — Sam H., Dec 11 '19 at 19:49
@SamH. Thanks, I edited the question a bit. Do you have any idea about it? — ishas, Dec 12 '19 at 08:29
Angelina, I feel like I could be way more effective helping if you shared: (1) the structure of your XML data, (2) your current understanding of spaCy's NER format, (3) any code you have tried to convert between the two — Sam H., Dec 12 '19 at 16:43
@SamH. The XML data can be found at : https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmcids=PMC6207735 Also I think spacy's ner format is more like : [(text) {list of entities}], for example- [("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]})] as given in spacy's website.. It is given spacy uses JSON format. So I tried to convert this to JSON format but spacy-train gave an error after I tried to use this as the TRAIN DATA. — ishas, Dec 13 '19 at 17:56
@Angelina - do you need to keep track of `identifier` or `NCBI Homologene` properties? — Sam H., Dec 17 '19 at 05:11
@SamH. No, I don't need to keep track of these. The answer you gave will work fine for me. Thanks a lot for your efforts ! — ishas, Dec 19 '19 at 02:57
@angelina if the answer is satisfactory, I'd appreciate if you accepted it and/or upvoted. I'm in it for those made up, internet points. — Sam H., Dec 19 '19 at 03:07

score 3 · Accepted Answer · answered Dec 17 '19 at 06:57

Here is some code to get you going. It is not a complete solution, but the problem you posed is very hard, and you didn't have any starter code.

It does not track the identifier or NCBI Homologene properties, but I think those can be stored in a dictionary separately.
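
As a rough sketch of that separate dictionary: the snippet below maps each entity span to all of its infon key/value pairs. The helper name collect_infons and the shortened sample passage are made up for illustration; only the element and attribute names come from the XML in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened passage with a single annotation (illustration only).
sample = """
<passage>
<offset>141</offset>
<text>Breast cancer is the most frequent tumor in women.</text>
<annotation id="51">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="185" length="5"/>
<text>women</text>
</annotation>
</passage>"""

def collect_infons(passage):
    """Map passage-relative (start, end) spans to that annotation's infon values."""
    passage_offset = int(passage.find('offset').text)
    metadata = {}
    for ann in passage.findall('./annotation'):
        loc = ann.find('./location').attrib
        start = int(loc['offset']) - passage_offset
        end = start + int(loc['length'])  # exclusive end
        metadata[(start, end)] = {i.get('key'): i.text for i in ann.findall('./infon')}
    return metadata

span_metadata = collect_infons(ET.fromstring(sample))
print(span_metadata)
# {(44, 49): {'identifier': '9606', 'type': 'Species'}}
```

Keying on the span keeps the spaCy training tuples unchanged while still letting you look up the identifier for any predicted or gold entity afterwards.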

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

import spacy

nlp = spacy.load('en_core_web_sm')

# this is one child of the XML doc
# https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmcids=PMC6207735
passage_string = """
<passage>
<infon key="section_type">ABSTRACT</infon>
<infon key="type">abstract</infon>
<offset>141</offset>
<text>
Breast cancer is the most frequent tumor in women, and in nearly two-thirds of cases, the tumors express estrogen receptor alpha (ERalpha, encoded by ESR1). Here, we performed whole-exome sequencing of 16 breast cancer tissues classified according to ESR1 expression and 12 samples of whole blood, and detected 310 somatic mutations in cancer tissues with high levels of ESR1 expression. Of the somatic mutations validated by a different deep sequencer, a novel nonsense somatic mutation, c.2830 C>T; p.Gln944*, in transcriptional regulator switch-independent 3 family member A (SIN3A) was detected in breast cancer of a patient. Part of the mutant protein localized in the cytoplasm in contrast to the nuclear localization of ERalpha, and induced a significant increase in ESR1 mRNA. The SIN3A mutation obviously enhanced MCF7 cell proliferation. In tissue sections from the breast cancer patient with the SIN3A c.2830 C>T mutation, cytoplasmic SIN3A localization was detected within the tumor regions where nuclear enlargement was observed. The reduction in SIN3A mRNA correlates with the recurrence of ER-positive breast cancers on Kaplan-Meier plots. These observations reveal that the SIN3A mutation has lost its transcriptional repression function due to its cytoplasmic localization, and that this repression may contribute to the progression of breast cancer.
</text>
<annotation id="38">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="246" length="23"/>
<text>estrogen receptor alpha</text>
</annotation>
<annotation id="39">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="271" length="7"/>
<text>ERalpha</text>
</annotation>
<annotation id="40">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="291" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="41">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="392" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="42">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="512" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="43">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="720" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="44">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="868" length="7"/>
<text>ERalpha</text>
</annotation>
<annotation id="45">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="915" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="46">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="930" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="47">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1048" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="48">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1087" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="49">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1201" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="50">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1331" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="51">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="185" length="5"/>
<text>women</text>
</annotation>
<annotation id="52">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="762" length="7"/>
<text>patient</text>
</annotation>
<annotation id="53">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="1031" length="7"/>
<text>patient</text>
</annotation>
<annotation id="54">
<infon key="identifier">29278</infon>
<infon key="type">Species</infon>
<location offset="397" length="10"/>
<text>expression</text>
</annotation>
<annotation id="55">
<infon key="identifier">29278</infon>
<infon key="type">Species</infon>
<location offset="517" length="10"/>
<text>expression</text>
</annotation>
<annotation id="56">
<infon key="identifier">c.2830C>T</infon>
<infon key="type">DNAMutation</infon>
<location offset="1054" length="10"/>
<text>c.2830 C>T</text>
</annotation>
<annotation id="57">
<infon key="identifier">CVCL:0031</infon>
<infon key="type">CellLine</infon>
<location offset="964" length="4"/>
<text>MCF7</text>
</annotation>
<annotation id="58">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1494" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="59">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="346" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="60">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="743" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="61">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1017" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="62">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="477" length="6"/>
<text>cancer</text>
</annotation>
<annotation id="63">
<infon key="identifier">p.Q944*</infon>
<infon key="type">ProteinMutation</infon>
<location offset="642" length="9"/>
<text>p.Gln944*</text>
</annotation>
<annotation id="64">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="1130" length="5"/>
<text>tumor</text>
</annotation>
<annotation id="65">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="176" length="5"/>
<text>tumor</text>
</annotation>
<annotation id="66">
<infon key="identifier">c.2830C>T</infon>
<infon key="type">DNAMutation</infon>
<location offset="630" length="10"/>
<text>c.2830 C>T</text>
</annotation>
<annotation id="67">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1258" length="14"/>
<text>breast cancers</text>
</annotation>
<annotation id="68">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="231" length="6"/>
<text>tumors</text>
</annotation>
<annotation id="69">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="141" length="13"/>
<text>Breast cancer</text>
</annotation>
</passage>"""

# turn into an object
passage = ET.fromstring(passage_string)

# these 3 definitions are per-passage
passage_annotations = passage.findall('./annotation')
passage_offset = int(passage.find('offset').text)
# the <text> pasted above starts with a newline that the XML offsets do not
# count, so strip it; otherwise every span is shifted by one character
passage_text = passage.find('text').text.strip()

def get_entity_offset(offset_dict, passage_offset):
    """
    The XML location offset is relative to the start of the document,
    so subtract the passage offset (where the passage starts in the document).
    Returns (start, end) with an exclusive end, ready for slicing.
    """
    start = int(offset_dict['offset']) - passage_offset
    end = start + int(offset_dict['length'])
    return start, end

# collect entities as a list of tuples of the form
# (start, end, entity_type)
passage_entities = []
for ann in passage_annotations:
    entity_type = ann.find('./infon[@key="type"]').text
    od = ann.find('./location').attrib
    start, end = get_entity_offset(od, passage_offset)
    passage_entities.append((start, end, entity_type))

# this is one entry in the spacy NER format
# you would want many entries
spacyd_passage = (passage_text, {"entities": passage_entities})

# prove this worked
for ent in passage_entities:
    print(ent, passage_text[ent[0]:ent[1]])

# prints:
# (105, 128, 'Gene') estrogen receptor alpha
# (130, 137, 'Gene') ERalpha
# (150, 154, 'Gene') ESR1
# (251, 255, 'Gene') ESR1
# (371, 375, 'Gene') ESR1
# (579, 584, 'Gene') SIN3A
# (727, 734, 'Gene') ERalpha
# (774, 778, 'Gene') ESR1
# (789, 794, 'Gene') SIN3A
# (907, 912, 'Gene') SIN3A
# (946, 951, 'Gene') SIN3A
# (1060, 1065, 'Gene') SIN3A
# (1190, 1195, 'Gene') SIN3A
# (44, 49, 'Species') women
# (621, 628, 'Species') patient
# (890, 897, 'Species') patient
# (256, 266, 'Species') expression
# (376, 386, 'Species') expression
# (913, 923, 'DNAMutation') c.2830 C>T
# (823, 827, 'CellLine') MCF7
# (1353, 1366, 'Disease') breast cancer
# (205, 218, 'Disease') breast cancer
# (602, 615, 'Disease') breast cancer
# (876, 889, 'Disease') breast cancer
# (336, 342, 'Disease') cancer
# (501, 510, 'ProteinMutation') p.Gln944*
# (989, 994, 'Disease') tumor
# (35, 40, 'Disease') tumor
# (489, 499, 'DNAMutation') c.2830 C>T
# (1117, 1131, 'Disease') breast cancers
# (90, 96, 'Disease') tumors
# (0, 13, 'Disease') Breast cancer

Every printed slice should match the annotation text exactly. If a span comes out shifted by one character, for example catching a leading space or (, look for extra whitespace at the start of the passage text (the reason for the strip() above) or for an end computed as offset + length + 1. The end of a slice is exclusive, so offset + length is already correct.
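
A quick way to detect misaligned spans without reading the output by eye is to compare each slice with the annotation's own text child, which repeats the annotated string. This is only a sketch: check_spans is a hypothetical helper and the sample passage is shortened for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened passage (illustration only).
sample = """
<passage>
<offset>141</offset>
<text>Breast cancer is the most frequent tumor in women.</text>
<annotation id="65">
<infon key="type">Disease</infon>
<location offset="176" length="5"/>
<text>tumor</text>
</annotation>
<annotation id="51">
<infon key="type">Species</infon>
<location offset="185" length="5"/>
<text>women</text>
</annotation>
</passage>"""

def check_spans(passage):
    """Return (start, end, expected, found) for each span whose slice
    differs from the string stored in the annotation itself."""
    passage_offset = int(passage.find('offset').text)
    passage_text = passage.find('text').text
    mismatches = []
    for ann in passage.findall('./annotation'):
        loc = ann.find('./location').attrib
        start = int(loc['offset']) - passage_offset
        end = start + int(loc['length'])  # exclusive end
        expected = ann.find('text').text
        found = passage_text[start:end]
        if found != expected:
            mismatches.append((start, end, expected, found))
    return mismatches

mismatches = check_spans(ET.fromstring(sample))
print(mismatches)  # an empty list means every span lines up
```

Running this over every passage before training makes offset problems show up as a list of concrete mismatches rather than as silently wrong labels.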

Also, this code uses one child node, a passage of the linked doc. You will want to download that doc locally, and instead of passage = ET.fromstring(passage_string), you will create tree = ET.parse('path_to_file'):

Something like

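import xml.etree.ElementTree as ET

tree = ET.parse('path_to_file')
root = tree.getroot()
# passages are not direct children of the root, so search the whole tree
passages = root.iter('passage')

spacy_data = []

for passage in passages:
    text_node = passage.find('text')
    if text_node is None or not text_node.text:
        continue  # skip passages without text
    passage_offset = int(passage.find('offset').text)
    # strip whitespace around the text (the passage pasted above has a leading
    # newline); this assumes the XML offsets do not count that whitespace
    passage_text = text_node.text.strip()

    passage_entities = []
    for ann in passage.findall('./annotation'):
        entity_type = ann.find('./infon[@key="type"]').text
        od = ann.find('./location').attrib
        # offsets are document-relative; the slice end is exclusive
        start = int(od['offset']) - passage_offset
        end = start + int(od['length'])
        passage_entities.append((start, end, entity_type))

    # one (text, annotations) entry per passage, appended once all of
    # its annotations have been collected
    spacyd_passage = (passage_text, {"entities": passage_entities})
    spacy_data.append(spacyd_passage)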

This can still be improved upon. You'll want to split each passage_text into sentences using

import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(passage_text)
sents = list(doc.sents)

The tricky part is that you then need to do arithmetic to keep the entity offsets correct relative to each sentence. You will also want to check the start and end of each entity to make sure it stays within one sentence; it could conceivably be split by a sentence boundary, though probably not.
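
The offset arithmetic can be sketched in plain Python, independent of how the sentences were found. The function name split_entities_by_sentence and the toy text are assumptions for illustration; with spaCy, the sentence spans could be taken from sent.start_char and sent.end_char for each sent in doc.sents.

```python
def split_entities_by_sentence(text, sentence_spans, entities):
    """Re-base passage-level entity offsets onto individual sentences.

    sentence_spans: (start_char, end_char) pairs into text, end exclusive.
    entities: (start, end, label) tuples into text, end exclusive.
    Returns (examples, dropped), where examples are
    (sentence_text, {"entities": [...]}) tuples and dropped lists the
    entities that do not fit inside any single sentence.
    """
    examples = []
    placed = set()
    for s_start, s_end in sentence_spans:
        sent_entities = []
        for ent in entities:
            start, end, label = ent
            if start >= s_start and end <= s_end:
                # shift so that offsets count from the sentence start
                sent_entities.append((start - s_start, end - s_start, label))
                placed.add(ent)
        examples.append((text[s_start:s_end], {"entities": sent_entities}))
    dropped = [ent for ent in entities if ent not in placed]
    return examples, dropped

# Toy passage with hand-written sentence spans (illustration only).
text = "Breast cancer is frequent in women. MCF7 cells were used."
sentence_spans = [(0, 35), (36, 57)]
entities = [(0, 13, "Disease"), (29, 34, "Species"), (36, 40, "CellLine")]

examples, dropped = split_entities_by_sentence(text, sentence_spans, entities)
for sent_text, ann in examples:
    print(sent_text, ann)
print("dropped:", dropped)
```

Returning the dropped entities separately, rather than discarding them silently, makes it easy to inspect the cases where a sentence boundary cuts through an annotation.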
