How to build a named entity recognition(NER) model using spaCy for biomedical NER on CRAFT corpus?
It is difficult for me to pre-process the xml
files given in that corpus to any format used by spacy
, any little help would be highly appreciated.
I first converted the xml
files to json
format but that was not accepted by spacy
. What format of training data does spacy
expect? I even tried to build my own NER
model but was not able to pre-process the xml
files as given in this article.
Here is an example of training an NER model using spacy, including the expected format of training data (from spacy's docs):
import random
import spacy
TRAIN_DATA = [
("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]})]
nlp = spacy.blank("en")
optimizer = nlp.begin_training()
for i in range(20):
random.shuffle(TRAIN_DATA)
for text, annotations in TRAIN_DATA:
nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk("/model")
The XML file I am using is available online here. An example record looks like:
<passage>
<infon key="section_type">ABSTRACT</infon>
<infon key="type">abstract</infon>
<offset>141</offset>
<text>
Breast cancer is the most frequent tumor in women, and in nearly two-thirds of cases, the tumors express estrogen receptor alpha (ERalpha, encoded by ESR1). Here, we performed whole-exome sequencing of 16 breast cancer tissues classified according to ESR1 expression and 12 samples of whole blood, and detected 310 somatic mutations in cancer tissues with high levels of ESR1 expression. Of the somatic mutations validated by a different deep sequencer, a novel nonsense somatic mutation, c.2830 C>T; p.Gln944*, in transcriptional regulator switch-independent 3 family member A (SIN3A) was detected in breast cancer of a patient. Part of the mutant protein localized in the cytoplasm in contrast to the nuclear localization of ERalpha, and induced a significant increase in ESR1 mRNA. The SIN3A mutation obviously enhanced MCF7 cell proliferation. In tissue sections from the breast cancer patient with the SIN3A c.2830 C>T mutation, cytoplasmic SIN3A localization was detected within the tumor regions where nuclear enlargement was observed. The reduction in SIN3A mRNA correlates with the recurrence of ER-positive breast cancers on Kaplan-Meier plots. These observations reveal that the SIN3A mutation has lost its transcriptional repression function due to its cytoplasmic localization, and that this repression may contribute to the progression of breast cancer.
</text>
<annotation id="38">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="246" length="23"/>
<text>estrogen receptor alpha</text>
</annotation>
<annotation id="39">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="271" length="7"/>
<text>ERalpha</text>
</annotation>
<annotation id="40">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="291" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="41">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="392" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="42">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="512" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="43">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="720" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="44">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="868" length="7"/>
<text>ERalpha</text>
</annotation>
<annotation id="45">
<infon key="identifier">2099</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">47906</infon>
<location offset="915" length="4"/>
<text>ESR1</text>
</annotation>
<annotation id="46">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="930" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="47">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1048" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="48">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1087" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="49">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1201" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="50">
<infon key="identifier">25942</infon>
<infon key="type">Gene</infon>
<infon key="NCBI Homologene">32124</infon>
<location offset="1331" length="5"/>
<text>SIN3A</text>
</annotation>
<annotation id="51">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="185" length="5"/>
<text>women</text>
</annotation>
<annotation id="52">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="762" length="7"/>
<text>patient</text>
</annotation>
<annotation id="53">
<infon key="identifier">9606</infon>
<infon key="type">Species</infon>
<location offset="1031" length="7"/>
<text>patient</text>
</annotation>
<annotation id="54">
<infon key="identifier">29278</infon>
<infon key="type">Species</infon>
<location offset="397" length="10"/>
<text>expression</text>
</annotation>
<annotation id="55">
<infon key="identifier">29278</infon>
<infon key="type">Species</infon>
<location offset="517" length="10"/>
<text>expression</text>
</annotation>
<annotation id="56">
<infon key="identifier">c.2830C>T</infon>
<infon key="type">DNAMutation</infon>
<location offset="1054" length="10"/>
<text>c.2830 C>T</text>
</annotation>
<annotation id="57">
<infon key="identifier">CVCL:0031</infon>
<infon key="type">CellLine</infon>
<location offset="964" length="4"/>
<text>MCF7</text>
</annotation>
<annotation id="58">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1494" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="59">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="346" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="60">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="743" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="61">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1017" length="13"/>
<text>breast cancer</text>
</annotation>
<annotation id="62">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="477" length="6"/>
<text>cancer</text>
</annotation>
<annotation id="63">
<infon key="identifier">p.Q944*</infon>
<infon key="type">ProteinMutation</infon>
<location offset="642" length="9"/>
<text>p.Gln944*</text>
</annotation>
<annotation id="64">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="1130" length="5"/>
<text>tumor</text>
</annotation>
<annotation id="65">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="176" length="5"/>
<text>tumor</text>
</annotation>
<annotation id="66">
<infon key="identifier">c.2830C>T</infon>
<infon key="type">DNAMutation</infon>
<location offset="630" length="10"/>
<text>c.2830 C>T</text>
</annotation>
<annotation id="67">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="1258" length="14"/>
<text>breast cancers</text>
</annotation>
<annotation id="68">
<infon key="identifier">MESH:D009369</infon>
<infon key="type">Disease</infon>
<location offset="231" length="6"/>
<text>tumors</text>
</annotation>
<annotation id="69">
<infon key="identifier">MESH:D001943</infon>
<infon key="type">Disease</infon>
<location offset="141" length="13"/>
<text>Breast cancer</text>
</annotation>
</passage>