1

I would like to process a paragraph sentence by sentence, but the XML format I have does not mark up sentences. My input looks like this:

<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/"> 
Recently, a first step in this direction has been taken
in the form of the framework called &#8220;dynamical fingerprints&#8221;,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>

I wish my input looked more like this:

<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">
<s>Recently, a first step in this direction has been taken
in the form of the framework called &#8220;dynamical fingerprints&#8221;,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> </s><s>Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></s></p>

That way I could extract each sentence whole, like this:

<s xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">Recently, a first step in this direction has been taken
in the form of the framework called &#8220;dynamical fingerprints&#8221;,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> </s>

<s xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></s>

My test code is:

from lxml import etree

if __name__=="__main__":

  xml1 = '''<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/"> 
Recently, a first step in this direction has been taken
in the form of the framework called &#8220;dynamical fingerprints&#8221;,
which has been developed to relate the experimental and MSM-derived
kinetic information.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> Several research
groups are now focused on developing protocols to systematically cross-validate
the MSM predictions and obtain MSM parameters using an optimization
protocol that produces the best estimate of the few slowest dynamics
modes of the protein dynamics.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>
'''


  print(xml1)

  root = etree.XML(xml1)
  sentences_info = []
  for sentence in root:
    # I want to do more fun stuff here with the result
    sentence_text = sentence.text
    ref_ids = []
    for reference in sentence:  # iterating an element yields its children
        if 'rid' in reference.attrib:
            ref_id = reference.attrib['rid']
            ref_ids.append(ref_id)
    sent_par = {'reference_ids': ref_ids,'text': sentence_text}
    sentences_info.append(sent_par)
    print(sent_par)
Parfait
rbf22

2 Answers

0

Converting BeautifulSoup objects into strings and then cleaning with regex works well. For example:

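import re
from urllib.request import urlopen  # Python 3; on Python 2 use urllib2.urlopen

from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('yourlink.com'), 'lxml')  # 'yourlink.com' is a placeholder URL

paragraphs = str(soup.find_all('p'))  # turn the list of <p> tags into one string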

sentences = paragraphs.split('<sup><xref ref-type="bibr" rid="ref56">56</xref></sup>') #creates a list of sentences

clean = []
for e in sentences:
    e = re.sub(r'(<.*?>)', '', e) #gets rid of the tags
    clean.append(e)

As far as I know, there's no built-in way to deal with sentences in XML, so each case needs its own makeshift solution.
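One makeshift approach that is a little more general can be sketched with lxml. It assumes a sentence ends at '.', '!' or '?' followed by whitespace and an uppercase letter, and it only looks at the paragraph's own text, so inline children such as the <sup> citations are treated as zero-width and are never split. The names BOUNDARY, _append_text and split_into_sentences are made up for this illustration, and the boundary rule will misfire on abbreviations like "e.g. Smith", so treat it as a starting point rather than a finished splitter:

```python
import copy
import re

from lxml import etree

# Assumed boundary rule: '.', '!' or '?', then whitespace, then an uppercase letter.
BOUNDARY = re.compile(r"[.!?]\s+(?=[A-Z])")


def _append_text(parent, text):
    """Append text to an lxml element that may already have children."""
    if not text:
        return
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def split_into_sentences(p):
    """Return new <s> elements built from p's mixed content; p is not modified."""
    ns = etree.QName(p).namespace
    s_tag = "{%s}s" % ns if ns else "s"

    # Flatten p into text chunks and child elements, in document order.
    tokens = [p.text or ""]
    for child in p:
        tokens.append(child)
        tokens.append(child.tail or "")

    # Find boundaries in p's own text only; child elements count as zero-width.
    full_text = "".join(t for t in tokens if isinstance(t, str))
    cuts = [m.end() for m in BOUNDARY.finditer(full_text)]

    sentences = [etree.Element(s_tag, nsmap=p.nsmap)]
    offset = 0  # position of the current token within full_text
    ci = 0      # index of the next unused cut
    for tok in tokens:
        if isinstance(tok, str):
            start = 0
            while ci < len(cuts) and cuts[ci] <= offset + len(tok):
                end = cuts[ci] - offset
                _append_text(sentences[-1], tok[start:end])
                sentences.append(etree.Element(s_tag, nsmap=p.nsmap))
                start = end
                ci += 1
            _append_text(sentences[-1], tok[start:])
            offset += len(tok)
        else:
            el = copy.deepcopy(tok)
            el.tail = None  # the tail is handled as its own text token
            sentences[-1].append(el)
    return sentences


# Shortened stand-in for the paragraph in the question.
xml = (
    '<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">'
    'A first step has been taken.'
    '<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> '
    'Several groups now cross-validate predictions.'
    '<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>'
)
p = etree.fromstring(xml)
ns = etree.QName(p).namespace
sentences = split_into_sentences(p)
for s in sentences:
    rids = [x.get("rid") for x in s.iter("{%s}xref" % ns)]
    print(rids, etree.tostring(s, encoding="unicode"))
```

The whitespace after a boundary stays with the earlier sentence, which matches the desired output in the question. Each <s> holds copies of the inline elements, so the original paragraph is left untouched.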

snapcrack
  • This is way too specific a fix for me to use. I need something very general. I will open another question along similar lines. – rbf22 Jun 17 '17 at 21:59
  • You're likely not going to get a "very general" way to grab sentences given your data. There's not a tool for that in the xml module, so you have to tailor-make solutions. – snapcrack Jun 17 '17 at 22:04
  • OK, I will build a solution and hopefully, the community can help me to clean it up. Thanks for the help! – rbf22 Jun 17 '17 at 22:21
  • No problem. I'm also happy to take another crack at it if you specify what you need to try to generalize within the solution. – snapcrack Jun 17 '17 at 22:39
0

The problem is that the parsed XML still carries its namespace, so every element's tag is prefixed with the namespace URI. Each element you parse looks like this:

<Element {https://jats.nlm.nih.gov/ns/archiving/1.0/}p at 0x108219048>

You can strip the namespace from every tag with the following function:

from lxml import etree

def remove_namespace(tree):
    for node in tree.iter():
        try:
            has_namespace = node.tag.startswith('{')
        except AttributeError:
            continue  # node.tag is not a string (node is a comment or similar)
        if has_namespace:
            node.tag = node.tag.split('}', 1)[1]

Then parse the XML and remove the namespace:

tree = etree.fromstring(xml1)
remove_namespace(tree) # remove namespace
tree.findall('sup') # output as [<Element sup at 0x1081d73c8>, <Element sup at 0x1081d7648>]
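If you would rather leave the tags untouched, lxml can also look elements up with the namespace intact. The sketch below is an alternative to remove_namespace, not a replacement for it; the prefix "j" is an arbitrary name chosen for this example, and the XML is a shortened stand-in for the paragraph in the question:

```python
from lxml import etree

xml = (
    '<p xmlns="https://jats.nlm.nih.gov/ns/archiving/1.0/">'
    'First sentence.<sup><xref ref-type="bibr" rid="ref56">56</xref></sup> '
    'Second sentence.<sup><xref ref-type="bibr" rid="ref57">57</xref></sup></p>'
)
tree = etree.fromstring(xml)

# Map an arbitrary prefix to the document's default namespace.
ns = {"j": "https://jats.nlm.nih.gov/ns/archiving/1.0/"}

sups = tree.findall("j:sup", namespaces=ns)          # direct <sup> children of <p>
rids = tree.xpath(".//j:xref/@rid", namespaces=ns)   # rid of every nested <xref>
local = etree.QName(tree).localname                  # tag name without the namespace

print(len(sups), list(rids), local)  # 2 ['ref56', 'ref57'] p
```

Note that the rid attribute lives on <xref>, which is nested inside <sup>, so it has to be looked up one level below the paragraph's direct children.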
titipata