Extract from a match to the next match if a pattern is found in between

Question

I am a beginner in Python and am struggling with the problem explained below. I am also sharing my incomplete Python script, which does not work for this problem. I would be grateful for any support or guidance on my script.

File looks like this:

<Iteration>
  <Iteration_hit>Elememt1 Element1
    abc1 hit 1
  .
  .
</Iteration>
<Iteration>
  <Iteration_hit>Elememt2 Element2
    abc2 hit 1
  .
  .
</Iteration>
<Iteration>
  <Iteration_hit>Elememt3 Element3
    abc3 hit 1
  .
  .
</Iteration>
<Iteration>
  <Iteration_hit>Elememt4 Element4
    abc4 hit 1
  .
  .
</Iteration>

I need everything from <Iteration> to </Iteration> for each element that matches an entry in my list of elements. For example, for Element2 and Element4 the output file should look like this:

<Iteration>
  <Iteration_hit>Elememt2 Element2
    abc2 hit 1
  .
  .
</Iteration>
<Iteration>
  <Iteration_hit>Elememt4 Element4
    abc4 hit 1
  .
  .
</Iteration>

Script

#!/usr/bin/python
x = raw_input("Enter your xml file name: ")
xml = open(x)
l = raw_input("Enter your list file name: ")
lst = open(l)
Id = list()
ylist = list()
import re
for line in lst:
        stuff=line.rstrip()
        stuff.split()
        Id.append(stuff)
for ele in Id:
        for line1 in xml:
                if line1.startswith("  <Iteration_hit>"):
                        y = line1.split()
#                       print y[1]
                        if y[1] == ele: break

You do know that there are libraries to read/write xml files, right? — tglaria, Jan 14 '16 at 15:28
Don't use regular expressions to parse XML. Python ships with an `xml` package just for this purpose. — Joel Cornett, Jan 14 '16 at 15:29
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Łukasz Rogalski, Jan 14 '16 at 15:46

score 0 · Accepted Answer · answered Jan 14 '16 at 15:32

Using regex to parse XML isn't recommended. Use a library such as lxml instead, which you can install with pip install lxml. You can then select the appropriate elements to output using lxml and XPath as follows (I have taken the liberty of closing the <Iteration_hit> tags in your XML and wrapping everything in a single <root> element):

content = '''
<root>
<Iteration>
  <Iteration_hit>Elememt1 Element1
    abc1 hit 1
  </Iteration_hit>
</Iteration>
<Iteration>
  <Iteration_hit>Elememt2 Element2
    abc2 hit 1
  </Iteration_hit>
</Iteration>
<Iteration>
  <Iteration_hit>Elememt3 Element3
    abc3 hit 1
  </Iteration_hit>
</Iteration>
<Iteration>
  <Iteration_hit>Elememt4 Element4
    abc4 hit 1
  </Iteration_hit>
</Iteration>
</root>
'''

from lxml import etree

tree = etree.XML(content)
target_elements = tree.xpath('//Iteration_hit[contains(., "Element2") or contains(., "Element4")]')

for element in target_elements:
    print(etree.tostring(element).decode())  # decode so Python 3 prints text rather than b'...'

Output

<Iteration_hit>Elememt2 Element2
    abc2 hit 1
  </Iteration_hit>

<Iteration_hit>Elememt4 Element4
    abc4 hit 1
  </Iteration_hit>
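
The XPath above returns the <Iteration_hit> elements, whereas the question asks for the enclosing <Iteration> blocks. Below is a minimal sketch of one way to get those, using the standard-library xml.etree.ElementTree rather than lxml. The names select_iterations and wanted are hypothetical, and it assumes the ID is the second whitespace-separated token of the hit text, as in the sample data:

```python
import xml.etree.ElementTree as ET

content = """<root>
<Iteration>
  <Iteration_hit>Elememt1 Element1
    abc1 hit 1
  </Iteration_hit>
</Iteration>
<Iteration>
  <Iteration_hit>Elememt2 Element2
    abc2 hit 1
  </Iteration_hit>
</Iteration>
<Iteration>
  <Iteration_hit>Elememt4 Element4
    abc4 hit 1
  </Iteration_hit>
</Iteration>
</root>"""


def select_iterations(xml_text, wanted):
    """Return the <Iteration> elements whose hit ID is in `wanted`."""
    root = ET.fromstring(xml_text)
    selected = []
    for iteration in root.iter("Iteration"):
        hit = iteration.find("Iteration_hit")
        if hit is None or not hit.text:
            continue  # no hit line in this block
        tokens = hit.text.split()
        # Assumption: the ID is the second token, e.g. "Elememt2 Element2 ..."
        if len(tokens) > 1 and tokens[1] in wanted:
            selected.append(iteration)
    return selected


for iteration in select_iterations(content, {"Element2", "Element4"}):
    print(ET.tostring(iteration, encoding="unicode"))
```

Comparing whole tokens instead of using contains() avoids accidental substring matches, for example Element1 matching Element10.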

Happy to help, and welcome to Stack Overflow. If this answer or any other one solved your issue, please mark it as accepted. — gtlambert, Jan 14 '16 at 18:21

score 0 · Answer 2 · answered Jan 14 '16 at 21:43

Here is the complete script for parsing the XML with Python and lxml:

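#!/usr/bin/python
from lxml import etree

# input.xml must be well-formed: a single root element and closed <Iteration_hit> tags
with open('input.xml', 'r') as myfile:
    content = myfile.read()

# Read one ID per line, skipping blank lines
Id = list()
with open('ID.list') as lst:
    for line in lst:
        stuff = line.strip()
        if stuff:
            Id.append(stuff)

# Parse the document once, outside the loop
tree = etree.XML(content)

for ele in Id:
    # Pass the ID as an XPath variable ($ele); a bare `ele` inside the
    # expression would be read as a child element name, not the Python variable
    target_elements = tree.xpath('//Iteration[contains(., $ele)]', ele=ele)
    # Print inside the loop so matches for every ID are output, not just the last
    for element in target_elements:
        print(etree.tostring(element).decode())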
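
If the input really looks like the file in the question (no single root element, unclosed <Iteration_hit> tags), an XML parser will reject it. Here is a rough line-based sketch that copies each <Iteration> to </Iteration> block whose hit ID is in the list. The function name extract_blocks and the inline sample are hypothetical, and it assumes each tag sits on its own line and the ID is the second token of the <Iteration_hit> line, as in the question:

```python
def extract_blocks(lines, wanted):
    """Yield lists of lines from <Iteration> to </Iteration> for wanted IDs."""
    block = None   # lines of the block currently being read, or None
    keep = False   # whether the current block matched a wanted ID
    for line in lines:
        stripped = line.strip()
        if stripped == "<Iteration>":
            block, keep = [line], False
            continue
        if block is None:
            continue  # outside any block
        block.append(line)
        if stripped.startswith("<Iteration_hit>"):
            tokens = stripped.split()
            # Assumption: second token is the ID,
            # e.g. "<Iteration_hit>Elememt2 Element2"
            if len(tokens) > 1 and tokens[1] in wanted:
                keep = True
        elif stripped == "</Iteration>":
            if keep:
                yield block
            block = None


sample = """\
<Iteration>
  <Iteration_hit>Elememt1 Element1
    abc1 hit 1
</Iteration>
<Iteration>
  <Iteration_hit>Elememt2 Element2
    abc2 hit 1
</Iteration>
""".splitlines(keepends=True)

for block in extract_blocks(sample, {"Element2"}):
    print("".join(block), end="")
```

With real files, the same function can be given an open file object for lines and a set built from the ID list for wanted, so the XML file is read only once regardless of how many IDs there are.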
