
I have an XML document similar to the one below. I am trying to get the elements named "elem" whose "id" attribute falls within a given range.

For example: get all "elem" elements from id=4 to id=8.

<all_levels>
<level1>
    <level2>
        <level3>
        <elem id="1"> </elem>
        <elem id="2"> </elem>
        </level3>
        <level3>
        <elem id="3"> </elem>
        <elem id="4"> </elem>
        </level3>
    </level2>
    <level2>
        <level3>
        <elem id="5"> </elem>
        <elem id="6"> </elem>
        </level3>
        <level3>
        <elem id="7"> </elem>
        <elem id="8"> </elem>
        </level3>
    </level2>
</level1>
<level1>
    <level2>
        <level3>
        <elem id="9"> </elem>
        <elem id="10"> </elem>
        </level3>
        <level3>
        <elem id="11"> </elem>
        <elem id="12"> </elem>
        </level3>
    </level2>
    <level2>
        <level3>
        <elem id="13"> </elem>
        <elem id="14"> </elem>
        </level3>
        <level3>
        <elem id="15"> </elem>
        <elem id="16"> </elem>
        </level3>
    </level2>
</level1>
</all_levels>

I have tried two methods. 1) Using XPath to fetch each required "elem" element individually, for example for the range (4, 8):

from lxml import etree
sample_xml = etree.parse("sample_xml.xml")
# The elements in the sample are named "elem", so query that tag.
elem1 = sample_xml.xpath("//elem[@id = '%s']" % str(4))[0]
elem2 = sample_xml.xpath("//elem[@id = '%s']" % str(5))[0]
elem3 = sample_xml.xpath("//elem[@id = '%s']" % str(6))[0]
elem4 = sample_xml.xpath("//elem[@id = '%s']" % str(7))[0]
elem5 = sample_xml.xpath("//elem[@id = '%s']" % str(8))[0]

But if the range is large, it takes too much time to get all the elements.

2) Using XPath to get the first "elem" in the range, then using the getnext() method to get its siblings:

from lxml import etree
sample_xml = etree.parse("sample_xml.xml")
elem1 = sample_xml.xpath("//elem[@id = '%s']" % str(4))[0]
elems = [elem1]
curr_elem = elem1
current_id = 4
while current_id < 8:
    # getnext() returns None once the last sibling under this parent is reached
    curr_elem = curr_elem.getnext()
    if curr_elem is None:
        break
    elems.append(curr_elem)
    current_id += 1

But the problem is that getnext() only returns siblings under the same parent, so it cannot reach the "elem" elements that sit under other <level3> parents.

Is there a better way to get the "elem" elements in a range than the XPath approach above?

Satheesh K
  • How much is "too much time"? – mzjn Jun 04 '19 at 10:26
  • @mzjn Too much time in my case is 1 to 2 minutes for an XML file containing 6000 "elem" elements, with the range (70, 2000). – Satheesh K Jun 04 '19 at 10:52
  • By the way, I got an answer from a [similar question](https://stackoverflow.com/questions/3354987/what-is-the-xpath-to-select-a-range-of-nodes) using XPath only. The answer is that we can use a range condition to get the list of elements, like `elem_list = xml_etree.xpath("//elem[@id >= '%d' and @id <= '%d']" % (range_start,range_end))`. It takes much, much less time. – Satheesh K Jun 04 '19 at 10:54
  • What exactly is your expected output? – Jack Fleeting Jun 04 '19 at 11:00
  • My expected output was to get all the elements named "elem" whose "id" attribute falls in a particular range, and I got it using `elem_list = xml_etree.xpath("//elem[@id >= '%d' and @id <= '%d']" % (range_start,range_end))` – Satheesh K Jun 04 '19 at 11:02
  • If you have a solution, you should post it as an answer. Explain why you chose that solution and what "much less time" means. – mzjn Jun 04 '19 at 11:44
  • @mzjn, thanks for asking me to be more specific. I have just posted an answer. Please check it in case you are curious. – Satheesh K Jun 04 '19 at 13:04

1 Answer


It turns out that XPath can efficiently select all "elem" elements whose "id" attribute falls within a particular range.

Below are the two methods. I used the cell magic command "%%time" to measure how long each approach took.

from lxml import etree
sample_xml = etree.parse("sample_xml.xml")

Method 1:

%%time
start_heading_id = 4
ending_heading_id = 1000
elem1 = sample_xml.xpath("//elem[@id = '%s']" % str(start_heading_id))[0]
elems = [elem1]
current_id = start_heading_id
# One full XPath query per id; stop before stepping past ending_heading_id.
while current_id < ending_heading_id:
    curr_elem = sample_xml.xpath("//elem[@id = '%s']" % str(current_id + 1))[0]
    elems.append(curr_elem)
    current_id += 1

Output(took 13.2 seconds to get all elements):

CPU times: user 13.2 s, sys: 23.6 ms, total: 13.2 s
Wall time: 13.2 s

Method 2:

%%time
start_heading_id = 4
ending_heading_id = 1000
elems = sample_xml.xpath("//elem[@id >= '%d' and @id <= '%d']" % (start_heading_id,ending_heading_id))

Output(took 0.0387 seconds to get all elements):

CPU times: user 39.2 ms, sys: 1.25 ms, total: 40.5 ms
Wall time: 38.7 ms
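
A likely reason for the gap is that every "//elem[@id = ...]" query in Method 1 searches the whole document again, while Method 2 evaluates its condition in a single pass. The same single-pass idea can be sketched without XPath comparisons, using only the standard library's xml.etree.ElementTree. This is a swapped-in alternative, not the lxml code timed above; the inline SAMPLE document and the helper name elems_in_range are hypothetical, and the sketch assumes every "id" is an integer.

```python
import xml.etree.ElementTree as ET

# Small stand-in document (hypothetical); the real input would be sample_xml.xml.
SAMPLE = """
<all_levels>
  <level1>
    <level2>
      <level3><elem id="1"/><elem id="2"/></level3>
      <level3><elem id="3"/><elem id="4"/></level3>
    </level2>
    <level2>
      <level3><elem id="5"/><elem id="6"/></level3>
      <level3><elem id="7"/><elem id="8"/></level3>
    </level2>
  </level1>
  <level1>
    <level2>
      <level3><elem id="9"/><elem id="10"/></level3>
    </level2>
  </level1>
</all_levels>
"""


def elems_in_range(root, start_id, end_id):
    """Return every <elem> whose integer id lies in [start_id, end_id]."""
    # root.iter("elem") walks the whole tree once, in document order,
    # regardless of which <level3> parent each element sits under.
    return [
        elem
        for elem in root.iter("elem")
        if start_id <= int(elem.get("id")) <= end_id
    ]


root = ET.fromstring(SAMPLE)
selected = elems_in_range(root, 4, 8)
print([elem.get("id") for elem in selected])
```

Because iter() crosses parent boundaries, this also avoids the getnext() limitation described in the question. With lxml, the XPath range query in Method 2 remains the more direct option.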
Satheesh K