
I'm reading through an XML file to find four pieces of data that I then write to a database. Is there a more efficient way to search for them? Would reading the file line by line be better in this case? At the moment I read the whole file into a variable and run each regex search against it.

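import re

# Compile the patterns once, outside the file handling.
# [^>]* keeps the match inside the opening tag; \. matches a literal dot.
bill_mgr_email_re = re.compile(r'<BillingManagerInformation\s[^>]*?\bEmail="([^"]*\.com)"')
num_bills_re = re.compile(r'NumberBills="(\d+)"')
num_ebills_re = re.compile(r'NumberOfEbills="(\d+)"')
num_mailed_re = re.compile(r'NumberOfMailedDocs="(\d+)"')

# Open in text mode so the str patterns above can be matched against the data.
with open(cur_file, 'r') as xml_file:
    data = xml_file.read()

# Each search returns None if the attribute is missing, so .group(1) would raise.
bill_mgr_email = bill_mgr_email_re.search(data).group(1)
num_bills = num_bills_re.search(data).group(1)
num_ebills = num_ebills_re.search(data).group(1)
num_mailed = num_mailed_re.search(data).group(1)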
Remi Guan
flybonzai
  • Why not use a parser to process the xml instead? – hwnd Sep 28 '15 at 23:44
  • Would that be more efficient if I only need 4 bits of data? – flybonzai Sep 28 '15 at 23:45
  • Iterating over each line in a file is generally preferred over reading the whole thing into memory at once, mostly because you can then process files that are larger than your computer's available memory without having a memory problem. If the file is small, though, it really doesn't matter. – TigerhawkT3 Sep 28 '15 at 23:54
  • Using something like [`xml.etree.ElementTree.iterparse`](https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse) should be similarly efficient to using regular expressions (and as a side-benefit, not pull the whole file into memory, and only scan it once). Plus it's actually likely to be correct (rather than confusing attributes for text and what have you). [Never use regular expressions to parse XML/HTML/whatever](https://stackoverflow.com/a/1732454/364696). – ShadowRanger Sep 29 '15 at 00:04
  • Why not try a couple of methods and profile them? – skrrgwasme Sep 29 '15 at 00:05
  • If you can make really strong assumptions about your data, then regexes can work for your specific use case. If you're unsure about the format of the data, then using regexes could easily mean that you miss elements whose format violates your assumptions. In that case, use a SAX parser instead. – beerbajay Sep 29 '15 at 00:10
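
For what it's worth, the `iterparse` suggestion from the comments could be sketched roughly as below. The element and attribute names come from the regexes in the question. Everything else is an assumption for illustration: the function name `extract_bill_data`, the sample document, and the guess that the three counts are attributes on some element (the question does not show the file layout). The parser reads the file incrementally and the loop stops as soon as all four values have been seen.

```python
import io
import xml.etree.ElementTree as ET

# Attribute names taken from the regexes in the question.
COUNT_ATTRS = ("NumberBills", "NumberOfEbills", "NumberOfMailedDocs")


def extract_bill_data(source):
    """Stream through `source` (a path or binary file object) and collect the four values.

    Hypothetical helper: assumes the counts are attributes on some element.
    """
    found = {}
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "end":
            elem.clear()  # drop the contents of finished elements to limit memory use
            continue
        # Attributes are already available on a "start" event.
        tag = elem.tag.rsplit("}", 1)[-1]  # strip a "{namespace}" prefix, if any
        if tag == "BillingManagerInformation" and "Email" in elem.attrib:
            found.setdefault("Email", elem.attrib["Email"])
        for name in COUNT_ATTRS:
            if name in elem.attrib:
                found.setdefault(name, int(elem.attrib[name]))
        if len(found) == 4:
            break  # all four values seen; no need to read the rest of the file
    return found


# Made-up sample document, only to show the call.
sample = b"""<Statement NumberBills="12">
  <BillingManagerInformation Name="Pat" Email="pat@example.com"/>
  <Delivery NumberOfEbills="7" NumberOfMailedDocs="5"/>
</Statement>"""

print(extract_bill_data(io.BytesIO(sample)))
```

With a real file you would pass the path (for example `extract_bill_data(cur_file)`). Whether this beats the regex version on your data is something only profiling will tell, but it makes a single pass and does not depend on how the attributes happen to be laid out on each line.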
