
I've been trying to write an algorithm that diffs two XML files in the following way:

  1. Take in two XML files and parse them as trees using lxml
  2. Transform each XML element into a node
  3. Determine which nodes are unchanged, moved, changed, or added/deleted, and label them as such
  4. Print out the results (which I haven't done yet)
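
For concreteness, steps 1 to 3 can be sketched roughly as follows. This is a deliberately naive illustration and not the algorithm from the GitHub code: the `Node` class, the `build_nodes` and `label_nodes` helpers, and the rule of matching elements by their position path are all assumptions made up for this sketch, and it does not detect moved nodes. The import falls back to the standard library's `xml.etree.ElementTree` only so that the sketch runs where lxml is not installed.

```python
from dataclasses import dataclass

try:
    from lxml import etree  # the library used in the question
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree


@dataclass(frozen=True)
class Node:
    """Hypothetical plain-Python stand-in for one XML element."""
    path: str      # positional path, e.g. "/root/item[2]"
    text: str      # stripped element text ("" if none)
    attrib: tuple  # sorted (name, value) pairs


def build_nodes(xml_bytes):
    """Steps 1 and 2: parse the XML and return {path: Node} for every element."""
    root = etree.fromstring(xml_bytes)
    nodes = {}

    def walk(elem, path):
        nodes[path] = Node(path, (elem.text or "").strip(),
                           tuple(sorted(elem.attrib.items())))
        counts = {}
        for child in elem:
            if not isinstance(child.tag, str):
                continue  # skip comments and processing instructions
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, "/" + root.tag)
    return nodes


def label_nodes(old, new):
    """Step 3 (simplified): label each path by comparing the two node maps."""
    labels = {}
    for path, node in old.items():
        if path not in new:
            labels[path] = "deleted"
        elif new[path] == node:
            labels[path] = "unchanged"
        else:
            labels[path] = "changed"
    for path in new:
        if path not in old:
            labels[path] = "added"
    return labels


OLD = b"<root><a>1</a><b>2</b></root>"
NEW = b"<root><a>1</a><b>3</b><c/></root>"
labels = label_nodes(build_nodes(OLD), build_nodes(NEW))
```

Because `Node` holds only strings and tuples, instances of it can be pickled, unlike the lxml elements they were built from.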

I'm basing my algorithm on this GitHub code and am essentially rewriting the author's code in my own words to understand it.

My algorithm produces correct results, but it slows down badly on large files: a 20MB+ file takes 40 minutes, whereas a 17MB file takes only 2 minutes, frustratingly enough.

My algorithm would run much faster if it could use more of the CPU (it maxes out at 12.5% of the processor, which suggests it is limited to a single core). I considered multiprocessing but ran into the problem that "lxml cannot be pickled (as for now) so it cannot be transferred between processes by multiprocessing package". I've read up on what pickling is but am struggling to find a solution.

Are there any workarounds that would help solve my problem? Any and all suggestions would be greatly appreciated! :)

Anthony
  • [This answer](https://stackoverflow.com/a/25994232/190597) shows a way to pickle (and unpickle) lxml Elements. – unutbu Jul 09 '19 at 02:53
  • It is unclear (to me), however, if multiprocessing can be used advantageously here. Problems which parallelize well usually (maybe always?) keep the overhead of interprocess communication small compared to the amount of time spent doing concurrent processing. The pickler in the above link serializes the Elements to StringIOs and unpickles them with `etree.fromstring` on the other end. Especially when these StringIOs are quite large, all this interprocess communication is going to be a significant drag on performance. – unutbu Jul 09 '19 at 03:04
  • Related: [Parsing Very Large XML Files Using Multiprocessing](https://stackoverflow.com/q/21380884/190597) – unutbu Jul 09 '19 at 03:09
  • @unutbu I'm still very new to this multiprocessing/pickling stuff but I think I understand what you're saying. The files I'm parsing are read quickly (they're not extremely complex and under 300MB) and I've mostly been trying to use multiprocessing to speed up Step 3 of my algorithm. – Anthony Jul 09 '19 at 13:27
  • @unutbu The way the multiprocessing would work (in my head) would be to parse the XML files first and then send different nodes to different processes to be labeled. I've been experimenting with pool.apply_async but it requires using a .get() method after the process completes, which I'm not sure how to use because of the void method type of my "node labeling method". Thanks so much for your answer! – Anthony Jul 09 '19 at 13:30
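
On the `apply_async` and `.get()` point: a function running in a worker process operates on copies of its arguments, so a labeling method that returns nothing and only mutates its nodes would have its changes lost in the parent. A common pattern is to have the worker return its labels and collect them with `.get()`. The sketch below assumes the nodes have already been reduced to plain, picklable `(key, old_value, new_value)` tuples; `label_chunk`, `label_in_parallel`, and the labeling rule itself are hypothetical and stand in for the real node labeling method.

```python
from multiprocessing import Pool


def label_chunk(pairs):
    """Hypothetical worker: label each (key, old_value, new_value) tuple."""
    labels = {}
    for key, old, new in pairs:
        if old is None:
            labels[key] = "added"
        elif new is None:
            labels[key] = "deleted"
        elif old == new:
            labels[key] = "unchanged"
        else:
            labels[key] = "changed"
    return labels  # returned (and pickled back) instead of mutating shared state


def label_in_parallel(pairs, processes=2, chunk_size=2):
    """Split the work into chunks, label them in a pool, and merge the results."""
    chunks = [pairs[i:i + chunk_size] for i in range(0, len(pairs), chunk_size)]
    merged = {}
    with Pool(processes) as pool:
        pending = [pool.apply_async(label_chunk, (chunk,)) for chunk in chunks]
        for result in pending:
            merged.update(result.get())  # blocks until that chunk has finished
    return merged
```

`label_in_parallel(pairs)` should be called from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn rather than fork worker processes. Larger chunks mean fewer round trips between processes, which matters given the communication overhead mentioned in the comments above.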

0 Answers