I've been attempting to write an algorithm that runs a diff of two XML files in the following way:
- Takes in 2 XML files and parses them as trees using lxml
- Transforms each XML Element into a node
- Finds which nodes are unchanged, moved, changed, or added/deleted and labels them as such
- Prints out the results (which I haven't done yet)
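To make the first two steps concrete, here is a minimal sketch of parsing a file and copying each element into a plain Python node. The `Node` class, `to_node`, and `load_tree` are hypothetical names invented for illustration and are not taken from the GitHub code; the fallback to the standard library's `xml.etree.ElementTree` is only there so the sketch runs without lxml installed.

```python
from dataclasses import dataclass, field
from io import BytesIO

try:
    from lxml import etree  # the parser named in the question
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree


@dataclass
class Node:
    """Hypothetical plain-Python wrapper around one XML element."""
    tag: str
    text: str
    attrib: dict
    children: list = field(default_factory=list)
    # Filled in later by the diff: unchanged / moved / changed / added / deleted
    status: str = "unlabeled"


def to_node(element):
    """Recursively copy an element and its descendants into Node objects."""
    node = Node(
        tag=element.tag,
        text=(element.text or "").strip(),
        attrib=dict(element.attrib),
    )
    # Comments and processing instructions in lxml have a non-string tag; skip them.
    node.children = [to_node(child) for child in element if isinstance(child.tag, str)]
    return node


def load_tree(source):
    """Parse a path or file-like object and return the root Node."""
    return to_node(etree.parse(source).getroot())


# Tiny in-memory example instead of a real file
old_root = load_tree(BytesIO(b"<a x='1'><b>hi</b><c/></a>"))
print(old_root.tag, [child.tag for child in old_root.children])
```

Because the nodes are ordinary Python objects, they hold no reference to lxml once built. The recursion is kept for brevity; a very deeply nested document might need an iterative traversal instead.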
I'm basing my algorithm on this GitHub code and am basically rewriting the author's code in my own words to understand it.
My algorithm works correctly, but it chokes on large files: a 20MB+ file takes 40 minutes, whereas a 17MB file takes only 2 minutes, frustratingly enough.
My algorithm would likely execute much faster if it could use more of the CPU (my code never gets past 12.5% processor usage). I considered multiprocessing, but ran into the problem that "lxml cannot be pickled (as for now) so it cannot be transferred between processes by multiprocessing package". I've read up on what pickling is, but I'm struggling to figure out a solution.
Are there any workarounds that would help solve my problem? Any and all suggestions would be greatly appreciated! :)