
I have written the code below to convert a CSV file to an XML file. The input file has 10 million records. The problem is that it runs for hours with 10 million records, whereas with a smaller number of records, like 2000, it takes 5-10 seconds.

Is there a way to do this more efficiently, in less time?

import csv
import sys
import os
from xml.dom.minidom import Document

filename = sys.argv[1]
filename = os.path.splitext(filename)[0]+'.xml'
pathname = "/tmp/"
output_file = pathname + filename

f = sys.stdin
reader = csv.reader(f)
fields = next(reader)
fields = [x.lower() for x in fields]
fieldsR = fields

doc = Document()
dataRoot = doc.createElement("rowset")
dataRoot.setAttribute('xmlns:xsi', "http://www.w3.org/2001/XMLSchema-instance")
dataRoot.setAttribute('xsi:schemaLocation', "./schema.xsd")
doc.appendChild(dataRoot)

for line in reader:
    dataElt = doc.createElement("row")           
    for i in range(len(fieldsR)):
        dataElt.setAttribute(fieldsR[i], line[i])
        dataRoot.appendChild(dataElt)


xmlFile = open(output_file,'w')
xmlFile.write(doc.toprettyxml(indent = '\t'))
xmlFile.close()
sys.stdout.write(output_file)
  • Likely the problem is that with this library, you must create the entire tree in memory before writing it out. There are some [streaming writer solutions](https://stackoverflow.com/questions/5377980/iteratively-write-xml-nodes-in-python) for you to explore; see the sketch after these comments. – Jongware Mar 12 '20 at 21:04
    I think this question could be reopened, but it has been closed for lacking focus. Could you add some timing measurements to this to see what bit of code is adding the inefficiency? I wonder if your XML writer is trying to keep everything in RAM and thus will get increasingly less efficient as the dataset size increases. – halfer Mar 13 '20 at 18:13
  • Specifically, how much time for 10 million records do you consider an acceptable solution? 10 million records in 1 hour? 4 hours? 10 minutes? Trying to stave off answers that reduce your current 8 hour timeframe by only a few minutes... – TylerH Mar 13 '20 at 18:28
  • Note: While you haven't specified anything more accurate than "hours" for how long it takes with 10M records, that appears to be in the ballpark for scaling the time requirement linearly from "2000 it takes 5-10 seconds". If it takes 5s to 10s for 2k records, then 10M records should take 25,000s to 50,000s, which is 6.94h to 13.89h. – Makyen Mar 13 '20 at 18:39
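
As a rough sketch of the streaming idea suggested in the comments above (assuming the same stdin/argv conventions as the question's code, and that the header names are valid XML attribute names), each row can be written to the output file as soon as it is read, so nothing accumulates in memory:

import csv
import os
import sys
from xml.sax.saxutils import quoteattr

filename = os.path.splitext(sys.argv[1])[0] + '.xml'
output_file = "/tmp/" + filename

reader = csv.reader(sys.stdin)
fields = [x.lower() for x in next(reader)]

with open(output_file, 'w') as xml_file:
    xml_file.write('<?xml version="1.0"?>\n')
    xml_file.write('<rowset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
                   'xsi:schemaLocation="./schema.xsd">\n')
    for line in reader:
        # quoteattr escapes each value and adds the surrounding quotes
        attrs = ' '.join('%s=%s' % (name, quoteattr(value))
                         for name, value in zip(fields, line))
        xml_file.write('\t<row %s/>\n' % attrs)
    xml_file.write('</rowset>\n')

sys.stdout.write(output_file)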

1 Answer


I don't know Python or Minidom, but you seem to be executing the line

dataRoot.appendChild(dataElt)

once for every field in every row, rather than once for every row.
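
As a sketch, the loop with that call moved so it runs once per row might look like this (variable names as in the question):

for line in reader:
    dataElt = doc.createElement("row")
    for name, value in zip(fieldsR, line):
        dataElt.setAttribute(name, value)
    # append the finished row element once per row, not once per field
    dataRoot.appendChild(dataElt)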

Your performance numbers suggest that something is very wrong here; I don't know if this is it. With 2000 records I would expect to measure the time in milliseconds.

I have to say I'm constantly amazed at how people write complex procedural code for this kind of thing when it could be done in half a dozen lines of XSLT or XQuery.

Michael Kay
  • Thanks Michael. Can you please be specific or point to anything on how Python can be used with XSLT or XQuery? – Victor Mayne Mar 13 '20 at 02:42
  • The traditional, widely used approach is libxslt - see https://stackoverflow.com/questions/16698935/how-to-transform-an-xml-file-using-xslt-in-python. The new kid on the block, offering XSLT 3.0 rather than 1.0, is the Python binding for Saxon/C - see http://www.saxonica.com/saxon-c/documentation/index.html and https://www.saxonica.com/saxon-c/doc/html/saxonc.html – Michael Kay Mar 13 '20 at 08:01
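
As a concrete starting point for the lxml/libxslt route mentioned in the first link, a minimal Python sketch might look like the following (the stylesheet and file names are placeholders, and libxslt only supports XSLT 1.0, so reading CSV directly would still need the Saxon/C route or a preprocessing step):

from lxml import etree

# Placeholder file names: a stylesheet and an XML source document
transform = etree.XSLT(etree.parse("transform.xsl"))
result = transform(etree.parse("input.xml"))

# Serialize the transformed tree to a file
with open("output.xml", "wb") as f:
    f.write(etree.tostring(result, pretty_print=True))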