2

I have a Stack Overflow data dump in .xml format, nearly 27 GB, and I want to convert it to a .csv file. Can somebody suggest a tool or a Python program to convert XML to CSV?

Md Salim
  • 51
  • 6

2 Answers

0

Use one of the Python XML modules to parse the .xml file. Unless you have much more than 27 GB of RAM, you will need to do this incrementally, so limit your choices accordingly. Use the csv module to write the .csv file.

Your real problem is this. CSV files are lines of fields. They represent a rectangular table. XML files, in general, can represent more complex structures: hierarchical databases and/or multiple tables. So your real problem is to understand the data dump format well enough to extract records to write to the .csv file.
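As a rough illustration of the incremental approach, here is a minimal sketch using `xml.etree.ElementTree.iterparse` and the `csv` module from the standard library. It assumes the dump is a flat list of `<row ... />` elements under a single root, with each record's fields stored as attributes; the function name `xml_rows_to_csv` and the column names in the usage note are my own placeholders, so check them against your actual file.

```python
import csv
import xml.etree.ElementTree as ET


def xml_rows_to_csv(xml_path, csv_path, columns, row_tag="row"):
    """Stream <row .../> elements from xml_path into csv_path, one CSV line per row.

    Assumption: every record is a `row_tag` element whose fields are attributes.
    Returns the number of rows written.
    """
    written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(columns)  # header line
        # Request "start" events too, so the root element can be captured first.
        context = ET.iterparse(xml_path, events=("start", "end"))
        _, root = next(context)
        for event, elem in context:
            if event == "end" and elem.tag == row_tag:
                # Attributes missing from a row become empty fields.
                writer.writerow([elem.get(col, "") for col in columns])
                written += 1
                # Drop already-processed rows from the root so memory stays flat.
                root.clear()
    return written
```

It could be called as `xml_rows_to_csv("Posts.xml", "Posts.csv", ["Id", "PostTypeId", "CreationDate", "Score", "Title"])`, where both the file name and the column list are assumptions about your dump. Clearing the root after each row is what keeps memory use independent of file size; without it the parser still builds the whole tree.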

Terry Jan Reedy
  • 18,414
  • 3
  • 40
  • 52
0

I have written a PySpark function to convert the .xml to .csv; the GitHub repo is XmltoCsv_StackExchange. I used it to convert 1 GB of XML in 2-3 minutes on a minimal Spark setup with 2 cores and 2 GB of RAM. It can convert a 27 GB file too; just increase minPartitions from 4 to around 128 in this line:

raw = (sc.textFile(fileName, 4))
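For readers wondering what a line-oriented conversion might look like, here is a small sketch of a per-line function of the kind that could be mapped over the lines read by `sc.textFile`. This is not the code from the repo; it is my own illustration, and it assumes each record sits on its own line as a self-closing `<row ... />` element with the fields as attributes. The name `row_line_to_csv` is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def row_line_to_csv(line, columns):
    """Turn one '<row ... />' line into one CSV line; return None for any other line.

    Assumption: each record is a complete, self-closing element on a single line.
    """
    line = line.strip()
    if not line.startswith("<row "):
        return None  # XML declaration, root open/close tag, or blank line
    elem = ET.fromstring(line)
    buf = io.StringIO()
    # The csv module handles quoting of commas, quotes, and newlines in values.
    csv.writer(buf).writerow([elem.get(col, "") for col in columns])
    return buf.getvalue().rstrip("\r\n")
```

In Spark this might be used along the lines of `raw.map(lambda l: row_line_to_csv(l, cols)).filter(lambda x: x is not None)`, with `cols` being whatever attribute names you want to keep. Because each line is parsed independently, the work splits cleanly across partitions, which is why raising minPartitions helps on a larger file.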