I have around 1000 XML files, each about 250 MB in size. I need to extract some data from them and write it to CSV. There must not be any duplicate entries.
I have a system with 4GB RAM and an AMD A8 processor.
I have already gone through some previous posts here but they don't seem to answer my problem.
I have already written the code in Python and tested it on a sample XML file, and it worked well.
However, it was very slow (almost 15 minutes per file) when I ran it on all the files, and I had to terminate the process midway.
What can be an optimal solution to speed up the process?
Here's the code:
import glob
import xml.etree.ElementTree as ET

path = 'data/*.xml'
t = []
for fname in glob.glob(path):
    print('Parsing ', fname)
    tree = ET.parse(fname)
    root = tree.getroot()
    # ElementTree paths must be relative, so search from the root with './/'
    x = root.findall('.//Article/AuthorList//Author')
    for child in x:
        try:
            lastName = child.find('LastName').text
        except AttributeError:
            lastName = ''
        try:
            foreName = child.find('ForeName').text
        except AttributeError:
            foreName = ''
        t.append((lastName, foreName))
    print('Parsed ', fname)
t = set(t)
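From what I have read, ElementTree's iterparse can stream each file instead of building the whole 250 MB tree in memory, and a set would dedupe as it goes. Is something like this rough sketch the right direction? (The authors.csv name is just my placeholder, the tag names are from my sample file, and I have not benchmarked it.)

import csv
import glob
import xml.etree.ElementTree as ET

authors = set()  # a set dedupes on insert, so no duplicate entries
for fname in glob.glob('data/*.xml'):
    print('Parsing ', fname)
    context = ET.iterparse(fname, events=('start', 'end'))
    _, root = next(context)  # grab the root element from the first event
    for event, elem in context:
        if event != 'end':
            continue
        if elem.tag == 'Author':
            # findtext returns the default when the child element is missing
            authors.add((elem.findtext('LastName', default=''),
                         elem.findtext('ForeName', default='')))
        elif elem.tag == 'Article':
            root.clear()  # drop finished articles so memory stays flat

with open('authors.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['LastName', 'ForeName'])
    writer.writerows(sorted(authors))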
I want the fastest method to get the entries without any duplicate values. Maybe I should store them in a database instead of the variable t; would writing each entry to a database speed things up by leaving more RAM free? Whatever the method, I need a pointer in the right direction.
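For the DB idea specifically, would something like sqlite3 with a UNIQUE constraint be sensible, so the database does the dedup on disk instead of a set in RAM? A minimal sketch of what I have in mind (authors.db and the table name are placeholders I made up):

import sqlite3

conn = sqlite3.connect('authors.db')
conn.execute('''CREATE TABLE IF NOT EXISTS authors (
                    lastName TEXT,
                    foreName TEXT,
                    UNIQUE (lastName, foreName))''')

def add_author(lastName, foreName):
    # INSERT OR IGNORE silently skips rows that hit the UNIQUE constraint
    conn.execute('INSERT OR IGNORE INTO authors VALUES (?, ?)',
                 (lastName, foreName))

# add_author() would be called inside the parsing loop, then:
conn.commit()
for lastName, foreName in conn.execute('SELECT * FROM authors'):
    print(lastName, foreName)
conn.close()

I don't know whether the per-insert overhead would make this slower than the in-memory set, so any guidance on that trade-off would help too.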