
Can someone tell me how to split work across multiple threads to speed up parsing? For example, I have an XML file with 200k lines, and I would like to assign 50k lines to each of 4 threads and parse them with a SAX parser. What I have so far is 4 threads that each parse all 200k lines, which means 200k*4 = 800k results, all of them duplicated.

Any help is appreciated.

test.xml:

<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>

My source code:

import json  
import xmltodict  
from lxml import etree
import xml.etree.ElementTree as ElementTree
import threading
import time

def sax_parsing():

    t = threading.currentThread()

    for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml"):
        #below codes read the attributes in an element specified
        if element.tag == 'row':
            print("Thread: %s" % t.getName())
            row_id = element.attrib.get('Id')
            row_post_id = element.attrib.get('PostId')
            row_vote_type_id = element.attrib.get('VoteTypeId')
            row_user_id = element.attrib.get('UserId')
            row_creation_date = element.attrib.get('CreationDate')
            print('ID: %s, PostId: %s, VoteTypeID: %s, UserId: %s, CreationDate: %s'% (row_id,row_post_id,row_vote_type_id,row_user_id,row_creation_date))
            element.clear()  

    return

if __name__ == "__main__":  

    start = time.time() #calculate execution time

    main_thread = threading.currentThread()
    no_threads = 4
    for i in range(no_threads):
        t = threading.Thread(target=sax_parsing)
        t.start()

    for t in threading.enumerate():
        if t is main_thread:
            continue
        t.join()  # wait for every worker thread, not only the last one

    end = time.time() #calculate execution time
    exec_time = end - start
    print('Execution time: %fs' % (exec_time))
martineau
Lewis Wong
  • Maybe try parsing first, then split and thread. – Tyler Christian Sep 12 '17 at 13:51
  • when you're doing `for event, element in etree.iterparse("/home/xiang/Downloads/FYP/parallel-python/test.xml")` you are giving all the threads the same xml file to parse. maybe cut the test.xml file into 4 parts? – Avishay Cohen Sep 12 '17 at 13:52
  • The `threading` module doesn't have a function called `currentThread()`. It does have one named `current_thread()`. – martineau Sep 12 '17 at 14:38

1 Answer


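The simplest way is to extend your parse function to accept a start row and an end row, like so: `def sax_parsing(start, end):`

Then pass the range when creating each thread: `t = threading.Thread(target=sax_parsing, args=(i*50, (i+1)*50))`. Note the parentheses around `i+1`; without them, `i+1*50` evaluates to `i+50`. Replace 50 with the number of rows each thread should handle, e.g. 50000 for a 200k-row file split across 4 threads.

Finally, change `if element.tag == 'row':` to `if element.tag == 'row' and start <= int(element.attrib.get('Id')) < end:`. The attribute value is a string, so it has to be converted to `int` before it can be compared with the bounds.

That way each thread handles only the rows in the range it was given. (I didn't actually test this, so play around.)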
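To make the idea above concrete, here is a minimal sketch, not a tested solution for your real file. It makes a few assumptions of mine: it uses the standard library's `xml.etree.ElementTree.iterparse` instead of `lxml` so that it runs without extra packages, it parses an in-memory copy of the sample from the question rather than a file on disk, and the names `parse_range` and `parse_in_threads` are made up for illustration. Because the sample Ids start at 1, each range is shifted by one.

```python
import io
import threading
import xml.etree.ElementTree as ET

# In-memory copy of the sample from the question (assumption: the real code reads a file).
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<votes>
  <row Id="1" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="2" PostId="1" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="3" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
  <row Id="5" PostId="3" VoteTypeId="2" CreationDate="2014-05-13T00:00:00.000" />
</votes>
"""


def parse_range(xml_bytes, start, end, results):
    """Append the Id of every <row> with start <= Id < end to results."""
    for event, element in ET.iterparse(io.BytesIO(xml_bytes)):
        if element.tag == 'row':
            # Attribute values are strings, so convert before comparing.
            row_id = int(element.attrib.get('Id'))
            if start <= row_id < end:
                results.append(row_id)
        element.clear()


def parse_in_threads(xml_bytes, no_threads, chunk):
    """Run one thread per Id range and return one result list per thread."""
    results = [[] for _ in range(no_threads)]
    threads = []
    for i in range(no_threads):
        # Ids start at 1, so thread i covers [i*chunk + 1, (i+1)*chunk + 1).
        args = (xml_bytes, i * chunk + 1, (i + 1) * chunk + 1, results[i])
        t = threading.Thread(target=parse_range, args=args)
        t.start()
        threads.append(t)
    # Join every worker thread before reading the results.
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(parse_in_threads(SAMPLE_XML, no_threads=4, chunk=2))
```

Keep in mind that every thread still walks the whole document and only skips the rows outside its range, so this removes the duplicated results but not the duplicated parsing work.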

Avishay Cohen