Processing an XML file with Hadoop using Python

Question

I am using Python with Hadoop to process an XML file. The file has the following format:

temporary.xml

<report>
<report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
<date-range date="All Time"/>
  <table>
    <columns>
       <column name="campaignID" display="Campaign ID"/>
       <column name="adGroupID" display="Ad group ID"/>
       <column name="keywordID" display="Keyword ID"/>
       <column name="keyword" display="Keyword"/>
    </columns>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
  </table>
</report>

All I want to do is process the XML file above and then save the data into an MSSQL database.

mapper.py code

import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row>") != -1:
            intext = True
            buff = cStringIO.StringIO()
            buff.write(line)
        elif line.find("</>") != -1:
            intext = False
            buff.write(line)
            val = buff.getvalue()
            buff.close()
            buff = None
            print val

What I want to do here is fetch the data from the row tags, that is, the values of campaignID, adGroupID, keywordID and keyword, and print them so that they become the input to reducer.py (which contains the code that saves the data to the database).

I have looked at some examples, but in those the tags look like <tag> </tag>, whereas in my case I only have <row/>.

But the code above is not working: it does not print anything. Can anyone please correct my code and add the necessary Python code to get the values from the row tags? I am very new to Hadoop, so I will extend the code myself from there.

score 0 · Answer 1 · edited May 23 '17 at 12:04

Have you considered using XPath? It is a mini language for navigating an XML tree, and it can be used easily from within Python.

http://docs.python.org/2/library/xml.etree.elementtree.html might be of use to you.

You might also want to look at the question "Need Help using XPath in ElementTree".

Here's how I would do it (this is valid Python code; I tested it in Python 3.2 and it works fine with your example XML):

import xml.etree.ElementTree as xml  # same import as in your script, so no extra tools are needed

def get_row_attributes(the_xml_as_a_string):
    """
    Take XML as a string, in the shape of the example XML in the question.

    Return a list of dictionaries, one per row. Each dictionary holds the
    attributes of that row, so the result looks like:
     [
          {attribute_name: value_for_first_row, attribute_name: value_for_first_row, ...},
          {attribute_name: value_for_second_row, attribute_name: value_for_second_row, ...},
          etc
     ]
    """
    tree = xml.fromstring(the_xml_as_a_string)
    # 'table/row' is an XPath expression: every <row> that is a child of a <table> directly under the root
    rows = tree.findall('table/row')
    return [row.attrib for row in rows]

To use this function, read stdin and build up a string, then call get_row_attributes(the_xml_as_a_string).

The resulting dictionaries contain the information you requested (the attributes of the rows).
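
A minimal sketch of that usage, assuming the whole XML document arrives on standard input in one piece. The function is repeated so the block runs on its own; `main` is just an illustrative name.

```python
import sys
import xml.etree.ElementTree as xml


def get_row_attributes(the_xml_as_a_string):
    """Return one dict of attributes per <row> element under <table>."""
    tree = xml.fromstring(the_xml_as_a_string)
    return [row.attrib for row in tree.findall('table/row')]


def main():
    # Read everything from stdin into one string, then parse it in one go.
    the_xml_as_a_string = sys.stdin.read()
    if not the_xml_as_a_string.strip():
        return  # nothing on stdin, nothing to parse
    for attributes in get_row_attributes(the_xml_as_a_string):
        print(attributes)


if __name__ == '__main__':
    main()
```

Run it as, for example, `python script.py < temporary.xml` (the script name is hypothetical).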

So now we have read the input from stdin and gotten all the information about all the rows, all using completely normal Python.

The last thing to do is write it to your other process. If you need help with this part, please include information about what format the data should be in and where it should go.
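
For Hadoop Streaming in particular, a mapper receives its input split line by line on stdin, so it may never see the whole document at once. The mapper in the question also searches for the literal strings `<row>` and `</>`, neither of which occurs in the file, which is why it prints nothing. Since every `<row .../>` in the example sits on a single line and is well-formed XML on its own, one option is to parse each such line separately and emit one tab-separated record per row. This is only a sketch under those assumptions: `FIELDS`, `parse_row_line` and the tab-separated output format are illustrative choices, not something the question specifies.

```python
import sys
import xml.etree.ElementTree as ET

# Attribute names taken from the <columns> section of the example XML.
FIELDS = ('campaignID', 'adGroupID', 'keywordID', 'keyword')


def parse_row_line(line):
    """Return the row's field values as a list, or None if the line is not a <row .../>."""
    line = line.strip()
    if not (line.startswith('<row ') and line.endswith('/>')):
        return None
    try:
        row = ET.fromstring(line)
    except ET.ParseError:
        return None  # skip malformed lines rather than failing the whole task
    # Missing attributes become empty strings so every record has the same width.
    return [row.attrib.get(name, '') for name in FIELDS]


if __name__ == '__main__':
    for line in sys.stdin:
        values = parse_row_line(line)
        if values is not None:
            # One tab-separated record per row; a reducer can split on '\t'.
            print('\t'.join(values))
```

The reducer would then read these lines from its own stdin, split each on the tab character, and insert the four values into the database.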

Thanks for your precious reply, but I am trying this through Hadoop and actually want to save to a database, so can you please add the code to get the values from the row tag in terms of Hadoop and Python? — Shiva Krishna Bavandla, Nov 07 '12 at 11:43
@shivakrishna: I've added some comments for clarity. If you need anything else please be specific — Sheena, Nov 07 '12 at 13:24
