I am using Python with Hadoop to process an XML file. The file has the following format:
temporary.xml
<report>
  <report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
  <date-range date="All Time"/>
  <table>
    <columns>
      <column name="campaignID" display="Campaign ID"/>
      <column name="adGroupID" display="Ad group ID"/>
      <column name="keywordID" display="Keyword ID"/>
      <column name="keyword" display="Keyword"/>
    </columns>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
  </table>
</report>
All I want to do is process the above XML file and then save the data into an MSSQL database.
mapper.py code
import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row>") != -1:
            intext = True
            buff = cStringIO.StringIO()
            buff.write(line)
        elif line.find("</>") != -1:
            intext = False
            buff.write(line)
            val = buff.getvalue()
            buff.close()
            buff = None
            print val
What I want to do here is fetch the data from the row tags, that is, the values of campaignID, adGroupID, keywordID and keyword, and print them so that they become the input to reducer.py (which contains the code to save the data in the database).
I have looked at some examples, but in those the tags come in pairs like <tag> </tag>, whereas in my case I have only self-closing <row/> elements.
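That difference matters for the string tests in mapper.py. A small check, using one row copied from temporary.xml, suggests that neither "<row>" nor "</>" ever occurs in such a line; the variable names here are only illustrative:

```python
# One row line exactly as it appears in temporary.xml.
row = ('<row campaignID="79057390" adGroupID="3451305670" '
       'keywordID="3000000" keyword="Content"/>')

# str.find returns -1 when the substring is absent.
has_open_tag = row.find("<row>") != -1    # "<row" is followed by a space, not ">"
has_close_tag = row.find("</>") != -1     # the line ends in "/>", not "</>"

# A test that does match this kind of self-closing element.
is_self_closing_row = row.startswith("<row ") and row.endswith("/>")
```

Both find() tests come out false for every row, so neither branch of the if/elif in mapper.py is ever entered.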
But my code above is not working: it does not print anything. Can anyone please correct my code and add the necessary Python code to get the values from the row tags? I am very new to Hadoop, so this will help me extend the code myself next time.
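For reference, here is a minimal sketch of the behaviour being asked for. It assumes that each <row .../> element sits on a single line, as in temporary.xml, and that tab-separated output is acceptable to reducer.py. The helper names row_to_record, records and main are hypothetical, not part of any Hadoop API.

```python
import sys
import xml.etree.ElementTree as ET

# Attribute names taken from the <row .../> elements in temporary.xml.
FIELDS = ("campaignID", "adGroupID", "keywordID", "keyword")


def row_to_record(line):
    """Return the row's attribute values joined by tabs, or None for non-row lines."""
    line = line.strip()
    # Assumption: every row is a self-closing element on one line.
    if not (line.startswith("<row ") and line.endswith("/>")):
        return None
    try:
        elem = ET.fromstring(line)
    except ET.ParseError:
        return None  # skip malformed lines instead of failing the whole task
    return "\t".join(elem.get(name, "") for name in FIELDS)


def records(lines):
    """Yield one tab-separated record per <row .../> line."""
    for line in lines:
        record = row_to_record(line)
        if record is not None:
            yield record


def main():
    # Hadoop Streaming feeds the input split on stdin and reads results from stdout.
    for record in records(sys.stdin):
        sys.stdout.write(record + "\n")

# mapper.py would end with a call to main(); it is left uncalled here
# so that the helpers can be imported and tried out on their own.
```

With the sample file, each row would come out as one line of four tab-separated values, which reducer.py could then split on "\t" before writing to the database.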