
I am very new to both Hadoop and Pig. I have written a number of simple programs, but one task that is giving me trouble is processing XML when part of the XML file is malformed.

I can use XMLLoader('tag') to get all of the tags from an XML file, which is great. However, if one element is missing a well-formed closing tag, Pig stops at that one. For example:

<tag>
</tag>
<tag>
</tag1>
<tag>
</tag>

This will only pick up the first valid tag. Now, I have experience with JAQL and am able to ignore the error record so that the application picks up the second tag.

My question is: is there a way to handle poorly formatted XML using Pig, rather than JAQL?

C4 - Travis

1 Answer


I've been looking at the Pig XMLLoader code. With a malformed tag, the loader never notices that the tag ends, and it has no way of noticing that it has entered a new main tag. As XMLLoader currently stands, there appears to be no way to get around this.

It might, however, be possible to modify XMLLoader so that it works the way you want, probably by changing the conditions in the skipToTag method so that if it runs into another instance of the specified opening tag, it skips ahead to that one and ignores the malformed tag. Keep in mind that this will break if you have nested tags with the same name (e.g. address as the root, with another address element lower in the document), so it isn't foolproof.
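To make the idea concrete, here is a rough sketch of that resynchronising logic in Python. It is not the actual XMLLoader code (which is Java); the function name extract_blocks is made up for illustration, and it assumes opening tags carry no attributes.

```python
def extract_blocks(text, tag):
    """Return well-formed <tag>...</tag> blocks, dropping records that never close.

    Illustrative only: assumes the opening tag has no attributes and that
    records with this tag name are not nested inside one another.
    """
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    blocks = []
    pos = text.find(open_tag)
    while pos != -1:
        body_start = pos + len(open_tag)
        next_open = text.find(open_tag, body_start)
        close = text.find(close_tag, body_start)
        # Keep the record only if its close tag comes before the next opening tag.
        if close != -1 and (next_open == -1 or close < next_open):
            blocks.append(text[pos:close + len(close_tag)])
        # Otherwise a new opening tag appeared first: abandon the malformed
        # record and resynchronise on that opening tag.
        pos = next_open
    return blocks


sample = "<tag>\n</tag>\n<tag>\n</tag1>\n<tag>\n</tag>\n"
print(len(extract_blocks(sample, "tag")))
```

On the sample from the question, the record closed with </tag1> is skipped and the two well-formed records are kept. The nested-tag caveat above applies here as well.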

It would seem, however, that in most cases validating the XML beforehand is a better option, or having a pre-processor extract only the valid XML to a file that Pig then runs on.
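A pre-processor along those lines could look something like the following minimal Python sketch. The name valid_records is hypothetical, and it assumes the input is a flat sequence of records with no wrapping root element and no nested elements sharing the record's tag name. Each candidate is handed to the standard-library parser, and anything that fails to parse is discarded.

```python
import re
import xml.etree.ElementTree as ET


def valid_records(text, tag):
    """Return only the records that parse as well-formed XML.

    Illustrative only: assumes a flat sequence of <tag> records with
    nothing but whitespace between them.
    """
    # Split just before every opening tag so each chunk holds one candidate.
    chunks = re.split(rf"(?=<{re.escape(tag)}[\s>])", text)
    kept = []
    for chunk in chunks:
        candidate = chunk.strip()
        if not candidate:
            continue
        try:
            ET.fromstring(candidate)
        except ET.ParseError:
            continue  # malformed record: leave it out of the cleaned output
        kept.append(candidate)
    return kept


sample = "<tag>\n</tag>\n<tag>\n</tag1>\n<tag>\n</tag>\n"
cleaned = valid_records(sample, "tag")
print(len(cleaned))
```

The cleaned records could then be written to a new file (for example with "\n".join(cleaned)) and that file loaded with XMLLoader('tag') as usual.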

Hope this helps.

Davis Broda