I am using Hive 0.13.1 and have written a custom SerDe that processes a special kind of XML data. So far so good. I also wrote an InputFormat class that splits the input data.
Is it possible to produce multiple output rows in the deserialize() function of my custom SerDe (or somewhere else in the SerDe)?
For example, can I create two rows out of one split? As far as I can see from other SerDe classes, the return value of deserialize() is a single List (the values of one row), and it is displayed as one row.
Let's say I have XML like this:
<item>
<id>0</id>
<timestamp>00:00:00</timestamp>
<subitemlist>
<subitem>1</subitem>
<subitem>2</subitem>
</subitemlist>
</item>
My SerDe gets the whole item block, and what I want to do now is create one row in Hive for each <subitem>, each carrying the id of the enclosing <item>.
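To make the desired result concrete, here is a minimal plain-Java sketch of the expansion I am after. It is not Hive code and does not use any SerDe API: the class ItemRows and the method expand are hypothetical names used only for illustration, and including the timestamp as a column is my assumption. It only shows which rows should come out of one item block.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ItemRows {

    // Hypothetical helper, not part of Hive: expands one <item> block into
    // one (id, timestamp, subitem) row per <subitem> element.
    static List<List<String>> expand(String itemXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(itemXml)));
        Element item = doc.getDocumentElement();

        // Values shared by every row generated from this item.
        String id = item.getElementsByTagName("id").item(0).getTextContent().trim();
        String timestamp = item.getElementsByTagName("timestamp").item(0).getTextContent().trim();

        // One output row per <subitem>.
        NodeList subitems = item.getElementsByTagName("subitem");
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < subitems.getLength(); i++) {
            String subitem = subitems.item(i).getTextContent().trim();
            rows.add(Arrays.asList(id, timestamp, subitem));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><id>0</id><timestamp>00:00:00</timestamp>"
                + "<subitemlist><subitem>1</subitem><subitem>2</subitem></subitemlist></item>";
        for (List<String> row : expand(xml)) {
            System.out.println(row);
        }
    }
}
```

For the item above, this yields the two rows (0, 00:00:00, 1) and (0, 00:00:00, 2), which is what I would like to see in the Hive table.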
I can't solve this in the InputFormat class, because the real problem is not as trivial as this example :)