
I am using Hive 0.13.1 and have written a custom SerDe that can process a special kind of XML data. So far so good. I also wrote an InputFormat class that splits the input data.

Is it possible to produce multiple output rows in the deserialize() method of my custom SerDe (or somewhere else in the SerDe)?

That is, can I create, say, two rows out of one split? As far as I can see from other SerDe classes, deserialize() returns only a single List (the values of one row), and that is displayed as one row.

Let's say I have XML like this:

<item>
  <id>0</id>
  <timestamp>00:00:00</timestamp>
  <subitemlist>
    <subitem>1</subitem>
    <subitem>2</subitem>
  </subitemlist>
</item>

My SerDe gets the whole item block, and what I want to do now is create one row in Hive for each <subitem>, each carrying the id of its <item>.

I can't adapt the InputFormat class because the problem is not as trivial as it is in this example :)

S. Walz

2 Answers


No, it's not possible. The SerDe interface serializes and deserializes one record at a time, because that is what serialization is supposed to do. In general, it is not a good design decision to have a SerDe actually transform data; that's what queries, UDFs and UDTFs are for. The purpose of a SerDe is basically to map a data format to an equivalent Hive schema.

I think the best way to do it is to have a table like

create table xmltable ( 
  id int,
  ts timestamp,
  subitems array<int>
)

that uses this SerDe, and then create a view on top of it:

CREATE VIEW myview AS
  SELECT id, sb FROM xmltable LATERAL VIEW explode(subitems) sb1 AS sb;
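
To make the one-record-in, one-row-out mapping concrete, here is a minimal Java sketch. It deliberately does not use the Hive SerDe API: `ItemRowSketch` and `toRow` are hypothetical names, the timestamp is kept as a plain string, and the input is assumed to look exactly like the <item> example in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Hive: shows the shape of the single row
// a deserialize()-style method could return for one <item> block.
final class ItemRowSketch {

    // Maps one <item> block to one row: [id, timestamp, list of subitems].
    static List<Object> toRow(String itemXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(itemXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse item XML", e);
        }

        // Collect every <subitem> value into one array-like column.
        NodeList subitemNodes = doc.getElementsByTagName("subitem");
        List<Integer> subitems = new ArrayList<>();
        for (int i = 0; i < subitemNodes.getLength(); i++) {
            subitems.add(Integer.parseInt(subitemNodes.item(i).getTextContent().trim()));
        }

        List<Object> row = new ArrayList<>();
        row.add(Integer.parseInt(firstText(doc, "id")));
        row.add(firstText(doc, "timestamp")); // kept as a string in this sketch
        row.add(subitems);
        return row;
    }

    // Returns the trimmed text of the first element with the given tag name.
    private static String firstText(Document doc, String tag) {
        return doc.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

With a row shaped like this, the LATERAL VIEW explode(subitems) in the view is what turns the single row into one row per subitem.
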
Roberto Congiu

OK, thanks for your answer, Roberto.

In general, it is not a good design decision to have a SerDe to actually transform data, that's what queries, UDFs and UDTFs are for

Yeah, you are probably right. The problem is that I need to do some processing based on the data of many columns, so a UDF would increase the complexity too much. But still, thanks for the answer.

I have now solved it by adapting the next() method in my InputFormat class (I know I said I didn't want to do this, but ...). I analyse the <item> tag, and for every <subitem> I return the whole item to the SerDe.
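
A rough illustration of that idea in plain Java follows. It is not the code actually used and not the Hadoop RecordReader API: `ItemReplicatorSketch` is a hypothetical class, counting literal "<subitem>" tags is a simplification that assumes the tags carry no attributes, and the subitem index it exposes is an assumption about how the SerDe could tell the repeated records apart.

```java
import java.util.NoSuchElementException;

// Hypothetical sketch: hands out the same <item> block once per <subitem>,
// so that a one-record-per-row deserializer ends up producing one row per subitem.
final class ItemReplicatorSketch {
    private final String itemXml;
    private final int subitemCount;
    private int emitted = 0;

    ItemReplicatorSketch(String itemXml) {
        this.itemXml = itemXml;
        // "<subitem>" does not match "<subitemlist>" because of the closing '>'.
        this.subitemCount = countOccurrences(itemXml, "<subitem>");
    }

    // True while there are subitems for which the item has not been emitted yet.
    boolean hasNext() {
        return emitted < subitemCount;
    }

    // Returns the whole item again and moves on to the next subitem.
    String next() {
        if (!hasNext()) {
            throw new NoSuchElementException("all subitems already emitted");
        }
        emitted++;
        return itemXml;
    }

    // Zero-based index of the subitem that the most recent next() call stands for.
    int currentSubitemIndex() {
        return emitted - 1;
    }

    // Counts non-overlapping occurrences of token in text.
    private static int countOccurrences(String text, String token) {
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(token, from)) != -1) {
            count++;
            from += token.length();
        }
        return count;
    }
}
```

For the example item with two subitems, next() yields the full <item> block twice, and the deserializer would pick the subitem matching currentSubitemIndex() each time.
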

S. Walz