Hive: converting a comma separated string to array for table generating function

Question

I am creating a Hive table on Amazon's Elastic MapReduce by using a gzipped JSON encoded file. I am using this JSON SerDe: http://code.google.com/p/hive-json-serde/

The unencoded file looks like this:

{"id":"101", "items":"A:231,234,119,12"}

{"id":"102", "items":"B:13,89,121"}

...

I'd like to create an array of the "items" column for user with a table generating function. The array I want would be the "exploded" CSV of ints ignoring the ":" and the letter before it. I want to be able to GROUP BY the item ints (231,234,etc.) or JOIN them against other tables. Changing around the format of this file would be difficult and all my old data is stored in this format so I'd like to solve this in Hive without making changes to how to store my data.

Is there a way for me to do this?

Edit:

Thanks @mark-grover, this worked!

I did something like this:

SELECT id, item FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item

to get:

In doing this I found out I'd like to do a little more. I'd like to also select the position that each item was in in the CSV. So I'd like the output to look like:

Any ideas here?

Edit again:

I came up with something for my followup question doing something like this:

SELECT id, item, find_in_set(item, substr(items, 3, length(items) - 2)) AS position
FROM my_table LATERAL VIEW explode(split(substr(items, 3, length(items) - 2), ',')) t1 AS item

Mark Grover · Accepted Answer · 2012-12-11T23:39:00.853

4

You can do that use Hive UDFs. You would want to use the split UDF to split the string on commas to get an array and then use Lateral view along with explode UDTF to "explode" the contents of the array.

To answer your second question: This is what I can think of right now. There might be a better/easier/less hackier way to do it.

Create a UDF say split_extended that would return as a struct with 2 elements, first one being the actual contents of the array element, second one being the index. When you explode this the type of exploded column will be a struct. One of the entries of this struct would the actual element from the array, the other would be the index corresponding to the element.

If you think about it, we are augmenting split UDF to return 2 pieces of information - the element and the index. These are several ways of going about returning such information as a single object - structure is one of them, array being another (first element of the array being the actual element from the array, second element being the index). Another hackier way to return 2 pieces of information is to concat them with something (say a ':') that you know doesn't exist in your data so then you can have your UDF return '231:0', '234:0', '119:2', etc. and in the end instead of accessing them members of a struct, you can split the resulting string out based on ':' (using split UDF) to get the desired two pieces.

For creating your own UDF, you will be able to leverage most, if not all of split UDF code available here.

edited Dec 11 '12 at 23:39

answered Dec 07 '12 at 06:53

Mark Grover

4,070
22
21

Thanks, Mark! I have a follow up question that may be difficult or impossible (see my edits to the original question). Please let me know if you have any ideas. – dubois Dec 11 '12 at 02:59
I edited my answer to answer your second question. Good luck! – Mark Grover Dec 11 '12 at 23:39
Ah, thanks, Mark--especially for the pointer to the UDF code. I ended up figuring something out that uses find_in_set() so I won't have to write my own UDF. Your help is greatly appreciated. – dubois Dec 12 '12 at 04:25
Glad to have helped. Good luck! – Mark Grover Dec 13 '12 at 09:15
Hi Mark, can you please take a look at this question of mine? http://stackoverflow.com/questions/18624991/is-there-a-way-to-transpose-data-in-hive. It's kinda similar. Thanks! And sorry, @dubois, but I had to get Mark's attention. It seems like he might know how to help me with my question. Thanks! :) – CodingInCircles Sep 06 '13 at 23:43
Not sure if the UDF was available at the time the question is raised, but position of the match along with the data value can be extracted using posexplode- https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-posexplode – SrinR Nov 14 '16 at 09:22

Hive: converting a comma separated string to array for table generating function

1 Answers1

Linked