I'm using Pig to analyze data loaded from Cassandra. One of the columns I get is a string containing product ids and product information in JSON format:
row | ... | items | ...
1 | ... | "[{"id":"1", "useless_info":"blah"}, {"id":"2", "useless_info":"bleh"}]" | ...
2 | ... | "[{"id":"3"}]" | ...
. | . | . | .
Note that in some rows the JSON objects carry additional fields, while in others they contain only the id.
What I need to do is parse each "items" string and generate one output row per id:
row | id | ... |
1 | 1 | ... |
1 | 2 | ... |
2 | 3 | ... |
etc
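To make the desired transformation concrete, here is a rough sketch of it in plain Python (not Pig). The function name `extract_ids` is made up for illustration, and the sketch assumes each "items" value is a valid JSON array of objects that each have an "id" key, without the wrapping quotes shown in the table above:

```python
import json

def extract_ids(row_key, items):
    """Return one (row_key, id) pair per object in the JSON array string.

    Hypothetical helper: assumes `items` is a valid JSON array of objects
    and that every object has an "id" key. Any other keys are ignored.
    """
    return [(row_key, obj["id"]) for obj in json.loads(items)]

# Sample rows mirroring the table in the question.
rows = [
    (1, '[{"id":"1", "useless_info":"blah"}, {"id":"2", "useless_info":"bleh"}]'),
    (2, '[{"id":"3"}]'),
]

# Flatten: each input row expands to as many output rows as it has ids.
flattened = [pair for key, items in rows for pair in extract_ids(key, items)]
print(flattened)  # [(1, '1'), (1, '2'), (2, '3')]
```

The ids stay strings here because they are quoted in the JSON. What I'm looking for is the equivalent of this in Pig.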
From what I understand, there are no JSON parsers for Pig out there, only load and store functions (like elephant-bird). Is it possible to do what I want with something like REGEX_EXTRACT, or will I have to write my own UDF? Or is there a better, prettier, and more clever way?
Thanks in advance for all your help!
PS I'm using Pig 0.93