How do I query HTML embedded inside a string inside a JSON file with Apache Drill?

Question

I'm trying to use Apache Drill (for the first time) on a JSON file that looks like this:

{
    "Key1": {
      "htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />"
    },
    "Key2": {
      "htmltags": "<htmltag attr1='kilo' /><htmltag attr2='lima' /><htmltag attr3='mike' />"
    },
    "Key3": {
      "htmltags": "<htmltag attr1='november' /><htmltag attr2='foxtrot' /><htmltag attr3='sierra' />"
    }
}

My initial query was the "hello world" of Drill, SELECT * FROM DataFile.json, and it returned the columns Key1, Key2, and Key3. The result had only one row, and it contained the entry "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />" [i.e., only the entry Key1.htmltags].

I have two questions:

1. Why was there only one row returned, when there were three differently valued entries for each key?
2. After using the KVGEN/FLATTEN functions to get at my strings inside "htmltags" above, is there a way to drill further into (analyse and extract data from) the HTML tags?

I can't validate the JSON you posted or get a result from Drill when running your hello world query. Please check the JSON you used against the post. – catpaws Nov 16 '15 at 02:22
@catpaws This was representative of the original; sorry, I didn't check it for validity. I'll correct it. – Aditya M P Nov 16 '15 at 06:56

score 0 · Answer 1 · answered Nov 16 '15 at 17:31

The JSON does not seem to be well formed: the objects are not clearly identified by a name/value pair, nor do they form a clear array.

Once that is resolved, the values for htmltags will have to be handled with string functions such as locate, substr, position, etc. (see https://drill.apache.org/docs/string-manipulation/).
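As a rough illustration of the locate-then-substring idea, here is a minimal sketch written in Python rather than Drill SQL, so it shows only the logic and not Drill's actual function syntax. The function name extract_attr_values is hypothetical, and the sketch assumes every attribute value is wrapped in single quotes, as in the sample data.

```python
def extract_attr_values(htmltags):
    """Collect single-quoted attribute values using position and substring only."""
    values = []
    pos = 0
    while True:
        # Locate the start of the next attribute value (the text after =').
        start = htmltags.find("='", pos)
        if start == -1:
            break
        start += 2  # skip past the two characters ='
        # Locate the closing quote of that value.
        end = htmltags.find("'", start)
        if end == -1:
            break
        # Take the substring between the two quotes.
        values.append(htmltags[start:end])
        pos = end + 1
    return values


sample = "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />"
print(extract_attr_values(sample))  # ['bravo', 'delta', 'charlie']
```

The same two steps (find a position, then cut a substring) would need to be repeated or nested for each value when written with SQL string functions, which is why this approach gets unwieldy for markup.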

It may be best to store the htmltags values as arrays rather than as a single string.
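If preprocessing the file is an option, the restructuring could be sketched as below. This is only one possible shape, assuming a list of records with the original key moved into a "key" field and htmltags split into one string per tag. The function name reshape and the output layout are hypothetical, and whether Drill then returns one row per record should be checked against your setup.

```python
import json
import re


def reshape(doc):
    """Turn {"Key1": {"htmltags": "<a/><b/>"}, ...} into a list of records."""
    records = []
    for key, value in doc.items():
        # Split the single string into one entry per tag (anything between < and >).
        tags = re.findall(r"<[^>]+>", value["htmltags"])
        records.append({"key": key, "htmltags": tags})
    return records


raw = json.loads('''
{
  "Key1": {"htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' />"},
  "Key2": {"htmltags": "<htmltag attr1='kilo' /><htmltag attr2='lima' />"}
}
''')

print(json.dumps(reshape(raw), indent=2))
```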

answered Nov 16 '15 at 17:31

Andries

I agree that the structure doesn't seem optimal, but this is what I was supplied to work with :( Could you expand a little on the string-handling functions with an example? – Aditya M P Nov 16 '15 at 17:33

score 0 · Accepted Answer · edited May 23 '17 at 11:44

Unfortunately, it looks like Drill (v1.1.0 on Homebrew as of this writing) isn't the right tool for the job.

It looks like a bug in the system is the reason there is only one row despite multiple columns. I've filed a report: https://issues.apache.org/jira/browse/DRILL-4102
I've scoured the documentation once again, and there are no tools to analyse HTML or XML natively. Depending on string manipulation for this is not a task I relish.

Hence, I'll go with an XML parser, DOM tree crawler, or the like, and use bash string handling (awk/tee) to extract the target tag strings.
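For anyone taking the parser route, a minimal sketch using only Python's standard library (json and html.parser) might look like this. It is not the approach actually used above, just one way to do it. The names AttrCollector and extract are hypothetical, and the sketch assumes the file has the same top-level layout as the sample in the question.

```python
import json
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record every (tag, attribute, value) triple seen while parsing."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <htmltag attr1='bravo' />.
        for name, value in attrs:
            self.found.append((tag, name, value))


def extract(doc):
    """Map each top-level key to the attribute triples in its htmltags string."""
    result = {}
    for key, value in doc.items():
        parser = AttrCollector()
        parser.feed(value["htmltags"])
        parser.close()
        result[key] = parser.found
    return result


doc = json.loads('''
{
  "Key1": {"htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' />"}
}
''')

print(extract(doc))
```

A real parser handles quoting and whitespace variations that position-based string slicing would miss.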

answered Nov 17 '15 at 12:45

Aditya M P

There's some experimental support for XML as a data source here: https://github.com/magpierre/drill/tree/DRILL-3878/contrib/storage-xml Maybe this could help? – Chris Matta Nov 25 '15 at 20:59
