I'm trying to use Apache Drill (for the first time) on a JSON file that looks like this:

{
    "Key1": {
      "htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />"
    },
    "Key2": {
      "htmltags": "<htmltag attr1='kilo' /><htmltag attr2='lima' /><htmltag attr3='mike' />"
    },
    "Key3": {
      "htmltags": "<htmltag attr1='november' /><htmltag attr2='foxtrot' /><htmltag attr3='sierra' />"
    }
}

My initial query was the "hello world" of Drill, SELECT * FROM DataFile.json, and it returned the columns Key1, Key2, and Key3. The result had only one row, and it contained the entry "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />" (i.e., only the entry Key1.htmltags).

I have two questions:

  1. Why was there only one row returned, when there were three differently valued entries for each key?
  2. After using the KVGEN/FLATTEN functions to get at my strings inside "htmltags" above, is there a way to drill further into (analyse and extract data from) the HTML tags?
Aditya M P

2 Answers


The JSON does not seem to be well structured for this. The objects are not identified by a common name/value pair, nor are they held in a clear array.

Once that is resolved, the values for htmltags will have to be handled with string functions such as locate, substr, position, etc. (see https://drill.apache.org/docs/string-manipulation/).

It may be best to store the htmltags as arrays rather than as a single string.
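As a rough sketch of that restructuring, assuming a Python preprocessing step outside Drill is acceptable: the keyed object could be turned into an array of records, with each concatenated string split into a list of tags. The function name `reshape` and the output field name `key` are made up for illustration, and the regular expression assumes every tag is self-closing, as in the sample.

```python
import json
import re


def reshape(doc):
    """Turn {"Key1": {"htmltags": "<a /><b />"}, ...} into a list of records."""
    records = []
    for key, value in doc.items():
        # Split the concatenated string into individual self-closing tags.
        # Assumption: every tag has the form <... />, as in the sample data.
        tags = re.findall(r"<[^>]*/>", value["htmltags"])
        records.append({"key": key, "htmltags": tags})
    return records


# Shortened sample in the same shape as the file from the question.
sample = {
    "Key1": {"htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' />"},
    "Key2": {"htmltags": "<htmltag attr1='kilo' /><htmltag attr2='lima' />"},
}

if __name__ == "__main__":
    print(json.dumps(reshape(sample), indent=2))
```

Each record then has the same two fields, so the keys become data rather than column names.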

Andries
  • I agree that the structure doesn't seem optimal, but this is what I was supplied to work with :( could you expand a little bit on the string handling function part with an example? – Aditya M P Nov 16 '15 at 17:33

Unfortunately, it looks like Drill (v1.1.0 on Homebrew as of this writing) isn't the right tool for the job.

  1. It looks like a bug in Drill is the reason only one row is returned despite there being multiple columns. I've filed a report: https://issues.apache.org/jira/browse/DRILL-4102
  2. I've scoured the documentation once again, and there are no tools to analyse HTML or XML natively. Depending on string manipulation for this is not a task I relish.

Hence, I'll go with an XML parser, DOM tree crawler, or the like, and use a bash string tool (awk/tee) to extract the target tag strings.
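For illustration, here is a minimal sketch of the parser route. It swaps in Python's standard-library `html.parser` for the bash/awk tooling mentioned above, so it is an alternative, not what was actually used. The names `AttrCollector` and `extract_attrs` are hypothetical, and the inline `raw` string stands in for the contents of the data file.

```python
import json
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collects attribute name/value pairs from every tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <htmltag ... /> also end up here, because
        # the default handle_startendtag delegates to handle_starttag.
        self.attrs.update(attrs)


def extract_attrs(fragment):
    """Return a dict of all attributes found in an HTML fragment."""
    parser = AttrCollector()
    parser.feed(fragment)
    parser.close()
    return parser.attrs


# Stand-in for the file contents; same shape as the JSON in the question.
raw = """
{
  "Key1": {"htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' />"},
  "Key2": {"htmltags": "<htmltag attr1='kilo' /><htmltag attr2='lima' />"}
}
"""

if __name__ == "__main__":
    for key, value in json.loads(raw).items():
        print(key, extract_attrs(value["htmltags"]))
```

If the same attribute name appeared on several tags in one string, the later value would overwrite the earlier one in this sketch. Collecting a list of pairs instead would avoid that.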

Aditya M P
  • There's some experimental support for XML as a data source here: https://github.com/magpierre/drill/tree/DRILL-3878/contrib/storage-xml Maybe this could help? – Chris Matta Nov 25 '15 at 20:59