
I am executing a statement in Livy Server using an HTTP POST call to localhost:8998/sessions/0/statements, with the following body:

{
  "code": "spark.sql(\"select * from test_table limit 10\")"
}

I would like an answer in the following format:

(...)
"data": {
  "application/json": "[
    {"id": "123", "init_date": 1481649345, ...},
    {"id": "133", "init_date": 1481649333, ...},
    {"id": "155", "init_date": 1481642153, ...},
  ]"
}
(...)

but what I'm getting is:

(...)
"data": {
  "text/plain": "res0: org.apache.spark.sql.DataFrame = [id: string, init_date: timestamp ... 64 more fields]"
}
(...)

That is the toString() representation of the DataFrame, not its contents.

Is there some way to return a DataFrame as JSON using the Livy Server?

EDIT

Found a JIRA issue that addresses the problem: https://issues.cloudera.org/browse/LIVY-72

Judging by the comments, does this mean Livy does not and will not support this feature?

matheusr

3 Answers


I recommend using the built-in (albeit sparsely documented) %json and %table magics:

%json

import json
import textwrap

import requests

host = "http://localhost:8998"                  # Livy server from the question
headers = {'Content-Type': 'application/json'}

session_url = host + "/sessions/1"
statements_url = session_url + '/statements'
data = {
        'code': textwrap.dedent("""\
        val d = spark.sql("SELECT COUNT(DISTINCT food_item) FROM food_item_tbl")
        val e = d.collect
        %json e
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print(r.json())

%table

session_url = host + "/sessions/21"
statements_url = session_url + '/statements'
data = {
        'code': textwrap.dedent("""\
        val x = List((1, "a", 0.12), (3, "b", 0.63))
        %table x
        """)}
r = requests.post(statements_url, data=json.dumps(data), headers=headers)
print(r.json())
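Note that statement execution in Livy is asynchronous: the POST returns a statement record with an id, and the client then polls GET /sessions/{sessionId}/statements/{statementId} until its state becomes "available". A minimal Python sketch of that polling loop, with the HTTP call injected as a function so the helpers stay self-contained (the commented usage assumes the `requests` library and the `host`/`headers` variables from the snippets above):

```python
import time

def extract_output(statement):
    """Pull the result payload out of a completed Livy statement record.

    Prefers the application/json mime type (produced by %json/%table),
    falling back to the plain-text REPL output.
    """
    if statement.get("state") != "available":
        return None
    data = statement["output"]["data"]
    return data.get("application/json", data.get("text/plain"))

def poll_statement(get_fn, url, interval=1.0, attempts=30):
    """Call get_fn(url) until the statement reports state 'available'."""
    for _ in range(attempts):
        statement = get_fn(url)
        if statement.get("state") == "available":
            return statement
        time.sleep(interval)
    raise TimeoutError("statement did not complete: " + url)

# Typical use against a live server (requires a running Livy session):
# r = requests.post(statements_url, data=json.dumps(data), headers=headers)
# statement_url = host + r.headers["Location"]
# result = poll_statement(lambda u: requests.get(u, headers=headers).json(),
#                         statement_url)
# print(extract_output(result))
```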

Related: Apache Livy: query Spark SQL via REST: possible?

Garren S
  • Do these statements need to be part of the calling application, or can they be saved server-side and retrieved later? Which is the preferred way? – Utkarsh Saraf Oct 17 '17 at 07:35
  • 1
    These magics (%json and %table) are only useful when called from the application. Caching the data frame(s) that you ultimately use to derive the final results in your Livy session would likely be very wise – Garren S Oct 17 '17 at 14:24
  • So if I run a Spark job via Livy and call cache() on a dataframe, I can pass that dataframe to the magics (%json and %table) to return it to the client? – Utkarsh Saraf Oct 18 '17 at 07:02
  • Yes @UtkarshSaraf that is correct. I believe (based on my own code example above) that you need to collect the results first to a list rather than directly call the magics on a spark dataframe. – Garren S Oct 18 '17 at 18:36
  • I have been able to execute the above code. I noticed I only get the JSON data with a separate `GET` request. Is it not possible to link the statement's output directly to the batch and fetch the results without that extra `GET` call? – Utkarsh Saraf Oct 25 '17 at 10:15
  • You should probably mention that this is Python code, not native Spark. Do you have any reference for the magics? – HansHarhoff Jun 27 '19 at 19:02

I don't have much experience with Livy, but as far as I know this endpoint is used as an interactive shell, so the output is the string a shell would display. With that in mind, I can think of a way to emulate the result you want, though it may not be the best way to do it:

{
  "code": "println(spark.sql(\"select * from test_table limit 10\").toJSON.collect.mkString(\"[\", \",\", \"]\"))"
}

You will then have the JSON wrapped in a string, which your client can parse.
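On the client side, the result of this trick arrives in the statement's text/plain output field as a JSON-encoded string. A small Python sketch of the unwrapping step (the statement record shape follows Livy's GET statement response; the sample data here is illustrative):

```python
import json

def parse_text_plain_json(statement):
    """Parse the JSON array that the println(...toJSON...) trick leaves
    in the text/plain output of a completed Livy statement."""
    text = statement["output"]["data"]["text/plain"]
    return json.loads(text.strip())

# Illustrative statement record, as a client might receive it:
sample = {"output": {"data": {"text/plain": '[{"id": "123"}, {"id": "133"}]'}}}
rows = parse_text_plain_json(sample)
```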

Daniel de Paula
  • 1
    That was it! According to the [JIRA issue](https://issues.cloudera.org/browse/LIVY-72), Livy wasn't actually meant to do what I wanted, but your solution works perfectly, thanks! – matheusr Dec 14 '16 at 12:39

I think in general your best bet is to write your output to a database of some kind. If you write to a randomly named table, your code can read it back after the script finishes.
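One way to sketch this: have the client generate a unique table name and embed it in the Scala code it submits to Livy, so it knows exactly where to read the rows back from afterwards. This is an illustrative helper, not Livy API; it assumes the query contains no double quotes and that the Spark session can write managed tables:

```python
import uuid

def build_export_statement(query):
    """Build a Livy statement body that materialises a query's result
    into a uniquely named table. The name is generated client-side so
    the caller knows where to read the rows back from later."""
    table = "livy_result_" + uuid.uuid4().hex[:8]
    code = 'spark.sql("{q}").write.saveAsTable("{t}")'.format(q=query, t=table)
    return table, {"code": code}

# table, body = build_export_statement("select * from test_table limit 10")
# requests.post(statements_url, data=json.dumps(body), headers=headers)
# ...once the statement finishes, read `table` with any SQL client.
```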

coding
  • I think it depends on the size of the dataframe. I've run into issues where even though I tried to send the data back as JSON, Livy ran out of memory receiving it. (Spark job master/executors were fine on memory, Livy ran out. Probably adjustable.) – Peter Mar 15 '19 at 13:58