
I'm new to Zeppelin and there are things I just don't understand.

I have downloaded a table from a database with Python and I would like to convert it to an RDD, but I get an error saying that the table is not found. I think the problem is that tables created with one interpreter are not found by another, but I don't really know... I tried the solutions from this and this question, but they still don't work; both create the DataFrame directly with Spark. Any help would be really useful :)

%python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+mysqlconnector://...')
df = pd.read_sql(query, engine)  # query is the SQL string for the table I want

%spark
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._
df=df.registerTempTable("df")

val df = sqlContext.sql("SELECT * from df LIMIT 5")
df.collect().foreach(println)
  • if you want to use registered temp tables across paragraphs, I've found that you need to use the pre-initialized sql context. For me, that was "sqlc" (although I believe sqlContext is also valid). But don't create your own like you have above. I was importing some existing python code that had a sql context called "sq", so I simply did: sq=sqlc and that was it. – tbone Jun 09 '17 at 15:41
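
For illustration, here is a minimal sketch of the idea in the comment above, using the %pyspark interpreter (on some setups the paragraph type may be %python or %spark.pyspark instead). Whether the pre-initialized context is exposed as sqlContext or sqlc depends on your Zeppelin/Spark versions, and the names sq and shared_df below are just examples:

%pyspark
# Reuse the SQL context that Zeppelin pre-initializes for the Spark interpreters
# instead of creating your own SQLContext; temp tables registered through it
# are visible from %spark and %sql paragraphs as well.
sq = sqlContext  # alias it if existing code expects another name

spark_df = sq.createDataFrame([[1, 2]])   # dummy example data
spark_df.registerTempTable("shared_df")   # "shared_df" can now be queried from %spark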

1 Answer


Converting a Pandas DataFrame to a Spark DataFrame is quite straightforward:

%python
import pandas

pdf = pandas.DataFrame([[1, 2]]) # this is a dummy dataframe

# convert your pandas dataframe to a spark dataframe
df = sqlContext.createDataFrame(pdf)

# you can register the table to use it across interpreters
df.registerTempTable("df")

# you can get the underlying RDD without changing the interpreter 
rdd = df.rdd
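
If you want to sanity-check the result without switching interpreters, you can peek at the RDD directly. A small sketch, assuming the df and rdd from above; the RDD holds pyspark Row objects, and the exact field names depend on your DataFrame's columns:

%python
# df.rdd yields an RDD of Row objects; inspect or transform it in place
print(rdd.take(1))                              # e.g. [Row(0=1, 1=2)]
print(rdd.map(lambda row: row[0]).collect())    # pull out the first column positionally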

To fetch it with Scala Spark, you just need to do the following:

%spark
val df = sqlContext.sql("select * from df")
df.show()
// +---+---+
// |  0|  1|
// +---+---+
// |  1|  2|
// +---+---+

You can also get the underlying RDD:

val rdd = df.rdd
  • Oh thank you!! How did you import the SQL context? I tried from pyspark.sql import * but there's no module named pyspark, and I can't find it with pip; it seems to be a Spark-only thing – Cl4u Jun 09 '17 at 10:14
  • I'm not sure I get your question. Normally zeppelin picks up pyspark from the $SPARK_HOME. Also I don't see why you'd want to import anything for this. This is a complete functional example. – eliasah Jun 09 '17 at 11:42
  • because I got the error "NameError: name 'sqlContext' is not defined" :( – Cl4u Jun 09 '17 at 12:07
  • What versions of zeppelin and spark are you using ? What kind of cluster are you working on ? – eliasah Jun 09 '17 at 12:12
  • Spark 2.1.0, zeppelin 0.7.1. I'm using it at host from a docker https://hub.docker.com/r/claudiarivera/zeppelin/ – Cl4u Jun 09 '17 at 12:32
  • This is another issue then. Maybe the docker image isn't capable of creating the zeppelin context with the sql context.. – eliasah Jun 09 '17 at 12:44
  • this is another docker image you can try https://github.com/dylanmei/docker-zeppelin/issues – eliasah Jun 09 '17 at 14:55