
I'm new to Zeppelin and there are things I just don't understand.

I have downloaded a table from a database with Python and I would like to convert it to an RDD, but I get an error saying that the table is not found. I think the problem is that tables created with one interpreter are not found by another, but I don't really know... I tried the solutions from this and this question, but they still don't work; both create the DataFrame directly with Spark. Any help would be really useful :)

%python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+mysqlconnector://...')
df = pd.read_sql(query, engine)  # query is the SQL string for the table I want

%spark
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.implicits._
df=df.registerTempTable("df")

val df = sqlContext.sql("SELECT * from df LIMIT 5")
df.collect().foreach(println)
  • if you want to use registered temp tables across paragraphs, I've found that you need to use the pre-initialized sql context. For me, that was "sqlc" (although I believe sqlContext is also valid). But don't create your own like you have above. I was importing some existing python code that had a sql context called "sq", so I simply did: sq=sqlc and that was it. – tbone Jun 09 '17 at 15:41
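
For illustration, here is a minimal sketch of the idea in the comment above, using the %pyspark interpreter (on some setups the paragraph type may be %python or %spark.pyspark instead). Whether the pre-initialized context is exposed as sqlContext or sqlc depends on your Zeppelin/Spark versions, and the names sq and shared_df below are just examples:

%pyspark
# Reuse the SQL context that Zeppelin pre-initializes for the Spark interpreters
# instead of creating your own SQLContext; temp tables registered through it
# are visible from %spark and %sql paragraphs as well.
sq = sqlContext  # alias it if existing code expects another name

spark_df = sq.createDataFrame([[1, 2]])   # dummy example data
spark_df.registerTempTable("shared_df")   # "shared_df" can now be queried from %spark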

1 Answer


Converting a Pandas DataFrame to a Spark DataFrame is quite straightforward:

%python
import pandas

pdf = pandas.DataFrame([[1, 2]]) # this is a dummy dataframe

# convert your pandas dataframe to a spark dataframe
df = sqlContext.createDataFrame(pdf)

# you can register the table to use it across interpreters
df.registerTempTable("df")

# you can get the underlying RDD without changing the interpreter 
rdd = df.rdd
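
If you want to sanity-check the result without switching interpreters, you can peek at the RDD directly. A small sketch, assuming the df and rdd from above; the RDD holds pyspark Row objects, and the exact field names depend on your DataFrame's columns:

%python
# df.rdd yields an RDD of Row objects; inspect or transform it in place
print(rdd.take(1))                              # e.g. [Row(0=1, 1=2)]
print(rdd.map(lambda row: row[0]).collect())    # pull out the first column positionally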

To fetch it with Scala Spark, you just need to do the following:

%spark
val df = sqlContext.sql("select * from df")
df.show()
// +---+---+
// |  0|  1|
// +---+---+
// |  1|  2|
// +---+---+

You can also get the underlying RDD:

val rdd = df.rdd
  • Oh thank you!! How did you import the SQL context? I tried from pyspark.sql import * but there's no module named pyspark, and I can't find it with pip; it seems to be a Spark-only thing – Cl4u Jun 09 '17 at 10:14
  • I'm not sure I get your question. Normally zeppelin picks up pyspark from the $SPARK_HOME. Also I don't see why you'd want to import anything for this. This is a complete functional example. – eliasah Jun 09 '17 at 11:42
  • because I got the error "NameError: name 'sqlContext' is not defined" :( – Cl4u Jun 09 '17 at 12:07
  • What versions of zeppelin and spark are you using ? What kind of cluster are you working on ? – eliasah Jun 09 '17 at 12:12
  • Spark 2.1.0, zeppelin 0.7.1. I'm using it at host from a docker https://hub.docker.com/r/claudiarivera/zeppelin/ – Cl4u Jun 09 '17 at 12:32
  • This is another issue then. Maybe the docker image isn't capable of creating the zeppelin context with the sql context.. – eliasah Jun 09 '17 at 12:44
  • this is another docker image you can try https://github.com/dylanmei/docker-zeppelin/issues – eliasah Jun 09 '17 at 14:55