
I am using two Jupyter notebooks to do different things in an analysis. In my Scala notebook, I write some of my cleaned data to parquet:

partitionedDF.select("noStopWords","lowerText","prediction").write.save("swift2d://xxxx.keystone/commentClusters.parquet")

I then go to my Python notebook to read in the data:

df = spark.read.load("swift2d://xxxx.keystone/commentClusters.parquet")

and I get the following error:

AnalysisException: u'Unable to infer schema for ParquetFormat at swift2d://RedditTextAnalysis.keystone/commentClusters.parquet. It must be specified manually;'

I have looked at the Spark documentation and I don't think I should be required to specify a schema. Has anyone run into something like this? Should I be doing something else when I save/load? The data is landing in Object Storage.
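
For reference, I know a schema can be passed explicitly on read, something like the following (the column types here are my guesses from the select above, so treat them as assumptions), but I'd rather not have to maintain one by hand:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

# assumed column types -- guesses based on the select in the Scala notebook
schema = StructType([
    StructField("noStopWords", ArrayType(StringType())),
    StructField("lowerText", StringType()),
    StructField("prediction", IntegerType()),
])

df = spark.read.schema(schema).load("swift2d://xxxx.keystone/commentClusters.parquet")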

edit: I'm using Spark 2.0 for both the read and the write.

edit2: This was done in a project in Data Science Experience.

– Ross Lewis
  • Here is a [gist](https://gist.github.com/jtyberg/9f8480724634c764d3c73c8e989fa8f9) to write/read a DataFrame as a parquet file to/from Swift. It's using a simple schema (all "string" types). What is the schema for your DataFrame? Spark tries to infer the schema, but "Currently, numeric data types and string type are supported" (see http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery) – jtyberg Mar 24 '17 at 18:10
  • I believe you answered my question then! The column "noStopWords" is a vector of words. How do I save/load a df with this column? – Ross Lewis Mar 24 '17 at 18:36
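
For the vector-valued column discussed in the comments above, one workaround is to flatten it to a plain array before writing. Here is a rough PySpark sketch, assuming the column is a Spark ML Vector (if it is already array<string>, e.g. straight out of StopWordsRemover, parquet stores it natively):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

# hypothetical helper: turn an ML Vector column into a plain array<double>
# so that nothing exotic is left for schema inference to trip over
vector_to_array = udf(lambda v: v.toArray().tolist() if v is not None else None,
                      ArrayType(DoubleType()))

flatDF = df.withColumn("noStopWords", vector_to_array("noStopWords"))
flatDF.write.parquet("swift2d://xxxx.keystone/commentClusters.parquet")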

2 Answers


I use the following two ways to read the parquet file:

Initialize Spark Session:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()

Method 1:

df = spark.read.parquet('path-to-file/commentClusters.parquet')

Method 2:

sc = spark.sparkContext

# using SQLContext to read parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# read parquet file
df = sqlContext.read.parquet('path-to-file/commentClusters.parquet')
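
Note that in Spark 2.x, SQLContext is kept mainly for backward compatibility, so both methods end up on the same reader. Either way, a quick sanity check on what came back:

# confirm the schema was inferred and the data is readable
df.printSchema()
df.show(5)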
– Jeril

You can use the parquet method of the SparkSession reader to read parquet files, like this:

df = spark.read.parquet("swift2d://xxxx.keystone/commentClusters.parquet")

That said, there should be no difference between the parquet and load functions here, since load uses the parquet format by default. It might be that load was not able to infer the schema of the data in the file (e.g., some data type that load cannot identify or that is specific to parquet).
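
If you want to keep using load, you can also pin the format explicitly instead of relying on the default (a minimal sketch):

# equivalent to spark.read.parquet(...); names the format instead of
# relying on the spark.sql.sources.default setting
df = spark.read.format("parquet").load("swift2d://xxxx.keystone/commentClusters.parquet")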

– himanshuIIITian
  • Thank you for the feedback, but this ended up with the same error. I'll keep trying other things. – Ross Lewis Mar 24 '17 at 16:43
  • There is a tutorial for that here: http://datascience.ibm.com/blog/upload-files-to-ibm-data-science-experience-using-the-command-line-2/ – aruizga Apr 17 '17 at 19:46