I'm using IPython in a Spark/Bluemix environment.

I have a CSV file uploaded to the object store. I can read it fine using sc.textFile, but I get a "file does not exist" error when I use pandas pd.read_csv:

  1. data = sc.textFile("swift://notebooks.books/rtenews.csv")

  2. import pandas as pd
     data = pd.read_csv('swift://notebooks.books/rtenews.csv')

IOError: File swift://notebooks.books/rtenews.csv does not exist

Why is this? How can I read a CSV file into a pandas DataFrame?

tijko
subiman
  • Pandas reader supports only local filesystems. Why do you need this? – zero323 Dec 30 '15 at 20:38
  • It relates to a big data analytics course project and demonstrating the use of Spark/Bluemix and map/reduce is a requirement. Even though the file starts out on a local file system - I have to process it in Spark/Ipython. – subiman Dec 30 '15 at 21:07
  • Just skip Pandas and load data directly to Spark: http://stackoverflow.com/q/28782940/1560062 – zero323 Dec 30 '15 at 21:09
  • But then I miss out on all the pre canned data analysis capabilities of pandas dataframes – subiman Dec 30 '15 at 21:15
  • Spark and Pandas are two completely different worlds. If your requirement is Spark and distributed processing then Pandas won't work. Running Pandas in the same interpreter doesn't make it distributed. – zero323 Dec 30 '15 at 21:20
  • I managed to make some progress with .toDF() on the file loaded with sc.textFile. I can do some distributed processing on this, then convert to a Spark DataFrame, and then call toPandas(), which has me back in familiar territory. I take your point about the two different purposes. Thanks – subiman Dec 30 '15 at 21:54
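
The route described in the last comment (sc.textFile, then .toDF(), then toPandas()) might look roughly like the sketch below. It assumes a notebook where `sc` and a SQL context already exist, so that RDDs have a .toDF() method. The helper names `parse_line` and `rdd_to_pandas` and the column names in the usage note are hypothetical, not taken from the thread.

```python
import csv


def parse_line(line):
    """Split one CSV line into a list of fields, honouring quoted commas."""
    # Assumption: no field contains an embedded newline, because
    # sc.textFile hands over the file one physical line at a time.
    return next(csv.reader([line]))


def rdd_to_pandas(lines_rdd, columns):
    """Turn an RDD of CSV lines into a pandas DataFrame via a Spark DataFrame.

    `columns` is a list of column names chosen by the caller (hypothetical).
    Any header row in the file would still need to be filtered out first.
    """
    spark_df = lines_rdd.map(parse_line).toDF(columns)
    # toPandas() collects every row to the driver, so call it only once
    # the distributed processing has made the data small enough.
    return spark_df.toPandas()
```

In the notebook this could be called as rdd_to_pandas(sc.textFile("swift://notebooks.books/rtenews.csv"), ["col_a", "col_b"]), where the column names are placeholders for whatever the file actually contains.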

1 Answer

Once you have uploaded the CSV file to your Bluemix Object Storage, you can read the CSV file using Spark directly:

data = sc.textFile("swift://notebooks.books/rtenews.csv")

This works because the Spark service has already been configured to access Object Storage through swift:// URLs.

If you try to read the CSV file with the following code using pandas:

import pandas as pd 
data = pd.read_csv('swift://notebooks.books/rtenews.csv')

This will not work, because pandas does not support direct access to Bluemix Object Storage. Have a look at the API documentation of pandas.read_csv(): http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html It accepts only a few URL schemes (http, ftp, s3, and file), and swift is not one of them.

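However, it is possible to read a CSV file from your Bluemix Object Storage into a pandas.DataFrame by fetching its contents into a StringIO object and passing that object to pandas.read_csv().

You can find the instructions in the "Precipitation Analysis" sample notebook.

Do not use this approach for large CSV files! The entire file is held in memory in a single Python process.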
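
As a rough illustration of the StringIO idea, the sketch below covers only the parsing step. It assumes the CSV contents have already been downloaded as text, for example from the body of an authenticated HTTP request to Object Storage. The download itself is not shown, and the helper name `csv_text_to_dataframe` and the sample data are made up for illustration.

```python
import io

import pandas as pd


def csv_text_to_dataframe(csv_text):
    """Parse CSV text that is already in memory into a pandas DataFrame."""
    # read_csv accepts any file-like object, so wrapping the text in
    # StringIO sidesteps the need for a path or URL scheme pandas knows.
    return pd.read_csv(io.StringIO(csv_text))


# Stand-in for the downloaded file contents (hypothetical columns and rows).
sample_text = "title,category\nFirst headline,news\nSecond headline,sport\n"

df = csv_text_to_dataframe(sample_text)
print(df.shape)
```

On Python 2, which many notebooks used at the time, the equivalent class lives in the StringIO module rather than io.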

Sven Hafeneger