
How can I read a CSV at a URL into a dataframe in PySpark without writing it to disk?

I've tried the following with no luck:

import urllib.request
from io import StringIO

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
response = urllib.request.urlopen(url)
data = response.read()
text = data.decode('utf-8')

f = StringIO(text)

# Fails: DataFrameReader.csv expects a path string, not a file-like object
df1 = sqlContext.read.csv(f, header=True, schema=customSchema)
df1.show()
RobinL
  • Interesting, you mean these data are not maintained in memory, there is some sort of caching to temporary disk? – sAguinaga Mar 24 '18 at 12:54

1 Answer


TL;DR It is not possible, and in general transferring data through the driver is a dead end.

  • Before Spark 2.3, the csv reader can read only from a URI (and http is not supported).
  • In Spark 2.3 you can use an RDD:

    spark.read.csv(sc.parallelize(text.splitlines()))
    

    but the data will be written to disk.

  • You can build the DataFrame from Pandas with createDataFrame:

    spark.createDataFrame(pd.read_csv(url))
    

    but this once again writes to disk.
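As an aside on the RDD route above: the object handed to `sc.parallelize` is just the list of lines produced by `str.splitlines()`. Here is a plain-Python sketch of that step, with no Spark required; the iris column names and sample rows are assumptions based on the URL in the question. The stdlib `csv` module parses the same lines locally, mirroring the line-oriented parsing Spark's csv reader performs after distributing them:

```python
import csv
from io import StringIO

# Small in-memory stand-in for the downloaded CSV text (contents assumed)
text = (
    "SepalLength,SepalWidth,PetalLength,PetalWidth,Name\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

# This list of strings is exactly what sc.parallelize(...) would distribute
lines = text.splitlines()
print(lines[0])  # the header row

# Parsing locally with the csv module, one record per line
rows = list(csv.reader(StringIO(text)))
print(len(rows))  # header + 2 data rows
```

The difference in Spark is only that the lines are scattered across executors before parsing, which is why the shuffle to disk mentioned above is hard to avoid.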

If the file is small, I'd just use SparkFiles:

from pyspark import SparkFiles

spark.sparkContext.addFile(url)

spark.read.csv(SparkFiles.get("iris.csv"), header=True)
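For intuition, `addFile` fetches the URL to a local directory on the driver (and makes it available to executors), and `SparkFiles.get` later resolves the local path. The following is a rough, stdlib-only sketch of just the fetch step, using a `file://` URL as a network-free stand-in for the real http URL; the file contents and paths are made up for illustration and this is not Spark's actual implementation:

```python
import os
import tempfile
import urllib.request

# Create a small local CSV and reference it via a file:// URL,
# standing in for the remote iris.csv (no network access needed)
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "iris.csv")
with open(src, "w") as f:
    f.write("SepalLength,SepalWidth,PetalLength,PetalWidth,Name\n"
            "5.1,3.5,1.4,0.2,Iris-setosa\n")
url = "file://" + src

# Roughly what addFile does: download the URL to a managed local path,
# which SparkFiles.get() later returns for the reader to open
dest = os.path.join(tmpdir, "fetched.csv")
urllib.request.urlretrieve(url, dest)

with open(dest) as f:
    print(f.readline().strip())  # header row of the fetched copy
```

Note that this still touches local disk once, which is consistent with the answer's point: there is no fully in-memory path from a URL into a Spark DataFrame.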
Alper t. Turker
    Should I assume that reading a CSV from URL, such as in the case of `spark.createDataFrame(pd.read_csv(url)))`, is going to write the entire file to disk as it downloads it? Ultimately, would you mind speaking to the overall best case on storage and memory usage one can achieve with spark, pandas, and iterating through CSV files? I'm happy to ask a new question if desired. – ThatsAMorais Aug 26 '18 at 06:35