
I have some tab-separated data on s3 in a directory s3://mybucket/my/directory/.

Now, I am telling pyspark that I want to use \t as the delimiter to read in just one file like this:

from pyspark import SparkContext

from pyspark.sql import HiveContext, SQLContext, Row
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql.functions import col, date_sub, log, mean, to_date, udf, unix_timestamp
from pyspark.sql.window import Window
from pyspark.sql import DataFrame

sc = SparkContext()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
indata_creds = sqlContext.read.load('s3://mybucket/my/directory/onefile.txt').option("delimiter", "\t")

But it is telling me: assertion failed: No predefined schema found, and no Parquet data files or summary files found under s3://mybucket/my/directory/onefile.txt

How do I tell pyspark that this is a tab-delimited file and not a parquet file?

Or, is there an easier way to read in all the files in the entire directory at once?

thanks.

EDIT: I am using pyspark version 1.6.1

The files are on s3, so I am not able to use the usual:

indata_creds = sqlContext.read.text('s3://mybucket/my/directory/')

because when I try that, I get java.io.IOException: No input paths specified in job

Anything else I can try?

makansij

2 Answers


Since you're using Apache Spark 1.6.1, you need the spark-csv package to use this code:

indata_creds = sqlContext.read.format('com.databricks.spark.csv').option('delimiter', '\t').load('s3://mybucket/my/directory/onefile.txt')

That should work!

Another option is, for example, the approach in this answer: load the file as a plain RDD of lines, split each line on tabs instead of commas, and then convert the RDD into a DataFrame. However, the first option is easier and already loads the data into a DataFrame.
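
A rough sketch of that RDD route, assuming the same SparkContext/SQLContext as in your question and using made-up column names, could look like this:

from pyspark.sql import Row

# Read the raw lines, split each one on tabs, and build Row objects.
# first_col/second_col are placeholder names for illustration only.
lines = sc.textFile('s3://mybucket/my/directory/onefile.txt')
parts = lines.map(lambda line: line.split('\t'))
rows = parts.map(lambda p: Row(first_col=p[0], second_col=p[1]))
indata_creds = sqlContext.createDataFrame(rows)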

As for the alternative you mention in your comment, I wouldn't convert the data to parquet files. There is no need for it unless your data is really huge and compression is necessary.
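
If you did want that conversion anyway, a minimal sketch (assuming the DataFrame loaded above, with an example output path) would be:

# Sketch only: write the loaded DataFrame back to S3 as Parquet.
# The output path here is just an example, not from your setup.
indata_creds.write.parquet('s3://mybucket/my/directory_parquet/')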

For your second question in the comment: yes, it is possible to read the entire directory at once. Spark supports glob patterns in load paths, so you could do something like this:

indata_creds = sqlContext.read.format('com.databricks.spark.csv').option('delimiter', '\t').load('s3://mybucket/my/directory/*.txt')

By the way, why are you not using Spark 2.x.x? It's also available on AWS.

Dat Tran
  • NO sorry, that does not work. First of all, I get `AttributeError: 'DataFrameReader' object has no attribute 'csv'` with your code above. And, when I try to do `indata_creds = spark_session.read.option('sep', '\t').load('s3://mybucket/my/directory/onefile.txt')` I get the same error that I posted about parquet files. – makansij Jul 17 '17 at 14:20
  • There could be other alternatives: 1) Is there a way to compress all of these text files into a few parquet files on s3? 2) Is there a way to read in the entire directory all at once? – makansij Jul 17 '17 at 14:26
  • that first option you posted still just does not work. I get `no input paths specified`, but when I check s3 the file is clearly there. – makansij Jul 17 '17 at 15:53
  • Did you download the jar and put it on your EC2 instance? That solution should work. Otherwise you're just doing it wrong. – Dat Tran Jul 18 '17 at 06:55

The actual problem was that I needed to add my AWS keys to my spark-env.sh file.
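
The same keys can also be supplied from PySpark itself through the Hadoop configuration instead of spark-env.sh. A sketch with placeholder values, assuming the s3n connector (s3a uses fs.s3a.access.key and fs.s3a.secret.key instead):

# Sketch only: set S3 credentials on the SparkContext's Hadoop configuration.
# Placeholder values; these property names apply to the s3n connector.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")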

makansij