I have some tab-separated data on S3 in the directory s3://mybucket/my/directory/.
Now I am trying to tell pyspark to use \t as the delimiter when reading in just one file, like this:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext, Row
from pyspark.sql.types import *
from datetime import datetime
from pyspark.sql.functions import col, date_sub, log, mean, to_date, udf, unix_timestamp
from pyspark.sql.window import Window
from pyspark.sql import DataFrame
sc = SparkContext()
sc.setLogLevel("DEBUG")
sqlContext = SQLContext(sc)
indata_creds = sqlContext.read.load('s3://mybucket/my/directory/onefile.txt').option("delimiter", "\t")
But it is telling me: assertion failed: No predefined schema found, and no Parquet data files or summary files found under s3://mybucket/my/directory/onefile.txt
How do I tell pyspark that this is a tab-delimited file and not a Parquet file? Or, is there an easier way to read in all of the files in the directory at once?
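For reference, this is the sort of call I think I need, using the Databricks spark-csv package (just a sketch: it assumes spark-csv is on the classpath, e.g. via --packages com.databricks:spark-csv_2.10:1.4.0, and that my files have no header row):
# Sketch: read one tab-delimited file with the spark-csv package
indata_creds = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .option('delimiter', '\t') \
    .option('header', 'false') \
    .load('s3://mybucket/my/directory/onefile.txt')
Is that the right direction?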
Thanks.
EDIT: I am using pyspark version 1.6.1.
The files are on s3, so I am not able to use the usual:
indata_creds = sqlContext.read.text('s3://mybucket/my/directory/')
because when I try that, I get java.io.IOException: No input paths specified in job
Anything else I can try?
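For what it's worth, the fallback I am considering is dropping to the RDD API and splitting on tabs myself, roughly like this (an untested sketch; col1 and col2 are placeholder column names, and I am not sure it avoids the same IOException):
from pyspark.sql import Row

# Fallback sketch: read every file in the directory as plain text,
# split each line on tabs, and build a DataFrame by hand.
# col1/col2 stand in for the real column names.
lines = sc.textFile('s3://mybucket/my/directory/')
rows = lines.map(lambda l: l.split('\t')) \
            .map(lambda p: Row(col1=p[0], col2=p[1]))
indata_creds = sqlContext.createDataFrame(rows)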