I have a delimited .txt file in AWS S3. The data is delimited with þ:
839729þ25þad@xxx.comþfirstnameþlastnameþ0þBLACKþ28/08/2017þ12329038þ99þ287þ81þ0
I tried two approaches to import the data: the Databricks CSV reader and a plain SparkContext. While the Databricks approach ran without throwing an error, there was no data in the DataFrame. The SparkContext approach just threw an error: Cannot run multiple SparkContexts at once.
Below is the code for the two approaches I tried:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
path = "s3://XXX.txt"
df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("sep", "þ")
      .load(path)
      .distinct()
      .cache())
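For context on the delimiter itself, a quick plain-Python check (no Spark involved): þ is a single character, U+00FE, but its byte representation depends on the file's encoding, which I suspect is relevant to how the sep option sees it.

```python
# þ is one character (U+00FE), so it is a legal single-character separator,
# but its on-disk bytes differ by charset: one byte (0xFE) in Latin-1,
# two bytes (0xC3 0xBE) in UTF-8.
print(len("þ"), hex(ord("þ")))   # 1 0xfe
print("þ".encode("latin-1"))     # b'\xfe'
print("þ".encode("utf-8"))       # b'\xc3\xbe'
```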
2nd approach:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf().setMaster("local").setAppName("test")
sc = SparkContext(conf = conf)
path = "s3://XXX.txt"
input = sc.textFile(path).map(lambda x: x.split('þ'))
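As a sanity check on the split itself (plain Python, no Spark, using the anonymised sample row from the top of the post), splitting on þ does produce the 13 fields I expect, so I don't think the lambda is the problem:

```python
# Mirrors the .map(lambda x: x.split('þ')) step on one line of sample data.
row = "839729þ25þad@xxx.comþfirstnameþlastnameþ0þBLACKþ28/08/2017þ12329038þ99þ287þ81þ0"
fields = row.split("þ")
print(len(fields))            # 13
print(fields[0], fields[6])   # 839729 BLACK
```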
For the first approach, even though there is no usable data, it is clearly reading the first row of the raw data as the header, because df.show(10, False) gave me the following output:
|��839729�%25�%zulekhasaiyad@yahoo.com�%Zulekha�%Ali�%0�%Blue�%28/08/2017�%329559038�%12�%128932287�%3081�%0|
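The leading `��` in that output looks to me like a byte-order mark decoded with the wrong charset, so I suspect an encoding mismatch rather than a delimiter problem. Here is a plain-Python sketch of that suspicion (UTF-16 is only my guess for the file's real encoding, I haven't confirmed it):

```python
# If the file were e.g. UTF-16 but decoded as UTF-8, the 0xFF 0xFE byte-order
# mark would come out as two U+FFFD replacement characters, matching the
# leading garbage I see in df.show().
row = "839729þ25þad@xxx.com"
raw = row.encode("utf-16")                   # prepends the BOM
garbled = raw.decode("utf-8", errors="replace")
print(garbled.startswith("\ufffd\ufffd"))    # True
```

If that's what is going on, I assume the fix would be telling the reader the actual charset, e.g. `.option("charset", "...")` on the com.databricks.spark.csv reader — but I don't know which encoding my file really is.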
I am completely new to Spark, and by extension PySpark, so please go easy on me :) Thanks.