
I have a delimited .txt file in AWS S3. The data is delimited with þ:

839729þ25þad@xxx.comþfirstnameþlastnameþ0þBLACKþ28/08/2017þ12329038þ99þ287þ81þ0

I tried using Databricks and a plain SparkContext to import the data. While the Databricks approach ran without throwing an error, there was no data in the dataframe. The SparkContext approach just threw an error saying: Cannot run multiple SparkContexts at once.

Below is the code for the two approaches that I tried:

# Run in a Databricks notebook, where sqlContext is already defined
path = "s3://XXX.txt"
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("sep", "þ")
      .load(path)
      .distinct()
      .cache())

Second approach:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("test")
# This is the line that throws "Cannot run multiple SparkContexts at once"
# on Databricks, because the notebook has already created a context
sc = SparkContext(conf=conf)

path = "s3://XXX.txt"
rows = sc.textFile(path).map(lambda x: x.split('þ'))  # renamed from "input", which shadows a builtin
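
As an aside, SparkContext.getOrCreate / SparkSession.builder.getOrCreate() avoid the "Cannot run multiple SparkContexts at once" clash by reusing the context the notebook already started (this is also suggested in the comments below). A minimal sketch, assuming a Spark 2.x Databricks environment:

from pyspark.sql import SparkSession

# getOrCreate() returns the session Databricks has already started
# instead of trying to construct a second SparkContext
spark = SparkSession.builder.appName("test").getOrCreate()
sc = spark.sparkContext
rows = sc.textFile("s3://XXX.txt").map(lambda x: x.split(u'\u00fe'))  # þ as a unicode escape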

For the first approach, even though there is no data, it does seem to be reading the first row of the raw file as the header, because df.show(10, False) gave the following output:

|��839729�%25�%zulekhasaiyad@yahoo.com�%Zulekha�%Ali�%0�%Blue�%28/08/2017�%329559038�%12�%128932287�%3081�%0|

I am completely new to Spark and, by extension, PySpark, so please go easy on me :) Thanks.

  • Did you try using the hex code for that character? And you can't use 2 Contexts... Use the getOrCreate function to fix that – OneCricketeer Sep 01 '17 at 03:09
  • Also, are you only seeing an encoding problem in the data? Are you sure that's the actual delimiter? If all else fails, find out what the raw bytes of the delimiter actually are – OneCricketeer Sep 01 '17 at 03:12
  • This data is also present in a SQL Server database, and there the delimiter used to split the data is the same. How do I use the hex code that you mentioned? I am new to Python and Spark. Thanks. – Raj Sep 01 '17 at 04:08
  • This problem has been solved in Scala https://stackoverflow.com/questions/36007686/how-to-parse-a-csv-that-uses-a-i-e-001-as-the-delimiter-with-spark-csv – MaFF Sep 01 '17 at 06:30

2 Answers


The correct option is `delimiter`, not `sep`:

...
    .option("delimiter", "þ")
  • There was no difference in the results with either `sep` or `delimiter`. The output was the same in both cases. I have included the output in the question above – Raj Aug 31 '17 at 23:56
  • Use the `delimiter` option with the unicode `\u` escape of the character in Scala, or `\x` in PySpark – MaFF Sep 01 '17 at 06:33
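
Putting the answer and the comments together, a minimal PySpark sketch (a sketch only: it assumes Spark with the spark-csv package and a Python 2 driver, and the S3 path is the same placeholder as in the question):

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", u"\u00fe")  # þ written as a unicode escape
      .load("s3://XXX.txt"))
df.show(10, False)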

You should use the `delimiter` option and hex-escape the special character:

df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter","\xc3\xbe").load(path).distinct().cache()
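
For context, "\xc3\xbe" is the UTF-8 byte sequence for þ (U+00FE). A quick Python 2 check, which is also a way to verify what the raw bytes of your delimiter actually are, as suggested in the comments on the question:

>>> u"\u00fe".encode("utf-8")
'\xc3\xbe'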