
I have the following badly formatted txt file:

id;text;contact_id
1;Reason contact\
\
The client was not satisfied about the quality of the product\
\
;c_102932131

I'm trying to load the file with PySpark using:

df = spark.read\
.option("delimiter", ";")\
.option("header", "true")\
.option("inferSchema", "true")\
.option("multiLine", "true")\
.option("wholeFile", "true")\
.csv(os.path.join(appconfig.configs[appconfig.ENV]["ROOT_DIR"], "data", "input", file_name))

But the text column is truncated; the resulting dataframe is:

id|text|contact_id
1|Reason contact|null
null|null|c_102932131

So I lose all the other lines. The goal is to read the file correctly, like this:

id|text|contact_id
1|Reason contact The client was not satisfied about the quality of the product|c_102932131

How can I do this? Thank you
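For context, the `multiLine` option only applies to newlines that sit inside quoted fields; the backslash line-continuations in this file are not quoting, which is why the option has no effect here. A minimal stdlib `csv` sketch (no Spark, data inlined for illustration) showing that a quoted multi-line field would parse as a single record:

```python
import csv
import io

# A quoted field may legally contain newlines; CSV readers keep the
# record together instead of splitting it across rows.
data = 'id;text;contact_id\n1;"Reason contact\nThe client was not satisfied";c_102932131\n'
rows = list(csv.reader(io.StringIO(data), delimiter=";"))
print(rows[1])  # one record, newline preserved inside the text field
```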

br1

1 Answer


Use .wholeTextFiles to read the whole file, then replace the newlines (\n) and backslashes, and finally create the dataframe.

Example:

Spark-Scala:

import spark.implicits._                       // for toDF on the RDD
import org.apache.spark.sql.functions.col

sc.wholeTextFiles("<file_path>").
toDF().
selectExpr("""split(replace(regexp_replace(_2,"[\\\\|\n]",""),"id;text;contact_id",""),";") as new""").
withColumn("id",col("new")(0)).
withColumn("text",col("new")(1)).
withColumn("contact_id",col("new")(2)).
drop("new").
show(false)
//+---+---------------------------------------------------------------------------+-----------+
//|id |text                                                                       |contact_id |
//+---+---------------------------------------------------------------------------+-----------+
//|1  |Reason contactThe client was not satisfied about the quality of the product|c_102932131|
//+---+---------------------------------------------------------------------------+-----------+

Pyspark:

from pyspark.sql.functions import *

sc.wholeTextFiles("<file_path>").\
toDF().\
selectExpr("""split(replace(regexp_replace(_2,'[\\\\\\\\|\n]',''),"id;text;contact_id",""),";") as new""").\
withColumn("id",col("new")[0]).\
withColumn("text",col("new")[1]).\
withColumn("contact_id",col("new")[2]).\
drop("new").\
show(10,False)
#+---+---------------------------------------------------------------------------+-----------+
#|id |text                                                                       |contact_id |
#+---+---------------------------------------------------------------------------+-----------+
#|1  |Reason contactThe client was not satisfied about the quality of the product|c_102932131|
#+---+---------------------------------------------------------------------------+-----------+
notNull
  • Hi, thanks for your answer. Can you explain the selectExpr function as you wrote it in the example? What is _2? Does it refer to the column? What is the last ""? Thank you – br1 Apr 26 '20 at 20:09
  • @br1, Yes, `_2` is the column name after converting to a `dataframe`, and `selectExpr` is used to execute a set of SQL expressions http://spark.apache.org/docs/1.5.2/api/python/pyspark.sql.html#pyspark.sql.DataFrame.selectExpr and `"""` treats the whole expression in selectExpr as a `string`. – notNull Apr 26 '20 at 20:32
  • Thank you for your answer. It is actually working! – br1 Apr 26 '20 at 21:06
  • Hi, your solution works only with one line. If the file has multiple records, your solution collapses everything into a single element. This is not correct. Maybe the problem is in the split function. – br1 May 01 '20 at 16:12
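Regarding that last comment: the regex in the answer strips every backslash and newline across the whole file, so with several records all lines collapse into one. A minimal pure-Python sketch (no Spark; the `strip_continuations` helper name is hypothetical) of a preprocessing step that removes only the backslash line-continuations and keeps the real record boundaries:

```python
import re

def strip_continuations(raw: str) -> str:
    # Remove only "\<newline>" continuation pairs, leaving the genuine
    # newlines that terminate each record intact.
    return re.sub(r"\\\n", "", raw)

raw = (
    "id;text;contact_id\n"
    "1;Reason contact\\\n"
    "\\\n"
    "The client was not satisfied about the quality of the product\\\n"
    "\\\n"
    ";c_102932131\n"
)
print(strip_continuations(raw))
```

The preprocessed string can then be parsed as ordinary `;`-delimited CSV (for example by writing it back out and reading it with spark.read.csv, or by applying the same split-per-line approach), so files with many records keep one row per record.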