I have a program that writes a pipe-delimited file using PySpark. I want to write the file using Ç (C with cedilla) as the delimiter instead.

Sample code:

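from pyspark.sql import functions as F
from pyspark.sql.types import StringType
separator = '|'  # field delimiter; a None column contributes only the delimiter
concat_udf1 = F.udf(lambda cols: "".join([x + separator if x is not None else separator for x in cols]), StringType())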

Current dataframe output

7|2020-03-31|xyz
7|2020-03-31|abc

Desired dataframe output

7Ç2020-03-31Çxyz
7Ç2020-03-31Çabc

If I change the separator to Ç (cedilla), I get the error below:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Any help would be appreciated - TIA

wjandrea
vivek
    What throws this error? It's a classic case of improper encoding usage (and that's what the error is telling you as well). – DaveIdito Aug 03 '20 at 19:31

1 Answer

This command in the terminal will work as intended:

< cedilla-dataframe-txt tr '\u00c7' '|'

Alternatively, paste the cedilla character itself in place of '\u00c7'.
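
On the Python side, the traceback in the question looks like the Python 2 pattern in which 'Ç' typed in a UTF-8 source file is a byte string starting with 0xc3, and joining it with a unicode column value triggers an implicit ASCII decode. A minimal sketch of one way around that, assuming this is indeed the cause, is to define the separator as a text string. The names `SEPARATOR` and `join_fields` are hypothetical, and the function uses `separator.join`, so unlike the lambda in the question it does not append a trailing separator:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals  # makes string literals text on Python 2 as well

# Hypothetical name: Ç written as an escape, so it is a text string, not UTF-8 bytes
SEPARATOR = "\u00c7"


def join_fields(cols, separator=SEPARATOR):
    """Join column values with the separator, treating None as an empty field."""
    return separator.join("" if x is None else x for x in cols)


# Ç encoded as UTF-8 begins with the byte 0xc3 named in the traceback
assert SEPARATOR.encode("utf-8") == b"\xc3\x87"

# Sample row taken from the question's output
result = join_fields(["7", "2020-03-31", "xyz"])
```

If this fits the use case, the function could be wrapped the same way as in the question, for example `F.udf(join_fields, StringType())`.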

greybeard