PySpark Replace Characters using regex and remove column on Databricks

Question

I am tring to remove a column and special characters from the dataframe shown below.

The code below used to create the dataframe is as follows:

dt = pd.read_csv(StringIO(response.text), delimiter="|", encoding='utf-8-sig')

The above produces the following output:

I need help with regex to remove the characters Ã¯Â»Â¿ and delete the first column.

As regards regex, I have tried the following:

dt.withColumn('Ã¯Â»Â¿COUNTRY ID', regexp_replace('Ã¯Â»Â¿COUNTRY ID', @"[^0-9a-zA-Z_]+"_ ""))

However, I'm getting a syntax error.

Any help much appreciated.

regexp replace is pyspark function for rows and it looks like you are using it for pandas dataframe for column name replace — Anjaneya Tripathi, Jun 13 '22 at 16:35
Anjaneya, thanks for reaching out. I see what you mean, I was hoping someone could help with regex for Pandas. Can you still help with Regex for Pandas? If not can you let me know how to get the result with PySpark. I hope I'm making sense — Patterson, Jun 13 '22 at 19:21
Alternatively, can you let me know what the alternative is in pyspark for ```dt = pd.read_csv(StringIO(response.text),sep="|", encoding='base64')``` — Patterson, Jun 13 '22 at 19:27
My guess is ```dt = spark.read.csv(StringIO(response.text),sep="|", encoding='base64')``` — Patterson, Jun 13 '22 at 19:27
I don't think it will work as read.csv expects a path to be provided but if you want some discussion without cluttering comments we can try [here](https://chat.stackoverflow.com/rooms/245568/spark-discussion) — Anjaneya Tripathi, Jun 13 '22 at 19:32
I think I have probably confused the community with my request in this question. So, I will submit a new question. — Patterson, Jun 13 '22 at 19:44
Could you post the output of `response.text`? and is the response is from requests module? — Emma, Jun 13 '22 at 22:40

score 1 · Accepted Answer · answered Jun 13 '22 at 21:35

If the position of incoming column is fixed you can use regex to remove extra characters from column name like below


import re

colname = pdf.columns[0]
colt=re.sub("[^0-9a-zA-Z_\s]+","",colname)
print(colname,colt)
pdf.rename(columns={colname:colt}, inplace = True)

And for dropping index column you can refer to this stack answer

score 0 · Answer 2 · answered Jun 13 '22 at 22:40

You have read in the data as a pandas dataframe. From what I see, you want a spark dataframe. Convert from pandas to spark and rename columns. That will dropn pandas default index column which in your case you refer to as first column. You then can rename the columns. Code below

df=spark.createDataFrame(df).toDF('COUNTRY',' COUNTRY NAME').show()

PySpark Replace Characters using regex and remove column on Databricks

2 Answers2