How to remove star * from string using regex in PySpark

Question

I just started with PySpark. Here is the task:

I have an input of:

I need to use a regex to remove punctuation, underscores, and all leading and trailing spaces. The output must be all lowercase.

What I came up with is incomplete:

sentence = regexp_replace(trim(lower(column)), '\\*\s\W\s*\\*_', '')

and the result is:

How do I fix the regex here? I need to use regexp_replace here.

Thank you very much.

Try [`^[ \t_*]+|[ \t_*]+$`](https://regex101.com/r/qD0dE3/1) (or, if multiline mode is not on by default, `(?m)^[ \t_*]+|[ \t_*]+$`). If it does not work, please specify exactly what you need to remove and provide sample input with the expected output. – Wiktor Stribiżew, Jul 21 '16 at 20:37
The expected output is `hi you`, `no underscore`, and `remove punctuation then spaces`. Thanks. – mdivk, Jul 23 '16 at 03:11
Along with these results, the same regex should turn `" The Elephant's 4 cats. "` into `"the elephants 4 cats"`. – mdivk, Jul 23 '16 at 03:12
The requirements: 1. Remove punctuation, change to lower case, and strip leading and trailing spaces. 2. Only spaces, letters, and numbers should be retained; other characters should be eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after punctuation is removed. – mdivk, Jul 23 '16 at 03:15
It seems that you may use `^\W+|\W+$|[^\w\s]+|_`. The ^ and $ anchors must match line start/end. If the pattern must not overflow across lines, replace `\W+$` with `[^\w\n]+$` and `^\W+` with `^[^\w\n]+`. – Wiktor Stribiżew, Jul 23 '16 at 08:35
I added an answer, please consider accepting. If my answer proved helpful, please also consider upvoting the answer. – Wiktor Stribiżew, Jul 23 '16 at 17:06

score 1 · Accepted Answer · answered Jul 23 '16 at 17:00

You may use

^\W+|\W+$|[^\w\s]+|_

The ^ and $ anchors must match line start/end.

If the pattern must not overflow across lines, replace \W+$ with [^\w\n]+$ and the ^\W+ pattern with ^[^\w\n]+:

^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_

See the regex demo.
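
To make "overflow across lines" concrete, here is a small sketch that compares the two patterns on a multi-line string. It uses Python's `re` module as a stand-in for the Java regex engine behind Spark's `regexp_replace`; the two engines should agree for these patterns, but that is an assumption, and the input string is made up for illustration.

```python
import re

# First pattern from the answer: \W also matches newlines.
GREEDY = re.compile(r"^\W+|\W+$|[^\w\s]+|_", re.MULTILINE)
# Second pattern: newlines are excluded from the edge classes.
LINE_SAFE = re.compile(r"^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_", re.MULTILINE)

# Hypothetical input: two lines separated by a blank line.
text = "one\n\n two"

# With \W the newlines are consumed and the lines run together.
merged = GREEDY.sub("", text)
# With [^\w\n] only the leading space on the last line is removed.
preserved = LINE_SAFE.sub("", text)

print(repr(merged))
print(repr(preserved))
```

With the first pattern the line breaks disappear; with the second they survive.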

Explanation:

^ - start of line (if the multiline option is on by default; otherwise, try adding (?m) at the start of the pattern)
[^\w\n]+ - 1 or more non-word chars (non-[a-zA-Z0-9_]) except a newline
| - or
[^\w\n]+$ - 1 or more non-word chars at the end of the line ($)
| - or
[^\w\s]+ - 1 or more non-word chars except any whitespace
| - or
_ - an underscore.
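
A minimal sketch of how the pattern might be plugged into `regexp_replace`, assuming a DataFrame API call rather than a SQL string. The helper names `remove_punctuation` and `remove_punctuation_py` are hypothetical, and the pure-Python twin exists only to sanity-check the pattern without a Spark session. Python's `\w` is Unicode-aware by default while Java's is not, so results may differ for non-ASCII text. Apart from the elephant sentence quoted in the comments, the sample inputs are invented to match the expected outputs listed there.

```python
import re

# Second pattern from the answer, with (?m) so ^ and $ match at line boundaries.
PATTERN = r"(?m)^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_"


def remove_punctuation(column):
    """Lowercase a Spark column, then drop punctuation, underscores and edge whitespace."""
    # Imported lazily so the pattern can be checked without a Spark installation.
    from pyspark.sql.functions import lower, regexp_replace

    return regexp_replace(lower(column), PATTERN, "")


def remove_punctuation_py(text):
    """Pure-Python stand-in for the Spark expression, for checking the pattern only."""
    return re.sub(PATTERN, "", text.lower())


# Hypothetical inputs, except the elephant sentence taken from the comments.
samples = [
    "Hi, you!",
    " No under_score!",
    " *      Remove punctuation then spaces  * ",
    " The Elephant's 4 cats. ",
]
for s in samples:
    print(repr(remove_punctuation_py(s)))
```

On a DataFrame this could be applied as `df.select(remove_punctuation(col("sentence")).alias("sentence"))`, where `df` and the column name `sentence` are placeholders.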

If you do not really care about Unicode (I used \w and \s, which can be made Unicode-aware), you may use a shorter, simpler pattern. Note that it also removes digits; add 0-9 to each character class if digits must be kept:

^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+

See this regex demo.
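
A quick check of the shorter pattern against the elephant sentence from the comments, again using Python's `re` as an assumed stand-in for Spark's Java regex engine. `ASCII_ALNUM` is a hypothetical variant, not part of the original answer, with `0-9` added to each class so that digits are retained.

```python
import re

# Shorter pattern from the answer: keeps ASCII letters and whitespace only.
ASCII_LETTERS_ONLY = r"^[^a-zA-Z\n]+|[^a-zA-Z\n]+$|[^a-zA-Z\s]+"
# Hypothetical variant: 0-9 added to each class so digits are kept as well.
ASCII_ALNUM = r"^[^a-zA-Z0-9\n]+|[^a-zA-Z0-9\n]+$|[^a-zA-Z0-9\s]+"

sample = " The Elephant's 4 cats. ".lower()

# The digit is stripped, leaving a double space where "4" used to be.
letters_only = re.sub(ASCII_LETTERS_ONLY, "", sample)
# The digit survives with the extended classes.
alnum = re.sub(ASCII_ALNUM, "", sample)

print(repr(letters_only))
print(repr(alnum))
```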

score 0 · Answer 2 · edited May 23 '17 at 12:01

TL;DR: sentence = column.strip(' \t\n*+_')

If you only want to remove characters from the ends and don't care about Unicode, the built-in string strip() method lets you pick the characters to strip. It defaults to whitespace, but you can pass whatever characters you want.

If you want to remove within a string you are stuck with a regular expression or, if using byte strings or Python 2, maketrans.
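
A short sketch of both approaches on a plain Python string (not a Spark column, which is not a Python `str`). The input is made up for illustration. In Python 3, `str.maketrans` with a third argument builds a table that deletes the listed characters, so `translate` is also available for ordinary strings.

```python
import string

# Hypothetical input with punctuation at the ends and in the middle.
raw = " *  Remove punctuation, then spaces_ * "

# strip() only trims the listed characters from both ends; the inner comma stays.
ends_only = raw.strip(" \t\n*+_")

# translate() deletes every ASCII punctuation character anywhere in the string;
# the final strip() removes the whitespace left at the ends.
delete_punctuation = str.maketrans("", "", string.punctuation)
cleaned = raw.lower().translate(delete_punctuation).strip()

print(repr(ends_only))
print(repr(cleaned))
```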

You may like to look at this question as well.

answered Jul 21 '16 at 19:47 by Charles Merriam; edited May 23 '17 at 12:01 by Community

Thanks, but it didn't produce the expected result. I should have mentioned that this requires using `regexp_replace`. – mdivk Jul 21 '16 at 19:53
