I just started with PySpark, and here is the task:

I have an input of:

[image: input data]

I need to use a regex to remove punctuation, underscores, and all leading and trailing spaces. The output should be all lowercase.

What I came up with is not complete:

sentence = regexp_replace(trim(lower(column)), '\\*\s\W\s*\\*_', '')

and the result is:

[image: current output]

How do I fix the regex here? I need to use regexp_replace here.

Thank you very much.

mdivk
  • And what is the text? What is the expected result? – Wiktor Stribiżew Jul 21 '16 at 19:16
  • Try [`^[ \t_*]+|[ \t_*]+$`](https://regex101.com/r/qD0dE3/1) (or, if multiline is not on by default, `(?m)^[ \t_*]+|[ \t_*]+$`). If it does not work, please specify exactly what you need to remove and provide input/expected output samples. – Wiktor Stribiżew Jul 21 '16 at 20:37
  • what is expected is: `hi you` and `no underscore` and `remove punctuation then spaces`, thanks – mdivk Jul 23 '16 at 03:11
  • With these results, the same regex should turn `" The Elephant's 4 cats. "` into `"the elephants 4 cats"` – mdivk Jul 23 '16 at 03:12
  • 1. Remove punctuation, change to lower case, and strip leading and trailing spaces. 2. Only spaces, letters, and numbers should be retained. Other characters should be eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after punctuation is removed. – mdivk Jul 23 '16 at 03:15
  • It seems that you may use `^\W+|\W+$|[^\w\s]+|_`. The ^ and $ anchors must match line start/end. If the pattern must not overflow across lines, replace `\W+$` with `[^\w\n]+$` and the `^\W+` with `^[^\w\n]+`. – Wiktor Stribiżew Jul 23 '16 at 08:35
  • Thank you, that works amazingly – mdivk Jul 23 '16 at 16:54
  • I added an answer, please consider accepting. If my answer proved helpful, please also consider upvoting the answer. – Wiktor Stribiżew Jul 23 '16 at 17:06

2 Answers

You may use

^\W+|\W+$|[^\w\s]+|_

The ^ and $ anchors must match line start/end.

If the pattern must not overflow across lines, replace \W+$ with [^\w\n]+$ and the ^\W+ pattern with ^[^\w\n]+:

^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_

See the regex demo.

Explanation:

  • ^ - start of line (if the multiline option is on by default; otherwise, try adding (?m) at the start of the pattern)
  • [^\w\n]+ - 1 or more non-word chars (non-[a-zA-Z0-9_]) except a newline
  • | - or
  • [^\w\n]+$ - 1 or more non-word chars at the end of the line ($)
  • | - or
  • [^\w\s]+ - 1 or more non-word chars except any whitespace
  • | - or
  • _ - an underscore.
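
As a quick way to check the alternation described above, here is a minimal sketch using Python's `re` module rather than Spark itself. The helper name `clean_sentence` is hypothetical, and the sketch assumes that Spark's Java regex engine treats this pattern the same way for plain ASCII input. In PySpark the same pattern string would be passed as the second argument of `regexp_replace`.

```python
import re

# Pattern from the answer; (?m) makes ^ and $ match at every line boundary.
PATTERN = re.compile(r"(?m)^[^\w\n]+|[^\w\n]+$|[^\w\s]+|_")


def clean_sentence(text):
    """Hypothetical helper: trim spaces, lowercase, then delete every match."""
    # Mirrors trim(lower(column)) from the question; Spark's trim removes spaces.
    prepared = text.strip(" ").lower()
    return PATTERN.sub("", prepared)


print(clean_sentence(" The Elephant's 4 cats. "))  # the elephants 4 cats
print(clean_sentence("__Hi, you!__"))              # hi you
```

Interior spaces survive because the third alternative excludes whitespace, while the first two alternatives only fire at the start or end of a line.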

If you do not really care about Unicode (I used \w and \s, which can be made Unicode-aware), you may just use a shorter, simpler pattern:

^[^a-zA-Z0-9\n]+|[^a-zA-Z0-9\n]+$|[^a-zA-Z0-9\s]+

See this regex demo.
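
The ASCII-only variant can be tried the same way. This is a sketch in plain Python, not Spark; the helper name `clean_sentence_ascii` is hypothetical, and the character classes here include `0-9` on the assumption that digits should be kept, as the asker's comments request.

```python
import re

# ASCII-only variant: anything that is not a letter or digit counts as noise.
# Assumption: digits are kept, since the expected output retains the "4".
ASCII_PATTERN = re.compile(
    r"(?m)^[^a-zA-Z0-9\n]+|[^a-zA-Z0-9\n]+$|[^a-zA-Z0-9\s]+"
)


def clean_sentence_ascii(text):
    """Hypothetical helper: trim spaces, lowercase, then delete every match."""
    prepared = text.strip(" ").lower()
    return ASCII_PATTERN.sub("", prepared)


print(clean_sentence_ascii(" The Elephant's 4 cats. "))  # the elephants 4 cats
print(clean_sentence_ascii("__Hi, you!__"))              # hi you
```

No separate `_` alternative is needed here, because the underscore is already outside the explicit letter and digit ranges.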

Wiktor Stribiżew

TL;DR: sentence = column.strip(' \t\n*+_')

If you want to remove characters only from the ends and don't care about unicode, then the basic string strip() function will let you pick characters to strip. It defaults to whitespace, but you can put in whatever you want.

If you want to remove characters within a string, you are stuck with a regular expression or, if using byte strings or Python 2, maketrans.
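
To illustrate the difference between stripping the ends and removing characters inside a string, here is a small sketch with plain Python strings (not Spark columns). The sample value is made up, and the second step uses Python 3's `str.maketrans` with `string.punctuation` as one possible deletion table.

```python
import string

raw = "__Hi, you!__ "

# strip() only removes the listed characters from both ends of the string.
ends_only = raw.strip(" \t\n*+_")

# str.translate with a deletion table removes characters anywhere (Python 3).
delete_punctuation = str.maketrans("", "", string.punctuation)
everywhere = ends_only.translate(delete_punctuation)

print(ends_only)   # Hi, you!
print(everywhere)  # Hi you
```

The comma and exclamation mark survive `strip()` because they are not at the ends once the underscores and space are gone, which is why interior punctuation needs a second step.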

You may like to look at this question as well.

Charles Merriam
  • Thanks, it didn't produce what was expected. I should have mentioned that this requires using `regexp_replace` – mdivk Jul 21 '16 at 19:53