Regular expressions in Python need to retain special characters

Question

Below is my unclean text string:

text = 'this/r/n/r/nis a non-U.S disclosures/n/n/r/r analysis agreements disclaimer./r/n/n/nPlease keep it confidential'

Below is the regexp I'm using:

 ' '.join(re.findall(r'\b(\w+)\b', text))

My output is:

'this is a non US disclosures analysis agreements disclaimer. Please keep it confidential'

My expected output is:

 'this is a non-U.S disclosures analysis agreements disclaimer. Please keep it confidential'

I need to retain the special characters, and there should be exactly one space between words. Can anyone help me alter my regexp?

Can you provide a valid Python string literal? The one you posted will raise a SyntaxError. — user2390182, Jan 29 '18 at 10:16
Are the `'/r/n/n'` in there like that or are these newlines and carriage returns like `'\r\n\n'`? — user2390182, Jan 29 '18 at 10:29
I do not believe your regex throws away punctuation (your 'output'), nor would it magically *add* some (your 'expected output'). — Jongware, Jan 29 '18 at 10:30
You guys are being sticklers for detail instead of answering the question which is quite clear in my opinion — Veltzer Doron, Jan 29 '18 at 11:01
@VeltzerDoron: if the OP feels the need to lie about a trivial detail such as "this is my output", then we cannot be sure about anything else either. — Jongware, Jan 29 '18 at 11:24

score 1 · Answer 1 · answered Feb 02 '18 at 05:23


Hope this works for you!

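import re

text = 'this/r/n/r/nis a non-U.S disclosures/n/n/r/r analysis agreements disclaimer./r/n/n/nPlease keep it confidential'

# Replace each '/' and the character that follows it with a space,
# then collapse runs of whitespace into a single space.
val = re.sub(r'(/.?)', " ", text)
val1 = re.sub(r'\s+', " ", val)
print(val1)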


Ajay


score 0 · Answer 2 · answered Jan 29 '18 at 11:00


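Use a more specific word boundary than \b. The character class [^\n\r ] matches any character that is not a newline, carriage return, or space, and the lookahead (?=$|\n|\r| ) requires each token to be followed by one of those separators or by the end of the string. $ loses its end-of-string meaning inside square brackets, so the alternation has to be written out explicitly as $|\n|\r| . Like \b, the (?=...) lookahead does not consume any characters. The + makes the match non-empty and the trailing ? makes it non-greedy, which is the safer choice here. Note that this assumes the separators are real newline and carriage-return characters, not the literal '/r' and '/n' shown in the question: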

re.findall(r'[^\n\r ]+?(?=$|\n|\r| )', text)

['this', 'is', 'a', 'non-U.S', 'disclosures', 'analysis', 'agreements', 'disclaimer.', 'Please', 'keep', 'it', 'confidential']
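
To get the single-spaced string the question asks for, the tokens can be joined back together. This is only a minimal sketch: it assumes the text contains real '\r' and '\n' control characters (the question's literal '/r/n' would not be split by this pattern), and the name cleaned is just illustrative.

```python
import re

# Assumption: the separators are real control characters, not literal '/r' and '/n'.
text = ('this\r\n\r\nis a non-U.S disclosures\n\n\r\r analysis agreements '
        'disclaimer.\r\n\n\nPlease keep it confidential')

# Each token is a run of characters other than newline, carriage return, or space,
# followed by one of those separators or by the end of the string.
tokens = re.findall(r'[^\n\r ]+?(?=$|\n|\r| )', text)

# Join the tokens with exactly one space between them.
cleaned = ' '.join(tokens)
print(cleaned)
```

Punctuation such as the hyphen and the periods survives because only the three separator characters are excluded from a token.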


Veltzer Doron


What happens if my text has \t or \xa0? Will the above regular expression work? text = 'this/r/n/r/nis a non-U.S disclosures \t \xa0 analysis agreements disclaimer./tPlease keep it confidential' – kabilan karunakaran Jan 29 '18 at 12:07
If you're asking whether there's a shortcut for every subset of the set of delimiters and special characters, then the answer is no. As for your \xa0 character, convert to UTF: https://stackoverflow.com/a/11566398/374437 – Veltzer Doron Jan 29 '18 at 13:31
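
For the tab and \xa0 case raised in the comments, one possible approach (not taken from either answer) is to treat every whitespace character as a separator instead of listing them one by one. The sketch below assumes Python 3 str input, where \s also matches the no-break space \xa0, and uses a hypothetical sample with real control characters.

```python
import re

# Hypothetical sample with real control characters, a tab, and a no-break space (\xa0).
text = ('this\r\n\r\nis a non-U.S disclosures \t \xa0 analysis agreements '
        'disclaimer.\tPlease keep it confidential')

# For str patterns in Python 3, \s matches Unicode whitespace, including \xa0.
cleaned = re.sub(r'\s+', ' ', text).strip()

# str.split() with no arguments also splits on runs of whitespace, including \xa0.
same = ' '.join(text.split())

print(cleaned)
```

Neither variant touches punctuation, so 'non-U.S' and 'disclaimer.' are kept as they are.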
