
Python newbie here, using 3.5. I feel this question is similar to others asked here, but despite having read those and trying to follow the advice given, I'm still not getting anywhere with this regex.

I have a string of text wherein I want to replace, with a space, all newlines which are not followed by either another newline or three spaces. I'm attempting to do this using a regular expression with a negative lookahead. I've learned I need to use multiline from this conversation. Still, though, my regex isn't identifying anything in my string. Basically, I want to match and replace the \r\n in the middle of the string below, while leaving those at the beginning and end of the string untouched.

body = 'foo foo\r\n\xa0\xa0\xa0foo foo foo\r\n\foo foo foo foo foo\r\n\r\n\foo foo foo'

breakRegex = re.compile(r'(\r\n)?!(\r\n)|(\r\n)?!(\s\s\s)', s,re.M)

breakRegex.sub(' ', body)

The desired and so-far-unreached outcome would be:

'foo foo\r\n\xa0\xa0\xa0foo foo foo foo foo foo foo foo\r\n\r\n\foo foo foo'

I've also tried the above without so many parentheses, substituting \s for \xa0 and more, but it still doesn't work... Thanks for any help you can give.

  • In the case where there are multiple newlines next to each other, do you want them all preserved, all but the last preserved, or only one left? – Patrick Haugh Jul 02 '17 at 16:36
  • Why isn't the last newline (in `\r\n\r\n\foo foo foo`) removed? – Aran-Fey Jul 02 '17 at 16:39
  • Thanks for the responses! Good questions, which my plan (and I) didn't sufficiently consider... I think it would actually be better if the additional \r\n were removed so that only one was left, though originally I'd wanted them preserved (despite my errant approach). – Peter Pressman Jul 03 '17 at 18:17

2 Answers


Is this what you want?

break_regex = re.compile(r'\r\n(?!=\r\n|\s\s\s)', re.M)

That is: all newlines \r\n, which are not followed by (?!=...) either ( | ) another newline \r\n or three spaces \s\s\s.

Edit:

  1. Sorry, I made a mistake: remove the = from the regex. A negative lookahead is written (?!...), not (?!=...), so the pattern should be \r\n(?!\r\n|\s\s\s). :)

  2. Did you mean this?:

body = 'foo foo\r\n\xa0\xa0\xa0foo foo foo\r\nfoo foo foo foo foo\r\n\r\nfoo foo foo'

Instead of:

body = 'foo foo\r\n\xa0\xa0\xa0foo foo foo\r\n\foo foo foo foo foo\r\n\r\n\foo foo foo'

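Because \f in a Python string literal means form feed (0x0c), so '\foo' is a form feed character followed by 'oo', not the word 'foo'.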
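Here is a minimal sketch of the corrected pattern applied to the cleaned-up string. Two things in it are assumptions rather than part of the answer above: the character class [ \xa0]{3} stands in for \s\s\s so that both ordinary and non-breaking spaces count, and the second pattern adds a lookbehind, which reproduces the desired output from the question by also leaving the second \r\n of a pair alone. The variable names are only illustrative.

```python
import re

# Assumed input: the question's string with the accidental \f escapes removed.
body = ('foo foo\r\n\xa0\xa0\xa0foo foo foo\r\n'
        'foo foo foo foo foo\r\n\r\nfoo foo foo')

# Negative lookahead only: skip a newline that is followed by another
# newline or by three (possibly non-breaking) spaces.
lookahead_only = re.compile(r'\r\n(?!\r\n|[ \xa0]{3})')

# Assumption: add a negative lookbehind so that a newline which is itself
# preceded by a newline is skipped too, keeping blank lines intact.
keep_blank_lines = re.compile(r'(?<!\r\n)\r\n(?!\r\n|[ \xa0]{3})')

joined_once = lookahead_only.sub(' ', body)
joined_keep = keep_blank_lines.sub(' ', body)

print(repr(joined_once))
print(repr(joined_keep))
```

With the lookahead alone, the second \r\n of the blank-line pair is still replaced, because nothing after it rules it out. That is the case the comments under the question ask about.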

Masood Lapeh
  • What you're typing makes sense to me, but I tried this regex both within the actual program and within pythex.org, and it doesn't recognize the line break. – Peter Pressman Jul 03 '17 at 18:21
  • Thanks for the correction. Still, though, after removing the = sign, the regex doesn't capture anything. You're right regarding the \f, though; I did mean body = 'foo foo\r\n\xa0\xa0\xa0foo foo foo\r\nfoo foo foo foo foo\r\n\r\nfoo foo foo' – Peter Pressman Jul 05 '17 at 17:42
  • Also, as an FYI, while the r'' version didn't work on Pythex, \\r\\n(?!\\r\\n|\\xa0\\xa0\\xa0) selects the desired group on Pythex. The version you give above, but with \xa0 instead of \s, seems to work in the actual code, which I imagine is an encoding issue. Thanks for your help! – Peter Pressman Jul 05 '17 at 19:01
  • You're welcome. Pythex behaves differently because, unlike in your regex string, it doesn't treat '\' in your test string as an escape character but as an actual '\', so the two don't match. When you change your regex to \\r\\n(?!\\r\\n|\\xa0\\xa0\\xa0), the regex itself looks for an actual '\', so they match. – Masood Lapeh Jul 05 '17 at 23:55
  • Also, the difference between \s and \xa0 is that \s matches many things, including the simple space character (' '), tab ('\t'), vertical tab ('\v'), newline ('\n'), carriage return ('\r'), and form feed ('\f'), whereas \xa0 matches only the non-breaking space. Also watch out for '\r\n': some platforms use a plain \n instead. – Masood Lapeh Jul 06 '17 at 00:10
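The two points in the comments above can be checked with a short sketch. It assumes Python 3 and str (not bytes) patterns, and the variable names are only illustrative. It contrasts a test string holding literal backslashes, which is how an online tester sees typed input, with a real CR LF pair, and then lists which characters \s accepts with and without the re.ASCII flag.

```python
import re

# What a web tester receives when you type \r\n: literal backslashes.
escaped_text = r'foo\r\nbar'
# What Python holds after interpreting the escapes: a real CR LF pair.
real_text = 'foo\r\nbar'

real_newline = re.compile(r'\r\n')       # matches an actual CR LF
typed_newline = re.compile(r'\\r\\n')    # matches backslash-r-backslash-n

print(len(escaped_text), len(real_text))
print(bool(real_newline.search(real_text)),
      bool(real_newline.search(escaped_text)),
      bool(typed_newline.search(escaped_text)))

# Which of these characters does \s accept?
candidates = [' ', '\t', '\v', '\n', '\r', '\f', '\xa0']
unicode_ws = [ch for ch in candidates if re.fullmatch(r'\s', ch)]
ascii_ws = [ch for ch in candidates if re.fullmatch(r'\s', ch, flags=re.ASCII)]

print(unicode_ws)
print(ascii_ws)
```

On a Python 3 str pattern, \s also accepts the non-breaking space unless re.ASCII is set, so a \s lookahead that fails on \xa0 may point to the data being something other than what was expected (for example bytes, or text in a different encoding).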
0
def clean_with_puncutation(text):
    """Replace each punctuation mark or marker tag in text with a named token."""
    from string import punctuation
    import re

    # Map every punctuation character, plus a few special markers, to a token.
    punctuation_token = {p: '<PUNC_' + p + '>' for p in punctuation}
    punctuation_token['<br/>'] = '<TOKEN_BL>'
    punctuation_token['\n'] = '<TOKEN_NL>'
    punctuation_token['<EOF>'] = '<TOKEN_EOF>'
    punctuation_token['<SOF>'] = '<TOKEN_SOF>'

    # Always put the multi-character markers at the front of the alternation
    # to avoid overlapping results (e.g. '<br/>' being consumed as a bare '<').
    # The pattern is split across two adjacent raw strings so that no newline
    # or indentation leaks into the character class.
    regex = (r"(<br/>)|(<EOF>)|(<SOF>)|"
             r"[\n\!\@\#\$\%\^\&\*\(\)\[\]\{\}\;\:\,\.\/\?\|\`\_\+\\\=\~\-\<\>]")

    # Sample input:
    # text = '<EOF>!@#$%^&*()[]{};:,./<>?\|`~-= _+\<br/>\n <SOF>\ '
    text_ = ""
    index = 0

    for match in re.finditer(regex, text):
        # Copy the text before the match, then the token padded with spaces.
        text_ += (text[index:match.start()] + " "
                  + punctuation_token[match.group()] + " ")
        index = match.end()

    # Keep whatever follows the last match.
    return text_ + text[index:]
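For comparison, here is a sketch of the same token-substitution idea written with re.sub and a replacement function. It is not the code from this answer: the names TOKENS, PATTERN and tokenize_punctuation are hypothetical, and it builds the pattern from the dictionary keys, so it covers every character in string.punctuation instead of a hand-written character class.

```python
import re
from string import punctuation

# Hypothetical token table, following the naming scheme used above.
TOKENS = {p: '<PUNC_' + p + '>' for p in punctuation}
TOKENS.update({
    '<br/>': '<TOKEN_BL>',
    '\n': '<TOKEN_NL>',
    '<EOF>': '<TOKEN_EOF>',
    '<SOF>': '<TOKEN_SOF>',
})

# Longest keys first, so '<br/>' is tried before a bare '<'.
PATTERN = re.compile('|'.join(
    re.escape(key) for key in sorted(TOKENS, key=len, reverse=True)))


def tokenize_punctuation(text):
    """Replace each recognised symbol with its token, padded with spaces."""
    return PATTERN.sub(lambda m: ' ' + TOKENS[m.group()] + ' ', text)


print(tokenize_punctuation('<SOF>Hi, you!<br/>'))
```

Because re.sub copies the unmatched text itself, there is no index bookkeeping and the tail after the last match is kept automatically.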