
I'm wondering how to remove the header of a previous message from an email. Here is an example message:

Something above

-----Message d'origine-----
De : Myself <myself@himself.com>
Envoyé : vendredi 8 mars 2019 14:30
À : Someone <someone@himself.com>
Cc : AnotherGuy <another@himself.com>
Objet : My bad I forgot how to do it

Hi,

blabla

I need to remove everything from -----Message d'origine----- up to and including the carriage return and the empty line before "Hi,".

I've tried the following regex:

-----Message d'origine-----[\s\S]*?[\r\n]

But it matched only -----Message d'origine-----, without the lines below it. However, if I use "Hi" as the end delimiter instead, it matches all the lines up to it:

-----Message d'origine-----[\s\S]*?Hi

Can anyone explain what the problem is, and how to use the carriage return and the empty line as the end delimiter instead?

Thank you :)

toshiro92

1 Answer


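Your lazy [\s\S]*? stops as soon as the following [\r\n] can match, and that already happens at the end of the marker line itself. You need to match up to the first occurrence of a double line break instead: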

r"-----Message d'origine-----[\s\S]*?(?:\r?\n){2}"
                                     ^^^^^^^^^^^^

See the regex demo. The (?:\r?\n){2} pattern matches two consecutive line endings, each of them either CRLF or LF.

Sample Python code:

import re
s = "YOUR STRING HERE"
s = re.sub(r"-----Message d'origine-----.*?(?:\r?\n){2}", '', s, flags=re.S)

Note that . is equivalent to [\s\S] when the re.S (= re.DOTALL) flag is used.
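As a quick check, here is a minimal sketch that runs both the original attempt and the fixed pattern against the sample message from the question. The names `message`, `first_try`, `header`, `cleaned` and `cleaned_crlf` are illustrative only, and the sample is assumed to use LF line endings (a CRLF copy is derived from it).

```python
import re

# Sample message from the question, assumed here to use LF line endings.
message = (
    "Something above\n"
    "\n"
    "-----Message d'origine-----\n"
    "De : Myself <myself@himself.com>\n"
    "Envoyé : vendredi 8 mars 2019 14:30\n"
    "À : Someone <someone@himself.com>\n"
    "Cc : AnotherGuy <another@himself.com>\n"
    "Objet : My bad I forgot how to do it\n"
    "\n"
    "Hi,\n"
    "\n"
    "blabla\n"
)

# Original attempt: the lazy quantifier matches nothing, and [\r\n]
# consumes the line break right after the marker line.
first_try = re.search(r"-----Message d'origine-----[\s\S]*?[\r\n]", message)
print(repr(first_try.group()))

# Fixed pattern: keep expanding until two consecutive line breaks are found.
header = re.compile(r"-----Message d'origine-----.*?(?:\r?\n){2}", flags=re.S)
cleaned = header.sub("", message)
print(cleaned)

# The same pattern also handles CRLF line endings.
cleaned_crlf = header.sub("", message.replace("\n", "\r\n"))
```

The first `print` shows that only the marker line plus one line break is matched, which is the behavior described in the question.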

If you are concerned about the performance cost of the non-greedy .*? pattern, unroll it as:

s = re.sub(r"-----Message d'origine-----.*(?:\r?\n(?!\r?\n).*)*\s*", "", s)

See this regex demo. Do not use re.S / re.DOTALL with this pattern!

The [\s\S]*?(?:\r?\n){2} part is now .*(?:\r?\n(?!\r?\n).*)*\s*:

  • .* - the rest of the line
  • (?:\r?\n(?!\r?\n).*)* - 0 or more repetitions of
    • \r?\n(?!\r?\n) - a line break not followed by another line break
    • .* - the rest of the line
  • \s* - the blank line and any other whitespace that follows the header
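The breakdown above can also be written out with re.VERBOSE so that each part carries its own comment. This is only a sketch: the name `unrolled` and the shortened sample `message` are made up for illustration, and `[ ]` stands in for the literal space in the marker because verbose mode ignores unescaped whitespace.

```python
import re

unrolled = re.compile(
    r"""
    -----Message[ ]d'origine-----   # marker ([ ] keeps the literal space)
    .*                              # the rest of the marker line
    (?:
        \r?\n(?!\r?\n)              # a line break not followed by another one
        .*                          # the rest of that line
    )*
    \s*                             # the blank line and any following whitespace
    """,
    flags=re.VERBOSE,  # deliberately no re.S / re.DOTALL here
)

# Shortened version of the sample message from the question.
message = (
    "Something above\n"
    "\n"
    "-----Message d'origine-----\n"
    "De : Myself <myself@himself.com>\n"
    "Objet : My bad I forgot how to do it\n"
    "\n"
    "Hi,\n"
    "\n"
    "blabla\n"
)

cleaned = unrolled.sub("", message)
cleaned_crlf = unrolled.sub("", message.replace("\n", "\r\n"))
print(cleaned)
```

One difference from the non-greedy version: the trailing \s* consumes all whitespace after the header, so several blank lines or leading indentation on the next line are removed as well.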
Wiktor Stribiżew
  • Why would the non-greedy pattern be less efficient than your alternative? I don’t know the internals of Python’s regex engine but naively they should be performing pretty much the same kind of work. – Konrad Rudolph Apr 10 '19 at 10:12
  • @KonradRudolph Check the number of steps for both regexps at regex101: 582 / 72. The difference is due to backtracking with `.*?` pattern since the pattern is expanded each time the subsequent patterns fail to match. `.*` grabs all chars up to the line break and there is less backtracking here since a check is made after the end of line is found, no need to expand any pattern char by char. – Wiktor Stribiżew Apr 10 '19 at 10:16
  • Hmm this strikes me as a performance bug in the implementation: surely the nongreedy match could be implemented in the same way? – Konrad Rudolph Apr 10 '19 at 10:18
  • @KonradRudolph That is expected behavior and no bug. See [Can I improve performance of this regular expression further](https://stackoverflow.com/questions/33869557) for a concrete example of how `.*` and `.*?` match. When dealing with long texts it is a bad idea to rely on `.*?` or `.*` if you are not sure about match positions. [Unroll the loop](https://stackoverflow.com/a/38018490/3832970) method should be used then, or discard regex altogether in favor of some parsing method. – Wiktor Stribiżew Apr 10 '19 at 10:20
  • Right, I got confused. I misunderstood you as saying that the `.*?` *itself* backtracks, rather than the following `(?:\r?\n){2}` (which is tried at each step and usually fails). I had also missed that your second example isn’t running in single-line mode, so `.` simply matches *less*, and *that’s* why it’s more efficient than the non-greedy case. – Konrad Rudolph Apr 10 '19 at 10:30
  • Great @WiktorStribiżew, thanks for all the tips you provided :) I was pretty close, but I did not know about flags or how to improve performance here – toshiro92 Apr 10 '19 at 11:43