to describe the problem:

I have some text with mail-header-lines like

From: me
To: you
Subject: welcome, this is a long line of subject with two 
         lines of text
Attachements: welcome.jpg, foo.pdf

The last line ('Attachements: welcome.jpg, foo.pdf') is OPTIONAL, so the text may also look like this:

From: me
To: you
Subject: welcome, this is a short line of subject

I need to extract the Subject line(s) without the text 'Subject:'. Leading and trailing whitespace is not a problem.

The only operation I can use is a SINGLE Qt regex call that returns the FULL MATCH ONLY.

great, isn't it ?

I tried this with success:

(?<=Subject:)(?:\s*)(.*)(?=Attachements:)

But how do I make the 'Attachements:' line optional?

When there is no 'Attachements:' line, I expect the text to end with the Subject line(s).

Any idea?

Marc

2 Answers

You can use a negative lookahead (?!...) for Attachements:

(?<=^Subject: )(?:(?!^Attachements:)[\s\S])+

Demo

By the way, I've changed .* to [\s\S] so that the subject can contain newlines.

mrzasa
  • Thank you. I also tried the variants with \z. But your sample does not contain the 'Attachements:' line, and it does not cut out only the Subject. – Marc Feb 14 '18 at 13:48
  • In fact, `([\S\s](?!Attachements:))+` is not a correct tempered greedy token, you should temper the dot before it matches a char, use `(?:(?!Attachements:)[\s\S])+` – Wiktor Stribiżew Feb 14 '18 at 14:16
  • Thanks @WiktorStribiżew, could you tell me why that is? My regexp seems to work correctly (but I'm eager to learn). – mrzasa Feb 14 '18 at 14:18
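
The ordering point raised in the comments can be checked with a small experiment. This is only a sketch in Python's `re` module, which is an assumption standing in for the Qt engine; the sample strings are made up for illustration. When the lookahead comes after the consumed character, the pattern can start matching right on the delimiter, and it stops one character too early otherwise.

```python
import re

# Lookahead BEFORE consuming a char: the tempered greedy token.
tempered = re.compile(r"(?:(?!Attachements:)[\s\S])+")
# Lookahead AFTER consuming a char: the misordered variant from the comment.
misordered = re.compile(r"(?:[\s\S](?!Attachements:))+")

# Hypothetical sample inputs, not taken from the question.
before_delimiter = "abAttachements: x.pdf"
at_delimiter = "Attachements: x.pdf"

def matched(pattern, text):
    """Return the text matched at position 0, or None if there is no match."""
    m = pattern.match(text)
    return m.group(0) if m else None

# The tempered token stops exactly in front of the delimiter.
print(repr(matched(tempered, before_delimiter)))
# The misordered variant refuses the char directly before the delimiter.
print(repr(matched(misordered, before_delimiter)))
# Starting on the delimiter: the tempered token does not match at all ...
print(repr(matched(tempered, at_delimiter)))
# ... while the misordered variant runs straight through it.
print(repr(matched(misordered, at_delimiter)))
```

In the question's pattern the match never starts on the delimiter, which is probably why the misordered variant appeared to work there.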

You may use

(?<=Subject:)\s*((?:(?![\r\n]Attachements:).)*)

See the regex demo

The pattern matches:

  • (?<=Subject:) - a positive lookbehind that matches a position inside a string where there is a Subject: substring immediately to the left of the current location
  • \s* - 0+ whitespace chars
  • ((?:(?![\r\n]Attachements:).)*) - Capturing group 1 that matches any char (the QRegExp . pattern also matches line breaks), zero or more times and as many as possible (*), as long as that char does not start a CR/LF + Attachements: char sequence. This construct is called a tempered greedy token.
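
To try the pattern against both inputs from the question, here is a minimal sketch in Python rather than Qt. It assumes that `re.DOTALL` reproduces the "`.` matches line breaks" behaviour described above; the helper name `extract_subject` is hypothetical. It reads the full match (group 0), since the question allows nothing else, and strips the surrounding whitespace only for display.

```python
import re

# The pattern from this answer. re.DOTALL is an assumption that emulates
# a "." which also matches line breaks.
SUBJECT_RE = re.compile(
    r"(?<=Subject:)\s*((?:(?![\r\n]Attachements:).)*)", re.DOTALL
)

def extract_subject(text):
    """Return the stripped full match, or None if there is no Subject: header."""
    m = SUBJECT_RE.search(text)
    return m.group(0).strip() if m else None

with_attachments = (
    "From: me\n"
    "To: you\n"
    "Subject: welcome, this is a long line of subject with two \n"
    "         lines of text\n"
    "Attachements: welcome.jpg, foo.pdf\n"
)
without_attachments = (
    "From: me\n"
    "To: you\n"
    "Subject: welcome, this is a short line of subject\n"
)

print(repr(extract_subject(with_attachments)))
print(repr(extract_subject(without_attachments)))
```

With the Attachements: line present the match stops at the line break in front of it; without it, the match simply runs to the end of the text.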
Wiktor Stribiżew