
I've googled and googled, and looked at dozens of other answers, but can't find anything that addresses removing TWO lines that begin with one string and end with another, so I'm not including "what I've tried" because the dozen or so patterns don't even come close.

We've extracted text from PDF files, and all the links within the result appear in the output as two lines like this:

[Link]
2017_07_11_RM_4b.pdf

They always start with [Link], and always end with .pdf. They appear throughout the result, often many times in a row, then a block of text, and more links, and so on - as many as 200+ occurrences. I'm trying to get the block of text without the occurrences of these two-line strings with a preg_replace() that presumably looks something like this:

$newtext = preg_replace("/^[Link]*$/", "", $text);

Any assistance is appreciated, thank you.

GDP

4 Answers


This expression (example at regex101.com) uses the multiline pattern modifier which changes ^ to match the start of every line instead of the start of the string.

/(?:^\[Link\]\n[^\n]*+\n)++/m

$newtext = preg_replace("/(?:^\[Link\]\n[^\n]*+\n)++/m", "", $text);

Additional notes

  • We are using possessive quantifiers to prevent unnecessary backtracking.
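
As the comment below points out, the pattern above removes whatever line follows `[Link]`, whether or not it ends in `.pdf`. Here is a minimal sketch of one way to add that check, assuming Unix (`\n`) line endings; the sample text is made up for illustration. The filename part uses a plain `*` rather than `*+`, because a possessive `[^\n]*+` would consume the `.pdf` and never give it back.

```php
<?php
// Made-up sample mirroring the two-line link shape from the question.
$text = "Intro\n[Link]\n2017_07_11_RM_4b.pdf\n[Link]\nnotes.pdf\nBody text\n";

// ^\[Link\]\n     a line consisting of "[Link]" (m flag: ^ matches at each line start)
// [^\n]*\.pdf\n   the next line, which must end in ".pdf"
// (?: ... )++     one or more consecutive link pairs, matched possessively
$pattern = '/(?:^\[Link\]\n[^\n]*\.pdf\n)++/m';

$newtext = preg_replace($pattern, '', $text);
echo $newtext; // prints "Intro" and "Body text" on separate lines
```

The outer `++` is still safe here, since nothing after the group needs characters handed back.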
Micha Wiedenmann
  • So does this simply remove "two" lines, or how does the trailing '.pdf' get considered? – GDP Jan 22 '18 at 15:18
  • You are right, I missed the `.pdf`. Is that good enough, can you work from there on your own, or do you need further assistance? – Micha Wiedenmann Jan 22 '18 at 15:21
  • Going to try it now, just wanted that clarification before I started the hair pulling, thanks. Should be able to accept your answer shortly, assuming all goes well. – GDP Jan 22 '18 at 15:24

This may work:

/^\[Link\]\s*(\w+)\.pdf$/m

Here you're matching, in multiline mode, a line that begins with [Link] (the [ and ] are literal, hence the backslashes), followed by any whitespace \s* (in your case a newline), and then any number of letters, digits and underscores, with .pdf at the end of the line.

It's important to note that this creates a capturing group for your desired text, so in your preg_replace you should now do something like:

$newtext = preg_replace("/^\[Link\]\s*(\w+)\.pdf$/m", "$1", $text);
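
To make the effect of the replacement argument concrete, here is a small sketch on made-up sample text comparing `'$1'` with an empty replacement. The second pattern is an assumed variation, not part of the answer above: it drops the group and adds `\n?` so the trailing line break is removed too.

```php
<?php
// Made-up sample with one two-line link between two ordinary lines.
$text = "line1\n[Link]\n2017_07_11_RM_4b.pdf\nline2\n";

// Replacing with '$1' keeps the captured filename (without ".pdf").
$kept = preg_replace('/^\[Link\]\s*(\w+)\.pdf$/m', '$1', $text);

// Replacing with '' drops the whole pair; \n? also consumes the line break after it.
$removed = preg_replace('/^\[Link\]\s*\w+\.pdf$\n?/m', '', $text);

echo $kept;    // line1, 2017_07_11_RM_4b, line2 on separate lines
echo $removed; // line1, line2 on separate lines
```

Which of the two is wanted depends on whether the filename should survive in the output.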
Carlos Afonso
  • `(?:)` prevents the capturing group. – Micha Wiedenmann Jan 22 '18 at 15:25
  • I'm not preventing it purposely. The group is exactly what he needs in order to get the desired output. – Carlos Afonso Jan 22 '18 at 15:31
  • We don't need a group at all, and your regex doesn't match filenames with special characters. – Toto Jan 22 '18 at 15:54
  • If there are special characters in your filename you can simply replace `(\w+)` with `(.+?)` like this: `/^\[Link\]\s*(.+?)\.pdf$/m`. Also, the group is only used to ease capturing the desired text while removing all the undesired text. – Carlos Afonso Jan 22 '18 at 16:09

This should do it: \[Link\][\s\S]*?\.pdf\s

Demonstration: https://regex101.com/r/NCqWES/2/

Explanation:

  • [\s\S] - This matches any whitespace or non-whitespace character, which in turn means that we're matching all possible characters, including the possible line breaks and whitespace separating \[Link\] from \.pdf.

  • *? - This is a lazy quantifier: it matches as few characters as possible, so the match stops at the first occurrence of \.pdf after \[Link\].

  • Finally, I included a \s at the end to remove the remaining line break, but you could suppress it as well.
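
The difference the lazy quantifier makes can be sketched as follows, on made-up sample text with two links. With a greedy `*`, a single match runs from the first `[Link]` to the last `.pdf`, taking the text in between with it.

```php
<?php
// Made-up sample: two links with ordinary text between them.
$text = "A\n[Link]\none.pdf\nB\n[Link]\ntwo.pdf\nC\n";

// Lazy: each match ends at the nearest ".pdf", so "B" survives.
$lazy = preg_replace('/\[Link\][\s\S]*?\.pdf\s/', '', $text);

// Greedy: one match spans from the first "[Link]" to the last ".pdf", so "B" is lost.
$greedy = preg_replace('/\[Link\][\s\S]*\.pdf\s/', '', $text);

echo $lazy;   // A, B, C on separate lines
echo $greedy; // A, C on separate lines
```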

Update:

This may also work: \[Link\]\s\w+\.pdf\s, giving you a small performance gain.

Pedro Corso
  • This is the only answer that works (the original pattern), BUT only on regex101. In my PHP, I'm getting ` preg_replace(): Delimiter must not be alphanumeric or backslash in `. Tried changing to forward slashes, but only the text following the last match is returned – GDP Jan 22 '18 at 15:58
  • According to this answer: https://stackoverflow.com/a/7660574/6882194 you should do it like this: `/\[Link\][\s\S]*?\.pdf\s/`. Or, using the alternative regex: `/\[Link\]\s\w+\.pdf\s/`. – Pedro Corso Jan 22 '18 at 16:04
  • That did it! Thank you - one last clarification, I tried pdf|asx and it broke - (turns out there's an occasional asx link) Can this work for multiple extensions? If not i can work around it - Thank you again – GDP Jan 22 '18 at 16:10
  • You can add any number of filetypes to this regex. It should stay like this: `/\[Link\][\s\S]*?\.(pdf|asx)\s/` – Pedro Corso Jan 22 '18 at 16:13
  • Ah, forgot the () - that did it, THANKS again – GDP Jan 22 '18 at 16:16
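
Pulling the comment thread together, a minimal sketch of the delimited pattern with both extensions, run on made-up sample text. The group is written as non-capturing `(?:...)` here, which is an assumption of this sketch; the plain `(pdf|asx)` from the comment behaves the same when the replacement is empty.

```php
<?php
// Made-up sample with one .pdf link and one .asx link.
$text = "A\n[Link]\nfile.pdf\nB\n[Link]\nclip.asx\nC\n";

// The pattern needs delimiters in PHP (here "/"); the alternation must sit inside a group,
// otherwise "|" would split the whole pattern in two.
$pattern = '/\[Link\][\s\S]*?\.(?:pdf|asx)\s/';

$newtext = preg_replace($pattern, '', $text);
echo $newtext; // A, B, C on separate lines
```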
$str = <<<EOD
line1
[Link]
2017_07_11_RM_4b.pdf
line2
[Link]
2017_07_11_RM_4b.pdf
line3
EOD;
$newtext = preg_replace("/\[Link\]\R.+\.pdf\R/", "", $str);
echo $newtext,"\n";

Output:

line1
line2
line3

Explanation:

  \[Link\]  : literally [Link]
  \R        : any kind of linebreak
  .+        : 1 or more any character but newline
  \.        : a dot
  pdf       : literally pdf
  \R        : any kind of linebreak
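
The point of `\R` shows up when the text has Windows (`\r\n`) line endings, which PDF extraction tools may produce. A small sketch on made-up sample text, comparing it with a pattern that only accepts `\n`:

```php
<?php
// Made-up sample using CRLF line endings.
$text = "line1\r\n[Link]\r\n2017_07_11_RM_4b.pdf\r\nline2\r\n";

// \R matches "\r\n" as one unit, so the pair is removed.
$withR = preg_replace('/\[Link\]\R.+\.pdf\R/', '', $text);

// A literal \n does not match right after "[Link]" here, so nothing is removed.
$withN = preg_replace('/\[Link\]\n.+\.pdf\n/', '', $text);

var_dump($withR === "line1\r\nline2\r\n"); // bool(true)
var_dump($withN === $text);                // bool(true)
```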
Toto