I am looking for a regex solution for this problem. It can be a multiple step solution if this makes things easier. Important notice: The test string is just a snippet of a complete HTML DOM and only images should get addressed by this and any other URL should be left alone.

Here's an image:

<img 
src="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/image.jpg"
data-srcset="
 https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img1.jpg 507w,
 https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img2.jpg 780w,
 https://www.example.com/de/wp-content/uploads/sites/74/2017/03/img3.jpg 950w"
data-sizes="
 (min-width: 80em) calc(0.5 * (100vw - (100vw- 57em))),
 (min-width: 48em) calc(0.5 * (100vw - 5em)),
 calc(100vw - 1em)"
alt="image" class="lazyload">

As a oneliner:

<img src="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/image.jpg" data-srcset="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img1.jpg 507w, https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img2.jpg 780w, https://www.example.com/de/wp-content/uploads/sites/74/2017/03/img3.jpg 950w" data-sizes="(min-width: 80em) calc(0.5 * (100vw - (100vw- 57em))), (min-width: 48em) calc(0.5 * (100vw - 5em)), calc(100vw - 1em)" alt="image" class="lazyload">

The desired result is to get rid of the protocol, domain, and first directory, that is to say: everything in front of /wp-content. The language I am doing this in is PHP.

For the src part I have

 preg_replace("/(<img.*?src=\")(.*?)(\/wp-content.*?\")(.*>)/", '$1$3$4', $string);

The answer below is correct. Most HTML documents should load in the parser without problems. Do yourself a favor and keep your HTML as valid as possible; this is a good thing anyway. If you don't produce the HTML in question yourself, try to clean it up before you consume it.

For the data-srcset problem just parse that argument separately.
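A minimal sketch of what "parse that argument separately" could look like, assuming the HTML loads into DOMDocument. The helper names stripToWpContent() and rewriteSrcset() are made up for illustration, and splitting on commas assumes that no URL in the list contains a comma itself.

```php
<?php
// Hypothetical helper: drop everything in front of /wp-content/ in one URL.
// URLs without /wp-content/ are returned unchanged.
function stripToWpContent(string $url): string
{
    return preg_replace('~^.+?(?=/wp-content/)~', '', $url);
}

// Hypothetical helper: rewrite every candidate of a srcset-style value.
function rewriteSrcset(string $srcset): string
{
    $candidates = [];
    foreach (explode(',', $srcset) as $candidate) {
        $candidate = trim($candidate);
        if ($candidate === '') {
            continue;
        }
        // A candidate looks like "URL descriptor", e.g. ".../img1.jpg 507w".
        $parts = preg_split('~\s+~', $candidate, 2);
        $parts[0] = stripToWpContent($parts[0]);
        $candidates[] = implode(' ', $parts);
    }
    return implode(', ', $candidates);
}

// Shortened version of the image from the question.
$html = '<img src="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/image.jpg" '
      . 'data-srcset="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img1.jpg 507w, '
      . 'https://www.example.com/de/wp-content/uploads/sites/74/2017/03/img3.jpg 950w" '
      . 'alt="image" class="lazyload">';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Only touch <img> elements that actually carry a data-srcset attribute.
foreach ((new DOMXPath($dom))->query('//img[@data-srcset]') as $img) {
    $img->setAttribute('data-srcset', rewriteSrcset($img->getAttribute('data-srcset')));
}

$result = $dom->saveHTML();
echo $result;
```

The src attribute is deliberately left alone here; it can be handled with the loop from the answer below.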

Compare your DOM before and after thoroughly. The $dom->saveHTML() method drops the self-closing slash from tags that do not need to be closed. For example, <meta arg="yada"/> turns into <meta arg="yada"> (the closing slash is missing). Also see Are (non-void) self-closing tags valid in HTML5?
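A quick way to see this effect is to round-trip a small fragment and compare the strings. This is only a sketch; the fragment in $before is made up for illustration.

```php
<?php
// Made-up fragment with a self-closing void element.
$before = '<div><img src="a.png" alt="x"/></div>';

$dom = new DOMDocument();
$dom->loadHTML($before, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// saveHTML() appends a trailing newline, so trim before comparing.
$after = trim($dom->saveHTML());

// The serialized markup differs from the input even though
// no node or attribute was modified.
var_dump($before === $after);
echo $after, "\n";
```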

wloske

1 Answer

Don't. Use a parser to analyze the DOM and apply the regex on the DOM elements/attributes directly.

<?php

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);

$xpath = new DOMXPath($dom);
$images = $xpath->query("//img[contains(@src, 'wp-content')]");

$regex = '~^.+?(?=/wp-content/)~';
foreach($images as $img) {
    $img->setAttribute('src', 
        preg_replace($regex, 'https://anotherdomain.com', $img->getAttribute('src'))
    );
}

echo $dom->saveHTML();

It has been answered a dozen times why parsing HTML with regular expressions is not a good idea; one of the most popular answers is this: RegEx match open tags except XHTML self-contained tags.


However, if your HTML is not valid, you could use the following regex (in verbose mode):
(?:\G(?!\A)|<img)
(?s:.+?\bsrc=['"])\K
https?://.+?(?=/wp-content/)

See it working on regex101.com.
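In PHP the pattern above could be applied roughly as follows. Step 1 uses the expression as given, with the x flag for verbose mode, and only rewrites src. Step 2 is not part of the expression above; it is a sketch of a two-pass approach that first isolates each <img ...> tag and then strips the prefix from every absolute URL inside that tag, which also covers data-srcset. It assumes that no attribute value of the image contains a > character. The sample string and the extra link are made up to show that other URLs stay untouched.

```php
<?php
$html = '<img src="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/image.jpg" '
      . 'data-srcset="https://www.example.com/de/wp-content/uploads/sites/1/2017/03/img1.jpg 507w, '
      . 'https://www.example.com/de/wp-content/uploads/sites/74/2017/03/img3.jpg 950w" '
      . 'alt="image" class="lazyload"> '
      . '<a href="https://www.example.com/de/wp-content/uploads/file.pdf">file</a>';

// Step 1: the verbose pattern, src attribute only.
$srcPattern = '~
    (?:\G(?!\A)|<img)              # start at an img tag or continue after the last match
    (?s:.+?\bsrc=[\'"])\K          # skip ahead to the next src attribute, reset match start
    https?://.+?(?=/wp-content/)   # protocol, domain and first directory
~x';
$srcOnly = preg_replace($srcPattern, '', $html);

// Step 2 (assumption, two passes): limit the replacement to each img tag,
// then remove the prefix of every absolute URL found inside it.
$all = preg_replace_callback(
    '~<img\b[^>]*>~i',
    function (array $m): string {
        return preg_replace('~https?://[^\s"\',]+?(?=/wp-content/)~', '', $m[0]);
    },
    $html
);

echo $srcOnly, "\n", $all, "\n";
```

Because the second pass only ever sees the text of an image tag, the link in the sample keeps its full URL.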

Jan
  • Don't? Elaborate, please. – wloske Mar 30 '17 at 14:39
  • "Dozen times" ... for regular readers or those asking the right questions ;-) If I had stumbled upon this ... funny answer ;-) ... – wloske Mar 30 '17 at 14:45
  • @wloske: It is, indeed :) – Jan Mar 30 '17 at 14:50
  • Oh, did I mention your example does not work :-) Basically because my HTML is not valid and I also can not guarantee it. – wloske Mar 30 '17 at 14:51
  • @wloske: Then see the updated answer at the bottom. Of course, it is possible to analyze HTML strings via regex though not advisable. – Jan Mar 30 '17 at 14:59
  • Thanks, @Jan, for this approach but it still does not tackle the main problem: The data-srcset part with an unknown number of absolute URLs separated by a comma. – wloske Mar 31 '17 at 08:37