Is there a way to strip newlines only from URLs while preserving others?

Question

I am trying to strip newlines only from the URLs in a text body while preserving all other newlines. The other newlines will be converted into <p> tags later on, so it is important to keep them as they are.

Given this text example with multiple tags with urls (this should be all on one line, line breaks are only for readability purposes):

"Lorem ipsum dolor <a href=\"http://www.example.org/page?\nid=161&_te=mj\">sit</a> amet,
consectetur adipiscing elit. \nDuis facilisis eros at sem faucibus finibus. Integer tempus
lectus sed gravida efficitur. Proin dignissim pretium arcu, accumsan gravida ex tincidunt
eget. Maecenas ac finibus elit. Maecenas aliquam fermentum nisl quis egestas. 
<a href=\"http://www.example.org/page?id=341\n&_te=mp\">Nulla placerat est vitae convallis</a> 
euismod. Praesent id elit a ligula hendrerit lacinia."

I have found a way to isolate one of the links in the text by using this pattern:

\bhttps?:\/\/[^<>]+(?:\([\w\d]+\)|[^,[:punct:]\s]|\/)

However, I am totally lost on how to then strip the newline out of that substring, or how to match only the newline within the URL pattern. Ultimately this is going into some PHP code and will need to replace all cases of this within the string, so if there are better PHP utility functions that will do this, I'm all ears!

What I am going for is this (only \n in urls are removed, all others preserved):

"Lorem ipsum dolor <a href=\"http://www.example.org/page?id=161&_te=mj\">sit</a> amet,
consectetur adipiscing elit. \nDuis facilisis eros at sem faucibus finibus. Integer tempus
lectus sed gravida efficitur. Proin dignissim pretium arcu, accumsan gravida ex tincidunt
eget. Maecenas ac finibus elit. Maecenas aliquam fermentum nisl quis egestas. 
<a href=\"http://www.example.org/page?id=341&_te=mp\">Nulla placerat est vitae convallis</a> 
euismod. Praesent id elit a ligula hendrerit lacinia."

This is a very specific question and does not match any of the duplicates associated with it. I am dealing with a legacy system and cannot solve this using DOM parsing. Please answer this question using regex.

It is quite common knowledge that HTML cannot be parsed with regex; see for example [this answer](https://stackoverflow.com/a/1732454/4165552). Instead of shouting "I need help, regex only", take the advice of people who are experienced in this field and use an XML parser. If you have difficulty integrating an XML parser into your legacy code, ask another question about that specific problem. — pptaszni, Sep 09 '20 at 07:36
I am fixing a small bug in a system that is being completely rewritten, so I don't want to rewrite everything using an xml parser when it will all be scrapped really soon. All I'm looking for is an answer to this specific question using regex, which is unfortunately the way it is built. Probably why we have this bug in the first place. I know that this is not the best way to do this; but, given this context can anyone actually answer this question rather than lecturing me on the way it was originally built? — user9169828, Sep 10 '20 at 18:38

Casimir et Hippolyte · Answer 1 · 2020-09-05T19:54:58.387

Don't try to parse HTML with regex; at best you will obtain something that works until it no longer works! HTML is too complicated a language, full of traps, even if it looks simple at first glance.

When you want to edit code, whatever the language, the first reflex should be to look for a parser for that language. PHP comes with a DOM parser for XML and HTML.

With it you can follow these simple steps:

extract the href attribute
replace the newlines in it
set the href attribute with the result.

Since you work on a partial HTML document (*), you have to wrap it in a full HTML document structure to avoid automatic corrections from the parser (including the addition of a root element). At the end you need to remove this wrapper by extracting the child nodes of the body.

$str = "Lorem ipsum dolor <a href=\"http://www.example.org/page?\nid=161&_te=mj\">sit</a> amet,
consectetur adipiscing elit. \nDuis facilisis eros at sem faucibus finibus. Integer tempus
lectus sed gravida efficitur. Proin dignissim pretium arcu, accumsan gravida ex tincidunt
eget. Maecenas ac finibus elit. Maecenas aliquam fermentum nisl quis egestas. 
<a href=\"http://www.example.org/page?\nid=341&_te=mp\">Nulla placerat est vitae convallis</a> 
euismod. Praesent id elit a ligula hendrerit lacinia.";

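// Collect libxml warnings internally instead of emitting them as PHP warnings
libxml_use_internal_errors(true);

// Wrap the fragment in a full document; the meta tag tells the parser to read it as UTF-8
$wrapper = '<html><head><meta charset="utf-8"/></head><body>%s</body></html>';
$html = sprintf($wrapper, $str);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Remove newlines from the href attribute of each link only
foreach ($dom->getElementsByTagName('a') as $aElt) {
    $href = $aElt->getAttribute('href');
    $href = str_replace("\n", '', $href);
    $aElt->setAttribute('href', $href);
}

$result = '';

$bodyElt = $dom->getElementsByTagName('body')->item(0);

// Serialize only the children of <body> to drop the wrapper again
foreach ($bodyElt->childNodes as $node) {
    $result .= $dom->saveHTML($node);
}

echo $result;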

(*) If you work with a full HTML document, you don't need a wrapper and you can get the result directly using the DOMDocument::saveHTML() method without a parameter.
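For the full-document case mentioned in the note above, a minimal sketch might look like the following. The `$html` document here is a made-up example used only for illustration; the rest reuses the same DOM calls as the code above.

```php
<?php
// Hypothetical full HTML document (assumption: not taken from the question).
$html = "<!DOCTYPE html>\n"
      . "<html><head><meta charset=\"utf-8\"><title>Demo</title></head>"
      . "<body><p>Lorem <a href=\"http://www.example.org/page?\nid=161&amp;_te=mj\">ipsum</a></p></body></html>";

// Collect libxml warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Strip newlines from each href attribute value.
foreach ($dom->getElementsByTagName('a') as $aElt) {
    $aElt->setAttribute('href', str_replace("\n", '', $aElt->getAttribute('href')));
}

// No wrapper to remove: serialize the whole document in one call.
$result = $dom->saveHTML();
```

Because the input is already a complete document, there is no body-extraction loop; `saveHTML()` without an argument returns the whole serialized document.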

I am working with legacy code and do not have the ability to refactor at this time. Please answer the question using regex as requested. — user9169828, Sep 05 '20 at 22:05
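For the regex-only route requested in the comment above, one possible sketch uses `preg_replace_callback` to find each `href` attribute and strip line breaks only inside the match. It assumes every URL to clean sits in a double-quoted `href="..."` attribute with no embedded double quotes; the shortened `$str` and the pattern are illustrative, not from the original posts, and the approach carries all the fragility the answer warns about.

```php
<?php
// Shortened sample text (assumption: URLs only appear in double-quoted href attributes).
$str = "Lorem <a href=\"http://www.example.org/page?\nid=161&_te=mj\">sit</a> amet.\n"
     . "Duis <a href=\"http://www.example.org/page?id=341\n&_te=mp\">Nulla</a> euismod.";

// Match each href="..." attribute and remove CR/LF inside the matched text only.
$result = preg_replace_callback(
    '/\bhref="[^"]*"/i',
    function (array $m): string {
        return str_replace(["\r", "\n"], '', $m[0]);
    },
    $str
);

echo $result;
```

Newlines outside the matched attributes, such as the one after "amet.", are left untouched, so they can still be converted to <p> tags later.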
