Regex: lazy match left

Question

I have the following example:

        <strong><span style="text-decoration: underline;">LAbel<br>
    </span></strong>
<span style="color: #1f497d;">Label:</span>&nbsp;
[[<span href="#" style="background: red; color: white;" field-id="97c97578-ac1b-4495-a3a7-85e75d0acf40"> some text ... </span>]]&nbsp;
[[<span href="#" style="background: red; color: white;" field-id="db983948-6458-4be8-9044-174093d39976"> some other text ... </span>]]<br>

I need to find and replace a snippet like:

[[<span somestyle_and_attributes field-id="some GUID"> some random text </span>]]

In my example I want to find and replace this:

[[<span href="#" style="background: red; color: white;" field-id="db983948-6458-4be8-9044-174093d39976"> some other text ... </span>]]

My pattern is:

\[\[<span .+? field-id="db983948-6458-4be8-9044-174093d39976">.+?</span>\]\]

I want the regex to anchor on the field-id GUID and extend back only as far as the closest `[[<span` on the left. Instead, the match starts at the first `[[<span` and so also includes the preceding span tag.

I could spell out everything inside the opening span tag in the pattern (styles, attributes, etc.), but I feel there is a much simpler way to find the closest match on the left.
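
The behaviour described above can be reproduced with a small sketch. It assumes Python's `re` module (the thread does not say which regex engine is in use) and shortened, made-up field-id values `aaa` and `bbb`, with both placeholders on one line. A lazy quantifier only minimises how far the match extends to the right; the regex engine still starts at the leftmost position where any match is possible.

```python
import re

# Two placeholders on a single line. The field-id values are shortened,
# hypothetical stand-ins for the real GUIDs.
html = (
    '[[<span href="#" field-id="aaa"> first </span>]]&nbsp;'
    '[[<span href="#" field-id="bbb"> second </span>]]'
)

# Pattern shape from the question: ".+?" is lazy, but the match still
# begins at the first "[[<span " and grows until field-id="bbb" is found.
lazy = re.compile(r'\[\[<span .+? field-id="bbb">.+?</span>\]\]')

# Contrast: "[^>]*" cannot cross the ">" that closes an opening tag,
# so the match is forced to start at the span that owns the field-id.
tight = re.compile(r'\[\[<span [^>]* field-id="bbb">.+?</span>\]\]')

lazy_match = lazy.search(html).group(0)
tight_match = tight.search(html).group(0)

print(lazy_match == html)  # the lazy pattern swallowed both placeholders
print(tight_match)
```

The negated class works here only because the opening span tag contains no `>` inside its attribute values, which is an assumption about the editor's output.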

[can-regular-expressions-parse-html-or-not](https://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/) — Mong Zhu, Feb 01 '18 at 13:10
Wouldn't it be much easier to parse the HTML as an XML file and just match on the attribute field-id? — RMH, Feb 01 '18 at 13:19
No, it would not. It is content from a WYSIWYG editor. Users add variables like in the example ([[lalala guid span]]) so they can see them visually. Later, in the backend, I want to replace them with real values. Regex is more than enough. — DolceVita, Feb 01 '18 at 13:21
[RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) Different question, but a good accepted answer in general when talking about Regex + HTML — grek40, Feb 01 '18 at 13:37

jalsh · Accepted Answer · 2018-02-01T13:26:46.553

-1

You could try something like:

\[\[<span [^>]* field-id="db983948-6458-4be8-9044-174093d39976">.*</span>\]\]

Update: Thanks to @juharr and @Mong Zhu, who noted that one shouldn't parse HTML DOM trees using regex: https://www.johndcook.com/blog/2013/02/21/can-regular-expressions-parse-html-or-not/
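
As a rough sketch of how the replacement step could look, assuming Python's `re` module and a hypothetical helper named `replace_field` (neither appears in the thread): the opening tag is matched with `[^>]*`, and the span body with `[^<]*`, the variant mentioned in the comments below. The body class relies on the placeholder span never containing nested tags.

```python
import re

def replace_field(html: str, field_id: str, value: str) -> str:
    """Replace the [[<span ... field-id="...">...</span>]] placeholder for one field.

    Hypothetical helper for illustration; assumes the span has no nested tags.
    """
    pattern = re.compile(
        r'\[\[<span [^>]* field-id="'
        + re.escape(field_id)          # treat the GUID as literal text
        + r'">[^<]*</span>\]\]'        # body may not contain "<", so it stops at </span>
    )
    # A function replacement inserts `value` as-is, without interpreting
    # backslashes or group references inside it.
    return pattern.sub(lambda _m: value, html)

html = (
    '[[<span href="#" field-id="97c97578-ac1b-4495-a3a7-85e75d0acf40"> some text ... </span>]]&nbsp;'
    '[[<span href="#" field-id="db983948-6458-4be8-9044-174093d39976"> some other text ... </span>]]<br>'
)

result = replace_field(html, "db983948-6458-4be8-9044-174093d39976", "REAL VALUE")
print(result)
```

With a greedy `.*` for the span body, two placeholders on the same line could be merged into one match, because `.*` runs to the last `</span>]]` it can reach; `[^<]*` avoids that as long as the body contains no markup.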

edited Feb 01 '18 at 13:26

answered Feb 01 '18 at 13:17


That's not going to work if the span has any nested nodes. Not sure if that's something the OP needs or not, but it should be pointed out. – juharr Feb 01 '18 at 13:20
@juharr Thanks!... Updated. Please advise whether this would produce the required values or not – jalsh Feb 01 '18 at 13:22
This is so simple and exactly what I wanted. This will work for me, because the initial span cannot have any nested nodes. It always stays the same and only the GUID can be different. So thanks, I'll mark this as the solution for me. – DolceVita Feb 01 '18 at 13:23
That's the fundamental issue with regex and html. Regex doesn't easily handle an undefined amount of nesting. Imagine `[[]]` – juharr Feb 01 '18 at 13:23
@DolceVita In that case you want the original version with `[^<]*` instead of the edited version. – juharr Feb 01 '18 at 13:25
@juharr You're absolutely right, HTML is a tree; it shouldn't be parsed using regex – jalsh Feb 01 '18 at 13:25
Don't make it more complicated than it is :). If I had any nested nodes, I would have mentioned it. My code is always like in the post and only the GUID can change. It is WYSIWYG editor content, and the span is always inserted by JavaScript and cannot be changed. – DolceVita Feb 01 '18 at 13:26
@DolceVita When asking questions about Regex you should mention everything that limits the various possibilities so that a solution can be as simple as possible. So mentioning that the spans will not have nested nodes under them would be useful information. We cannot assume the absence of information to mean anything. – juharr Feb 01 '18 at 13:30
