Regex match substring ignoring occurrences inside HTML

Question

I need to remove a code of digits preceded by an underscore from strings that may or may not be contained in an HTML tag, which in turn may or may not contain the same substring.

Example: remove _1234 from the following strings:

this is my string_1234

<a href="link_1234">this is my html nested string_1234</a>

I just do:

$regex = '#\_(\d+)$#'; 
$name = preg_replace($regex, '', $name);

but this also removes the part inside the href, so I would like to exclude any occurrence that may appear inside the HTML tag.

EDIT: One thing I can be sure of: the HTML tag, if present, will always be a link. Is there a way to make the regex ignore anything inside <a ... > and </a>?

Does this answer your question? [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Justinas, Sep 28 '22 at 13:10
How do you generate that HTML? If `$name` is later used in HTML, then sanitize only `$name` and not HTML itself (have 2 variables for presentation and for link) — Justinas, Sep 28 '22 at 13:13
If you want something simple for what I think you're after, you could use a negative lookahead and lookbehind on quotes? — jhylands, Sep 28 '22 at 13:25
It wouldn't work for class names, where you might have class="x_123 b_123", but it would work for the other cases? — jhylands, Sep 28 '22 at 13:25
@Justinas I'm not really good with regexes, so sorry, but I can't really follow the discussion in the answer you suggested :( Unfortunately I don't have 2 variables, I just have 1 result that is sometimes wrapped in an HTML link and sometimes not. And I only want to edit the text, not the HTML that may contain it. — bluantinoo, Sep 28 '22 at 13:35
@jhylands I'm not sure I understand how I would use the negative lookbehind in this case. It's OK not to match class names. I really want to skip the whole HTML tag. In my (simple) mind, I would just ignore everything between the < and > chars... but I don't really know how to — bluantinoo, Sep 28 '22 at 13:35
PCRE verbs could be used for this. Do you mean `\_(\d+)$` should only match if not inside links, or only not match if in the link attributes? Also, `_` is not a special char, so it doesn't need to be escaped. HTML regexes aren't 100% reliable either, so you will likely run into some issues, but you can get close. — user3783243, Sep 28 '22 at 16:09
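
The PCRE-verb idea from the last comment could look roughly like the sketch below. It is not code from the thread. The function name `strip_code_outside_tags` is hypothetical, and the pattern assumes that a tag is simply everything from `<` to the next `>`, so an attribute value containing a literal `>` would break it. It also uses `\b` instead of `$`, assuming the code should be removed even when a closing `</a>` follows it.

```php
<?php
// Minimal sketch, assuming a tag never contains a literal '>' in an attribute.
function strip_code_outside_tags(string $name): string
{
    // First alternative: match a whole tag, then (*SKIP)(*FAIL) throws that
    // match away and makes the engine resume scanning right after the tag.
    // Second alternative: "_digits" not followed by another word character.
    $regex = '/<[^>]*>(*SKIP)(*FAIL)|_\d+\b/';

    // preg_replace() returns null on error; fall back to the input unchanged.
    return preg_replace($regex, '', $name) ?? $name;
}

echo strip_code_outside_tags('this is my string_1234'), "\n";
// this is my string
echo strip_code_outside_tags('<a href="link_1234">this is my html nested string_1234</a>'), "\n";
// <a href="link_1234">this is my html nested string</a>
```

Because the tag alternative comes first, anything inside `<...>` (href, class names, and so on) is consumed and discarded before the `_\d+` alternative ever gets a chance to match there.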

score -1 · Answer 1 · answered Sep 28 '22 at 13:33

Perhaps not exactly what you're after, but this works based on quotes. It uses a negative lookahead for a quote (or a digit, which stops it from simply matching a shorter run of digits inside the href) and a negative lookbehind along the same lines. The word part is still matched and stored in group 1, which can then be used as the replacement value.

// preg_replace() needs pattern delimiters; without them PHP treats the outer ( ) as delimiters
$regex = '/(?<!"|\w)(\w+)_\d+(?!"|\d)/';
$name = preg_replace($regex, '$1', $name);

https://regex101.com/r/2agOwr/1

jhylands

Tried this on regex101, but it does not seem to work on a real example. I've added a 3rd string, have a look: https://regex101.com/r/7Tt9fJ/1 It does not match what I really need to remove, _43223, but it matches the class name inside the HTML tag – bluantinoo Sep 28 '22 at 13:44
Does not work if using single quotes _or_ if the text contains quotes, like `this is my "html nested string_1234"` – Justinas Sep 28 '22 at 13:44
Fair enough. I don't think a single regex can do this. There is probably a way to reason about that using the fact that a finite-state automaton can't match brackets. Maybe a two-tier system would work, where you break the string up in code and only run the regex on the inside of tags. – jhylands Sep 28 '22 at 16:17
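
The two-tier idea from the last comment might be sketched as below: split the string into tags and text in code, then apply the asker's regex only to the text pieces. This is an illustration rather than a tested solution from the thread. The function name `strip_code_from_text` is hypothetical, and a "tag" is again assumed to be anything from `<` to the next `>`.

```php
<?php
// Minimal sketch, assuming tags look like <...> with no '>' inside attributes.
function strip_code_from_text(string $name): string
{
    // PREG_SPLIT_DELIM_CAPTURE keeps the captured tags in the result, so the
    // array alternates: text, tag, text, tag, ..., text (text may be empty).
    $parts = preg_split('/(<[^>]*>)/', $name, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $name; // regex error: leave the input untouched
    }

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // odd indexes hold the tags; never edit those
        }
        // Even indexes hold plain text: drop a trailing "_digits" code.
        $parts[$i] = preg_replace('/_\d+$/', '', $part) ?? $part;
    }

    return implode('', $parts);
}

echo strip_code_from_text('this is my string_1234'), "\n";
// this is my string
echo strip_code_from_text('<a href="link_1234">this is my html nested string_1234</a>'), "\n";
// <a href="link_1234">this is my html nested string</a>
```

Splitting first keeps the replacement regex as simple as the one in the question, at the cost of a loop. For HTML more complex than a single link, a real parser such as PHP's DOM extension would be the safer choice.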
