Add exceptions to complex regular expression (lookahead and lookbehind utilized)

Question

I'd like some help with regular expressions because I'm not really familiar with them. So far, I have created the following regex:

/\b(?<![\#\-\/\>])literal(?![\<\'\"])\b/i

As https://regex101.com/ states:

\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)

Negative Lookbehind (?<![\#\-\/\>])

Assert that the Regex below does not match

Match a single character present in the list below [\#\-\/\>]

\# matches the character # literally (case insensitive)

\- matches the character - literally (case insensitive)

\/ matches the character / literally (case insensitive)

\> matches the character > literally (case insensitive)

literal matches the characters literal literally (case insensitive)

Negative Lookahead (?![\<\'\"])

Assert that the Regex below does not match

Match a single character present in the list below [\<\'\"]

\< matches the character < literally (case insensitive)

\' matches the character ' literally (case insensitive)

\" matches the character " literally (case insensitive)

\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)

Global pattern flags

i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])

I want to add two exceptions to this matching rule: 1) If the ">" is preceded by "p", as in a <p> opening tag, the literal should still match. 2) The literal should also match when "<" is followed by "/p", as in a </p> closing tag. How can I achieve this?

Example: only the bold ones should match.

<p>
    **Literal** in computer science is a
    <a href='http://www.google.com/something/literal#literal'>literal</a>
    for representing a fixed value in source code. Almost all programming 
    <a href='http://www.google.com/something/else-literal#literal'>languages</a>
    have notations for atomic values such as integers, floating-point 
    numbers, and strings, and usually for booleans and characters; some
    also have notations for elements of enumerated types and compound
    values such as arrays, records, and objects. An anonymous function
    is a **literal** for the function type which is **LITERAL**
</p>

I know I have over-complicated things, but the situation itself is complicated and I think I have no other way.

Can you give an example of input and output of what you're trying to do with it? And what programming language are you using the regex with? — 4castle, Oct 02 '16 at 15:55
@4castle I have added an example. Would you mind editing it again as before? No clue how to add actual html. — dpesios, Oct 02 '16 at 16:42
What programming language is this in? It looks like you need an HTML parser, and not a regex. Please read about the [XY Problem](http://mywiki.wooledge.org/XyProblem). — 4castle, Oct 02 '16 at 16:48
They never learn, no matter how hard we try, they keep coming back again and again. Please, do not parse HTML with REGEX use an HTML parser: http://stackoverflow.com/a/1732454/460557 — Jorge Campos, Oct 02 '16 at 16:53
It is Ruby. It is stored in a text field in a PostgreSQL db. I don't think I need an HTML parser. The stored text is of "<p>..text...</p>" form. I'm doing matching to substitute certain keyphrases with their corresponding links, but re-matching happens and this is what I'm trying to solve. I know I'm not quite clear, but as I said the situation is complicated. If I get the right regex I will find my way round. — dpesios, Oct 02 '16 at 16:53
@JorgeCampos That answer is misconstrued way too much. There are many cases where isolated HTML can be parsed with regex. The info the OP just provided means that it is feasible to do with regex. — 4castle, Oct 02 '16 at 16:54
If this simplifies things there can be only two kind of tags in the text. The p tag and the a tag. — dpesios, Oct 02 '16 at 16:59
Can you remove `<p>` and `</p>` from the string before you do the matching? And then add them back later? — 4castle, Oct 02 '16 at 17:04
There is a saying in my country: "if you want to crack a nut do not use a sledgehammer, a simple nutcracker is enough". @4castle There are many <p>...text...</p> blocks stored one after the other. I gave it some thought too and don't think it is enough. Sorry... — dpesios, Oct 02 '16 at 17:10

4castle · Accepted Answer · 2016-10-02T17:33:16.580

0

If the text you're searching is just text mixed with some <a> tags, then you can simplify the < and > parts of the lookarounds, and give a specific string that it shouldn't be followed by: </a>.

/\b(?<![-#\/])literal(?!<\/a>)\b/i

Regex101 Demo
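
Since the question mentions Ruby, here is a minimal sketch of how the pattern above could be checked against a shortened version of the example text. The `link_keyword` helper and the `example.com` URL are hypothetical, added only to illustrate the "re-matching" concern from the comments; they are not part of the original code.

```ruby
# Pattern from the answer: skip "literal" when it directly follows -, # or /
# (i.e. inside a URL) or when it is directly followed by </a> (already linked).
KEYWORD = /\b(?<![-#\/])literal(?!<\/a>)\b/i

# Shortened sample modelled on the question's example.
text = <<~HTML
  <p>
      Literal in computer science is a
      <a href='http://www.google.com/something/literal#literal'>literal</a>
      for representing a fixed value in source code. Almost all programming
      <a href='http://www.google.com/something/else-literal#literal'>languages</a>
      have notations for atomic values. An anonymous function
      is a literal for the function type which is LITERAL
  </p>
HTML

# Only the three occurrences in plain text should be found.
matches = text.scan(KEYWORD)
puts matches.inspect # => ["Literal", "literal", "LITERAL"]

# Hypothetical helper: wrap each unlinked occurrence in a link.
def link_keyword(html, url)
  html.gsub(KEYWORD) { |word| "<a href='#{url}'>#{word}</a>" }
end

# Running the substitution twice should change nothing the second time,
# because the new links are protected by the same lookarounds.
once  = link_keyword(text, "http://www.example.com/literal")
twice = link_keyword(once, "http://www.example.com/literal")
puts once == twice # => true
```

Note that this relies on the text containing only `<p>` and `<a>` tags, as stated in the comments; other markup around the keyword would need its own handling.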

edited Oct 02 '16 at 17:33

answered Oct 02 '16 at 17:17

4castle


Thanks! Good approach, did not think of it. – dpesios Oct 02 '16 at 17:36
