I'd like some help with regular expressions, as I'm not very familiar with them. So far, I have created the following regex:

/\b(?<![\#\-\/\>])literal(?![\<\'\"])\b/i

As https://regex101.com/ states:

\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)

Negative Lookbehind (?<![\#\-\/\>])

Assert that the Regex below does not match

Match a single character present in the list below [\#\-\/\>]

\# matches the character # literally (case insensitive)

\- matches the character - literally (case insensitive)

\/ matches the character / literally (case insensitive)

\> matches the character > literally (case insensitive)

literal matches the characters literal literally (case insensitive)

Negative Lookahead (?![\<\'\"])

Assert that the Regex below does not match

Match a single character present in the list below [\<\'\"]

\< matches the character < literally (case insensitive)

\' matches the character ' literally (case insensitive)

\" matches the character " literally (case insensitive)

\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)

Global pattern flags

i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])

I want to add two exceptions to this matching rule: 1) If the ">" is preceded by "p", i.e. it belongs to an opening <p> tag, the literal should still match. 2) If the "<" is followed by "/p", i.e. it belongs to a closing </p> tag, the literal should also match. How can I achieve this?

Example: only the bold ones should match.

<p>
    **Literal** in computer science is a
    <a href='http://www.google.com/something/literal#literal'>literal</a>
    for representing a fixed value in source code. Almost all programming 
    <a href='http://www.google.com/something/else-literal#literal'>languages</a>
    have notations for atomic values such as integers, floating-point 
    numbers, and strings, and usually for booleans and characters; some
    also have notations for elements of enumerated types and compound
    values such as arrays, records, and objects. An anonymous function
    is a **literal** for the function type which is **LITERAL**
</p>

I know I have over-complicated things, but the situation itself is complicated and I don't see another way.

4castle
dpesios
  • 2
    Can you give an example of input and output of what you're trying to do with it? And what programming language are you using the regex with? – 4castle Oct 02 '16 at 15:55
  • @4castle I have added an example. Would you mind editing it again as before? No clue how to add actual html. – dpesios Oct 02 '16 at 16:42
  • 2
    What programming language is this in? It looks like you need an HTML parser, and not a regex. Please read about the [XY Problem](http://mywiki.wooledge.org/XyProblem). – 4castle Oct 02 '16 at 16:48
  • They never learn; no matter how hard we try, they keep coming back again and again. Please, do not parse HTML with regex; use an HTML parser: http://stackoverflow.com/a/1732454/460557 – Jorge Campos Oct 02 '16 at 16:53
  • It is Ruby. The text is stored in a text field in a PostgreSQL DB. I don't think I need an HTML parser. The stored text has the form "<p>..text...</p>". I'm matching in order to substitute certain keyphrases with their corresponding links, but re-matching happens, and this is what I'm trying to solve. I know I'm not being quite clear, but as I said, the situation is complicated. If I get the right regex, I will find my way around.
    – dpesios Oct 02 '16 at 16:53
  • @JorgeCampos That answer is misconstrued way too much. There are many cases where isolated HTML can be parsed with regex. The info the OP just provided means that it is feasible to do with regex. – 4castle Oct 02 '16 at 16:54
  • 1
    If this simplifies things: there can be only two kinds of tags in the text, the p tag and the a tag. – dpesios Oct 02 '16 at 16:59
  • Can you remove `<p>` and `</p>` from the string before you do the matching? And then add them back later?
    – 4castle Oct 02 '16 at 17:04
  • There is a saying in my country: "if you want to crack a nut, do not use a sledgehammer; a simple nutcracker is enough." @4castle There are many <p>...text...</p> blocks stored one after the other. I gave that some thought too, and I don't think it is enough. Sorry...
    – dpesios Oct 02 '16 at 17:10

1 Answer

If the text you're searching is just text mixed with some <a> tags, then you can simplify the < and > parts of the lookarounds, and give a specific string that it shouldn't be followed by: </a>.

/\b(?<![-#\/])literal(?!<\/a>)\b/i

Regex101 Demo
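
Since the question mentions Ruby, here is a minimal sketch of how the pattern above might be used there. The sample string is shortened from the question's example, and `link_url` is a hypothetical placeholder, since the real keyphrase-to-link mapping is not shown. It also assumes, as stated in the comments, that the text contains only `<p>` and `<a>` tags.

```ruby
# Pattern from the answer: reject "literal" when it directly follows -, # or /
# (i.e. inside a URL), or when it is directly followed by a closing </a> tag.
pattern = /\b(?<![-#\/])literal(?!<\/a>)\b/i

# Hypothetical sample, shortened from the question's example.
html = "<p>Literal is a " \
       "<a href='http://www.google.com/something/else-literal#literal'>literal</a> " \
       "and a LITERAL</p>"

# Only the occurrences outside the existing link are found.
matches = html.scan(pattern)
puts matches.inspect  # => ["Literal", "LITERAL"]

# Hypothetical target URL (an assumption, not taken from the question).
link_url = "https://example.com/literal"

# Wrap each match in a link, then run the same substitution a second time
# to check that already-linked text is not matched again.
linked = html.gsub(pattern) { |m| "<a href='#{link_url}'>#{m}</a>" }
relinked = linked.gsub(pattern) { |m| "<a href='#{link_url}'>#{m}</a>" }
puts linked == relinked
```

The second `gsub` is there because of the re-matching problem described in the comments: the inserted link text is followed by `</a>` and the keyword in the URL is preceded by `/`, so neither should match again. That only holds if the link URL itself keeps the keyword behind `-`, `#` or `/`.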

4castle