-2

I want to find all " quotes (just the character) in a text and replace them with &quot;. Some texts also contain a-tags. To keep the a-tag functional, I don't want to replace the " inside the opening a-tag.

I tried the following, but it also matches those in the tag:

(?!<a.*(").*>)"

https://regex101.com/r/eyEF5K/2/

  • 3
    Not my downvote, but *definitely* a good example of exactly why regular expressions are not the right tool for this. Maybe see also https://meta.stackoverflow.com/questions/261561/please-stop-linking-to-the-zalgo-anti-cthulhu-regex-rant – tripleee Jul 06 '20 at 20:30
  • Just move past the tag and match the quote to replace it with the entity. I'd match all entities, though. Which regex engine is being used? The regex is fairly easy. If it's PCRE, use skip/fail for the tags. If it's not PCRE, just match both the tags and the entities you want to substitute, then decide in a callback which one matched, etc. –  Jul 06 '20 at 21:21

2 Answers

0

Before anyone decides to implement this in production, look at this post. HTML and regex don't mix well, so please do not use this answer unless it's a quick hack that you're trying to do.

To replace all instances of " except for those inside the <a> tag, you can use the following. Of course, this assumes that the character > never appears inside the tag (for example, <a param='>' href=""> breaks this).

It also depends on your regex engine. This works in PCRE (among others), but you didn't specify a language, so I'm assuming anything goes.

See regex in use here

<a[^>]*>(*SKIP)(*FAIL)|"

It works as follows:

  • Match either of the following options
    • <a[^>]*>(*SKIP)(*FAIL) match the following
      • <a match this literally
      • [^>]* match any character except > any number of times
      • > match this character literally
      • (*SKIP)(*FAIL) magic - see this post for more info. Basically allows you to consume the characters, but then exclude them from the match.
    • " match this literally

We're effectively matching all " but skipping all the <a ... > tags in our matching pattern.
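
For engines without (*SKIP)(*FAIL), the same idea can be approximated with a capture group and a replacement callback. The following is a minimal sketch in Python's standard re module, which does not support these verbs. The function name is hypothetical, and &quot; as the replacement text is an assumption about what the question wants:

```python
import re

# Group 1 captures an opening <a ...> tag; the other alternative is a bare quote.
# As with the pattern above, this assumes > never appears inside the tag.
PATTERN = re.compile(r'(<a[^>]*>)|"')


def escape_quotes_outside_a_tags(text: str) -> str:
    """Replace every " with &quot; except those inside an opening <a ...> tag."""

    def replace(match: re.Match) -> str:
        # If the tag alternative matched, put the tag back unchanged.
        if match.group(1) is not None:
            return match.group(1)
        # Otherwise the match is a lone quote outside any <a ...> tag.
        return "&quot;"

    return PATTERN.sub(replace, text)


sample = 'He said "hi" and linked <a href="https://example.com">here</a>.'
print(escape_quotes_outside_a_tags(sample))
# He said &quot;hi&quot; and linked <a href="https://example.com">here</a>.
```

Note that <a[^>]*> also matches other tags whose names start with a, such as <abbr>, both here and in the PCRE pattern.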

ctwheels
0

This is PCRE-specific. Each match is one of the characters [<>"&']
that does not occur inside any tag or invisible content such as scripts.
It bypasses all tags using the (*SKIP)(*FAIL) verb combination.

Change the class to ["] if you only want the double quote.

(?:<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>(*SKIP)(*FAIL)|[<>"&'])

See the example here: https://regex101.com/r/OPsM1K/1

On non-PCRE engines, the regex is altered by deleting the verbs and capturing (matching) both
tags and entities in separate groups.
This is a passive way of bypassing tags while still matching entities.
It requires a search or replace with callback capability to determine which group
matched and act accordingly.
(That regex is not shown; if needed, I'll include it.)
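
To illustrate the callback approach, here is a rough Python sketch. It is not the altered regex mentioned above: the tag pattern <[^>]*> is a deliberately simplified stand-in for the full one, and the names TAG_OR_CHAR, ENTITIES and escape_outside_tags, as well as the choice of entity for each character, are assumptions made for the example.

```python
import re

# Group 1: a simplified tag (any <...> run with no > inside).
# Group 2: one of the characters to convert when it occurs outside a tag.
TAG_OR_CHAR = re.compile(r"""(<[^>]*>)|([<>"&'])""")

# Assumed mapping from each character to its entity.
ENTITIES = {"<": "&lt;", ">": "&gt;", '"': "&quot;", "&": "&amp;", "'": "&#39;"}


def escape_outside_tags(text: str) -> str:
    """Convert special characters to entities, leaving tags untouched."""

    def replace(match: re.Match) -> str:
        # Group 1 matched: this is a tag, so return it unchanged.
        if match.group(1) is not None:
            return match.group(1)
        # Group 2 matched: a character outside any tag.
        return ENTITIES[match.group(2)]

    return TAG_OR_CHAR.sub(replace, text)


print(escape_outside_tags('<p class="x">Tom & "Jerry"</p>'))
# <p class="x">Tom &amp; &quot;Jerry&quot;</p>
```

Because the tag alternative is tried first, quotes inside attributes are consumed as part of the tag and never reach the second group. The simplified tag pattern has the usual weaknesses: text such as 1 < 2 and 3 > 2 is treated as a tag, and script or comment content is not skipped the way the full pattern skips it.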