
Sorry for my bad English.

My example text is HTML, but the test must be applicable to any context.

I have this regex: "<\b[D-d][I-i][V-v]\b([^>]*)>"

I want to extend it so that it skips any > that appears between quotes, but I don't know how to do it.

See my example below:

<div badAttribute="who put a > here?" class="exemple"> [....] </div>

The expected match is:

<div badAttribute="who put a > here?" class="exemple">

[edit]

Another example: https://regex101.com/r/BQUENO/1

I have two keywords: a start keyword '001' and an end keyword '@'. I want "everything between 001 and @, ignoring any 001 and @ that are between quotes".

I started this regex to exclude @ and everything between quotes, but it doesn't work properly:

001("[^"]*")*([^@]*)*@

In my mind,

("[^"]*")*

means "everything between quotes (if present)", but it doesn't work.

Example string:

    001exemple@001@001Semper exitialis "fkjfk"cum subsidia ductor notissimus subsidia et ductor cui@
    001Annonas et "@"et contumaciter conspectum@
    001Quo amicissimos ad uxoriae certamen pecuniae tamen ="@" dirimi "klkj @"contentione nullam.@

Can you explain to me how to do it?

Uliat
  • You probably mean `[Dd]` unless the intent is to match a single character in the range D, E, F, ..., X, Y, Z, a, b, c, d – tripleee Dec 12 '17 at 11:40
  • Like so often asked, answered and explained - don't use regex to parse html. That's not what regex was made for. Your task is very easy to accomplish with a HTML/XML parser – baao Dec 12 '17 at 11:42
  • Possible duplicate of [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not) – baao Dec 12 '17 at 11:43
  • my example text is html but the test must be applicable to any context – Uliat Dec 12 '17 at 12:57

1 Answer


Depending on your regex dialect, something like this will skip double-quoted attribute values before the closing wedge.

<[Dd][Ii][Vv]( [A-Za-z0-9_]*="[^"]*")* *>

The parenthesized expression ( [A-Za-z0-9_]*="[^"]*")* matches a space followed by an attribute name, an equals sign, a double quote, any number of characters which are not a double quote (which conveniently includes < and >), and a closing double quote. The asterisk after the parenthesis says to accept this zero or more times. I added the possibility to have a space after the final closing quote, too.

There is really no way to completely cover every variation in well-written HTML, let alone real-world HTML, using regular expressions. Use an HTML parser if you need this to be robust, readable, accurate, and scalable.
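As a quick check, the pattern can be tried against the example string from the question. This is only a minimal sketch in Python's `re` dialect; the names `DIV_OPEN` and `NAIVE` are made up for illustration, and `NAIVE` is a simplified stand-in for the question's `[^>]*` approach rather than the exact original regex.

```python
import re

# Pattern from the answer: each attribute is " name="value"", and the
# value may contain anything except a double quote (so '>' is allowed).
DIV_OPEN = re.compile(r'<[Dd][Ii][Vv]( [A-Za-z0-9_]*="[^"]*")* *>')

# Simplified version of the question's approach (illustrative assumption):
# stop at the first '>' regardless of quoting.
NAIVE = re.compile(r'<[Dd][Ii][Vv]\b[^>]*>')

text = '<div badAttribute="who put a > here?" class="exemple"> [....] </div>'

print(DIV_OPEN.search(text).group(0))
# <div badAttribute="who put a > here?" class="exemple">

print(NAIVE.search(text).group(0))
# <div badAttribute="who put a >
```

The sketch only handles double-quoted attributes separated by a single space, which is exactly the limitation described above.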

tripleee
  • Your new regex skips double-quoted strings immediately after `001` but your examples don't contain any so that's probably not what you want. Maybe allow `[^@]*` before quoted strings, too. – tripleee Dec 13 '17 at 13:23
  • 001(([^"@]*)*("[^"]*")*([^"@]*)*)*@ means: the start keyword, then any number of groups made of 3 subgroups (any characters except a quote or the end keyword + any characters between quotes + any characters except a quote or the end keyword), and finally the end keyword. Am I right? – Uliat Dec 13 '17 at 14:31
  • You can make the groups "zero or more". Also, the parentheses around `[^@]*` are not really necessary or useful. So `001[^@]*("[^"]*"[^@]*)*@` might be closer to what you are looking for. – tripleee Dec 13 '17 at 14:39
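
For anyone who wants to experiment with the 001 ... @ case, here is a rough sketch in Python. It is not the pattern from the comment above verbatim: as an assumption on my part, the unquoted runs use `[^"@]*` instead of `[^@]*`, because `[^@]*` can swallow an opening quote and then stop at an `@` inside the quotes. The names `BLOCK` and `LOOSE` are hypothetical, and the sample is a shortened version of the question's data.

```python
import re

# Assumed variant: unquoted runs exclude both '"' and '@', so a quoted
# string can only be consumed whole by the "[^"]*" part.
BLOCK = re.compile(r'001[^"@]*(?:"[^"]*"[^"@]*)*@')

# The looser form, where the unquoted run may also consume a quote.
LOOSE = re.compile(r'001[^@]*(?:"[^"]*"[^@]*)*@')

sample = '001exemple@001@001Annonas et "@"et contumaciter conspectum@'

blocks = [m.group(0) for m in BLOCK.finditer(sample)]
for block in blocks:
    print(block)
# 001exemple@
# 001@
# 001Annonas et "@"et contumaciter conspectum@

quoted = '001Annonas et "@"et contumaciter conspectum@'
print(LOOSE.search(quoted).group(0))
# 001Annonas et "@
```

Note that this still treats a `001` that appears inside quotes as ordinary text only once a match is already in progress; it does not prevent a match from starting at a quoted `001`.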