
I need a regex to find <Field ...name="document"> or <FieldArray ...name="document"> and replace it with an empty string. The tags can be defined across multiple lines.

This is not HTML or XHTML; it's just a text string containing <Field> and <FieldArray>.

Example with Field:

      <Field
        component={FormField}
        name="document"
        typeInput="selectAutocomplete"
      />

Example with FieldArray:

      <FieldArray
        component={FormField}
        typeInput="selectAutocomplete"
        name="document"
      />

They are inside a list of components. Example:

      <Field
        name="amount"
        component={FormField}
        label={t('form.amount')}
      />
      <Field
        name="datereception"
        component={FormField}
        label={t('form.datereception')}
      />
      <Field
        component={FormField}
        name="document"
        typeInput="selectAutocomplete"
      />
      <Field
        name="datedeferred"
        component={FormField}
        label={t('form.datedeferred')}
      />

I have read some solutions, such as the one for finding src in "Extract image src from a string", but its structure is different from what I'm looking for.

beercohol
DDave

3 Answers


It is not advisable to parse [X]HTML with regex. If you can use a DOM parser, I would advise using that instead of regex.

If there is no other way, you could use this approach to find and replace your data:

<Field(?:Array)?\b(?=[^\/>]+name="document")[^>]+\/>

Explanation

  • Match <Field with an optional "Array", ending at a word boundary: <Field(?:Array)?\b
  • Open a positive lookahead: (?=
  • The lookahead asserts that what follows is one or more characters other than / or >, then name="document": [^\/>]+name="document"
  • Match any character except > one or more times: [^>]+
  • Match the closing />: \/>

var str = `<Field
    name="amount"
    component={FormField}
    label={t('form.amount')}
  />
  <Field
    name="datereception"
    component={FormField}
    label={t('form.datereception')}
  />
  <Field
    component={FormField}
    name="document"
    typeInput="selectAutocomplete"
  />
  <Field
    name="datedeferred"
    component={FormField}
    label={t('form.datedeferred')}
  />
<FieldArray
    component={FormField}
    typeInput="selectAutocomplete"
    name="document"
  /><FieldArray
    component={FormField}
    typeInput="selectAutocomplete"
    name="document"
  />` ;
str = str.replace(/<Field(?:Array)?\b(?=[^\/>]+name="document")[^>]+\/>/g, "");
console.log(str);
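
To see which tags the pattern accepts and rejects, here is a small sketch, not part of the original answer. The sample tags are made up for illustration, and the last one (a `/` inside an attribute value ahead of `name`) is an assumed edge case that the lookahead `[^\/>]+` cannot cross.

```javascript
// Pattern copied from the answer above, without the g flag so that
// test() is not affected by lastIndex between calls.
const pattern = /<Field(?:Array)?\b(?=[^\/>]+name="document")[^>]+\/>/;

// Illustrative sample tags (assumptions, not taken from the question).
const samples = {
  field: '<Field\n  component={FormField}\n  name="document"\n/>',
  fieldArray: '<FieldArray name="document" />',
  otherName: '<Field name="amount" />',
  otherTag: '<FieldDistractor name="document" />',
  slashBeforeName: '<Field label="a/b" name="document" />',
};

// Record whether the pattern matches each sample.
const results = {};
for (const [key, tag] of Object.entries(samples)) {
  results[key] = pattern.test(tag);
}
console.log(results);
```

The first two samples match. The `\b` rejects `FieldDistractor`, and the tag with a slash before `name` is skipped, so this pattern is only safe when no attribute value before `name` contains `/`.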
The fourth bird
  • i did not test your code in mine, but i think it's going to work, my code is not xhtml or html, just component tags – DDave Dec 19 '17 at 12:17
  • Given the generous lookahead here, your optional `(?:Array)?` doesn't do anything. maybe you intended to have a `\b` after it to denote the end of that tag? Also, your `[\s\S]+?` (nongreedy expansion) is expensive. Why not use `[^>]+` instead? [`]+name="document")[^>]+\/>`](https://regex101.com/r/hiDuwk/3). You might also be interested in using [template literals](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals) for multi-line strings to clean up that example. I'm not sure why there's a -1 on this answer, it looks good to me. – Adam Katz Dec 19 '17 at 16:05
  • @DDave – It looks like your code is XML, which has the same issue. You're still better off using an actual XML parser. DOM parsers can handle this. – Adam Katz Dec 19 '17 at 16:11
  • @AdamKatz Thank you for your comment! I have updated my answer. – The fourth bird Dec 19 '17 at 17:29
  • You may not believe this, but it's not good enough to use `[^>]`. Your regex matches `` which is valid html but does not contain the `name="document"` attrib/value. –  Dec 19 '17 at 18:31
  • @sln – Yes, parsing XML/HTML with a regex can be hazardous to your health and should be [avoided](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/). The [spec](https://www.w3.org/TR/html5/introduction.html#a-quick-introduction-to-html) doesn't address all corner cases (such as a `"` inside an attribute string enclosed with the same `"` character). Using `[^>]*` is generally Good Enough™, especially for parsing generated code that has no risk of being meddled with by an attacker. See also [my non-regex answer](https://stackoverflow.com/a/47892283/519360) to this question. – Adam Katz Dec 19 '17 at 21:44
  • @AdamKatz - tag parsing is _not_ html parsing. It is the first operation to get to the html instructions, the _can opener_. It's the same with any language first level parsing. `"(such as a " inside an attribute string enclosed with the same " character)."` - No, there is no such thing in html/xml. Quotes are a simple thing here, the very next quote terminates the string. Most entities that need to be inserted are encoded as such ( &quote; etc...). Any text existing inside a tag that does not terminate/start tags are strays with no meaning. `(?s)(?:".*?"|'.*?'|[^>]*?)+` passive the only way . –  Dec 19 '17 at 23:21

Here's an answer with actual XML parsing and no regular expressions:

var xml = document.createElement("xml");
xml.innerHTML = `
      <Field
        name="amount"
        component={FormField}
        label={t('form.amount')}
      />
      <FieldDistractor
        component={FormField}
        name="document"
        typeInput="selectAutocomplete"
      />
      <Field
        name="datereception"
        component={FormField}
        label={t('form.datereception')}
      />
      <Field
        component={FormField}
        name="document"
        typeInput="selectAutocomplete"
      />
      <Field
        name="datedeferred"
        component={FormField}
        label={t('form.datedeferred')}
      />
      <FieldArray
        component={FormField}
        typeInput="selectAutocomplete"
        name="document"
      /><FieldArray
        component={FormField}
        typeInput="selectAutocomplete"
        name="document"
      />
`;

var match = xml.querySelectorAll(
  `field:not([name="document"]), fieldarray:not([name="document"]),
    :not(field):not(fieldarray)`
);
var answer = "";
for (var m=0, ml=match.length; m<ml; m++) {
  // cloning the node removes children, working around the DOM bug
  answer += match[m].cloneNode().outerHTML + "\n";
}
console.log(answer);

In writing this answer, I found a bug in the DOM parser for both Firefox (Mozilla Core bug 1426224) and Chrome (Chromium bug 796305) that didn't allow creating empty elements via innerHTML. My original answer used regular expressions to pre- and post-process the code to make it work, but using regexes on XML is so unsavory that I later changed it to merely strip off children by using cloneNode() (with its implicit deep=false).

So we dump the XML into a dummy DOM element (which we don't need to place anywhere), then we run querySelectorAll() to match some CSS that specifies your requirements:

  • field:not([name="document"]): "Field" elements lacking name="document" attributes, or
  • fieldarray:not([name="document"]): "FieldArray" elements lacking that attribute, or
  • :not(field):not(fieldarray): any other element
Adam Katz
  • This `[^>]` by itself isn't sufficient to parse html tags. –  Dec 19 '17 at 18:32
  • I removed the regex code and used a non-regex workaround rather than dealing with ridiculously arcane XML-parsing issues (which are [the reason for avoiding regexes](https://stackoverflow.com/a/1732454/3195314) in the first place). – Adam Katz Dec 19 '17 at 21:03
  • Yeah but nobody's talking about parsing XML/Xhtml/html. The issue is parsing _tags_ or markup. Note that the given specs by w3c are written using regex to begin with. A typical use is a sax parser. Incase you don't think regex can be used, you can take a look at this which strips all html markup and invisible content from any html source: https://regex101.com/r/4jvwsH/1 –  Dec 19 '17 at 23:04
  • This is not a bug in either Chrome's or Firefox's DOM Parser. There are a limited number of empty elements in HTML, HTML is not XML. – Robert Longson Dec 20 '17 at 13:46

You can parse HTML tags with regex because parsing the tags themselves is nothing special; they are the first thing parsed, as an atomic operation.

But, you can't use regex to go beyond the atomic tag.
For example, you can't find the balanced tag closing to match the open as
this would put a tremendous strain on regex capability.

What a DOM parser does is use regex to parse the tags, then uses internal
algorithms to create a tree and carry out processing instructions to interpret
and recreate an image.
And of course regex doesn't do that.

Sticking strictly to parsing tags, including invisible content (like script),
is not that easy either.
Content can hide or embed tags that, when you look for them, you shouldn't
find them.

So, in essence, you have to parse the entire HTML file to find the real
tag you're looking for.
There is a general regex that can do this that I will not include here.
But if you need it let me know.

So, if you want to jump straight into the fire without parsing all the
tags of the entire file, this is the regex to use.

It is essentially a cut up version of the one that parses all tags.
This flavor finds the tag and any attribute=value that you need,
and also finds them out-of-order.
It can also be used to find out-of-order, multiple attr/val's within the same tag.

This is for your usage:

/<Field(?:Array)?(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sname\s*=\s*(?:(['"])\s*document\s*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+\/>/

Explained/Formatted

 < Field                # Field or  FieldArray  tag
 (?: Array )?

 (?=                    # Assertion (a pseudo atomic group)
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      \s name \s* = \s* 
      (?:
           ( ['"] )               # (1), Quote
           \s* document \s*       # With name = "document"
           \1 
      )
 )
 \s+ 
 (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]*? )+
 />

Running demo: https://regex101.com/r/ieEBj8/1
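
A minimal sketch of how this regex might be applied to the question's text in JavaScript. The `g` flag and the helper name `removeDocumentFields` are additions for illustration, not part of the answer, and the input is a shortened version of the question's sample.

```javascript
// Regex copied from the answer above, with the g flag added (an assumption)
// so that replace() removes every matching tag, not just the first.
const documentTag = /<Field(?:Array)?(?=(?:[^>"']|"[^"]*"|'[^']*')*?\sname\s*=\s*(?:(['"])\s*document\s*\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+\/>/g;

// Hypothetical helper: returns the text with the matching tags removed.
function removeDocumentFields(text) {
  return text.replace(documentTag, "");
}

const input = `
<Field
  name="amount"
  component={FormField}
  label={t('form.amount')}
/>
<Field
  component={FormField}
  name="document"
  typeInput="selectAutocomplete"
/>
<FieldArray
  component={FormField}
  typeInput="selectAutocomplete"
  name="document"
/>
<Field
  name="datedeferred"
  component={FormField}
  label={t('form.datedeferred')}
/>`;

const output = removeDocumentFields(input);
console.log(output);
```

On this sample only the amount and datedeferred tags remain. The line breaks that surrounded the removed tags are left in place, so trim them separately if that matters.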

  • Dave - This is grade A stuff. If I were you I'd write it down so you don't lose it .. –  Dec 18 '17 at 22:48
  • Thanks sln, I'm going to study your code. My code is not full HTML, it's just a string containing Field and FieldArray. I did not understand what you mean by 'write it down'. – DDave Dec 19 '17 at 12:14
  • @DDave - If it were just a string containing Field and FieldArray then you can't tell where they begin and end compared to something else without using delimiter parsing rules. Especially when you're looking for a specific attribute / value (or ah, sub-expression I mean). Don't think you're fooling anybody. What I mean by _write it down_ is, this regex form is a gold standard I developed years ago and has been used for big scraping projects. I disseminate it freely, but I don't often fully explain it (by design). This is custom for you, different for someone else, etc.. –  Dec 23 '17 at 19:52