RegEx: Is there a way to capture N identical groups within an XML tag without reaching into the next identical tag?

Question

I have an XML file that would look something like this:

<Table>
  <Persons>
    <Person>
      <ID>71</ID>
      <FullNameLikeX>"sentence expected"</FullNameLikeX>
      <Age>49</Age>
      <FavoriteFood>Banana</FavoriteFood>
      <NameParts>
        <word>Jhon</word>
        <word>Henry</word>
        <word>Abbot</word>
      </NameParts>
    </Person>
    <Person>
      <ID>72</ID>
      <FullNameLikeX>"sentence expected"</FullNameLikeX>
      <Age>26</Age>
      <FavoriteFood>Cake</FavoriteFood>
      <NameParts>
        <word>Cecilia</word>
        <word>Elisabeth</word>
        <word>Maria</word>
        <word>Smith</word>
      </NameParts>
    </Person>
    <Person>
      <ID>73</ID>
      <FullNameLikeX>"sentence expected"</FullNameLikeX>
      <Age>17</Age>
      <FavoriteFood>Lasagna</FavoriteFood>
      <NameParts>
        <word>Luc</word>
        <word>Hernandez</word>
      </NameParts>
    </Person>
  </Persons>
</Table>

I am trying to replace the "sentence expected" part with the actual sentence (for the first person here, that would give "Jhon Henry Abbot like Banana") using a regular expression in a text editor (Notepad++). My problem is that I can't find a way to deal with the varying number of "word" tags within the "NameParts" tag without a group either overreaching into the next "Person" tag or ending up empty.

I came up with this regular expression: (<FullNameLikeX>")[\s\S]*?("<\/FullNameLikeX>)([\s\S]*?<FavoriteFood>([\s\S]*?)<\/FavoriteFood>[\s\S]*?<NameParts>###[\s\S]*?<\/NameParts>)

In place of ###, I already tried putting multiple copies (from 1 to 4) of each of the following:

(?:[\s\S]*?<word>([\s\S]*?)<\/word>)? but the groups end up reaching into the next Person when there are fewer words than groups.

(?:[\s\S]*?<word>([\s\S]*?)<\/word>)?? it doesn't reach into the next Person, but no group is ever attempted.

(?:[\s\S]*?<word>([\s\S]*?)<\/word>)+? the groups end up reaching into the next Person when there are fewer words than groups.

(?:[\s\S]*?<word>([\s\S]*?)<\/word>(?![\s\S]*?<\/Person>[\s\S]*?))? it doesn't reach into the next Person, but the capture groups are somehow empty.

So basically, some groups always either attempt 1 iteration when they should not and end up overreaching into the next Person tag, or they take 0 iterations when they should take 1.

Is there a way to capture a varying number of XML tag values without reaching into another tag using just a regular expression, or is it simply not possible?

PS: This XML file is just a lookalike. The actual file is much longer and its tag names and values are obscured; I replaced them with simple ones for readability, but the format of the file stays the same. (Also, there never seem to be fewer than 1 or more than 5 "word" tags per "NameParts" tag, if that helps.)

Why not use an XML library for XML parsing? While there may be some simple use cases for regex, it generally isn't capable of parsing context-free grammars (such as XML). — derpirscher, May 07 '23 at 14:46
Use one of `xidel`, `xmlstarlet`, `xmllint` and edit your question to add your **expected output**. — Gilles Quénot, May 07 '23 at 14:51
@derpirscher I'm not trying to parse the XML just replace the value of the FullNameLikeX tag by the combined value of some other tag, the solution would be simple if it wasn't for the varying amount of "word" tag within a "NameParts" tag. — Nidar, May 07 '23 at 15:08
@GillesQuénot the expected output should be what is in the first sentence after the XML code block. As for using those it doesn't solve the base regex issue that could also be a concern outside Xml(didn't explain it in the post in another way than with my XML problem since i can't find the words to describe the situation in a non-confusing way outside of XML). — Nidar, May 07 '23 at 15:12
**[Don't use `regex` to parse `HTML/XML`](https://stackoverflow.com/a/49352373/465183)** you cannot, must not parse any structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. A great majority of languages have built-in support for parsing XML and there are dedicated tools like `xidel`, `xmlstarlet` or `xmllint` if you need a quick shot from a command line shell. — Gilles Quénot, May 07 '23 at 15:14
SO is a professional developer Q&A website. It's not intended for end users unless you know what you are doing, asking good questions with proper tags and **proper tools.** — Gilles Quénot, May 07 '23 at 15:18
@GillesQuénot I'M NOT PARSING XML, just replacing one part of the text with a combination of other parts of the text, the issue being that the number of other parts may vary. The problem presented here doesn't only affect the XML format; it could also affect plain text. The only difference between the XML file here and a similar problem in plain text is that in the XML all superfluous words are removed and the punctuation is replaced by tags. The issue stays the same. — Nidar, May 07 '23 at 15:22
You cannot *replace* part of something without *parsing* that something first (because you have to find what to replace, and seemingly the replacement is also to be found somewhere else in your XML. What would you call that if not parsing?). And regex is not suitable for parsing XML ... — derpirscher, May 07 '23 at 21:04

Nidar · Answer 1 · 2023-05-07T18:34:53.173

[ANSWER]

Since people are just obsessed with their idea of using another tool and refuse to actually give an answer, I kept thinking about it myself and found a solution, which turned out to be simple. The error I committed in the attempt shown in the question is that I used [\s\S]*?, which allowed the group to reach as far as it wanted. Using [^<]*? instead prevents the group from crossing into another tag, because it cannot match past a "<". So the ### in the pattern from the question must be replaced by:

(?:[^<]*?<word>([^<]*?)<\/word>)?

This pattern must be repeated at least as many times as the maximum number of "word" tags that can appear in a single "NameParts" tag; otherwise some tag values will be missing for some persons.
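The difference between the two fillers can be checked outside Notepad++. The following is only a minimal sketch in Python's `re` module, not the editor's own engine; the shortened `sample` string and the `build` helper are hypothetical names made up for this illustration. It puts three optional "word" slots against a "NameParts" block that only holds two words.

```python
import re

# Two adjacent NameParts blocks: the first has 2 words, the second has 3.
sample = (
    "<NameParts><word>Luc</word><word>Hernandez</word></NameParts>"
    "<NameParts><word>Jhon</word><word>Henry</word><word>Abbot</word></NameParts>"
)

def build(filler, slots=3):
    """Build a pattern with `slots` optional <word> groups using `filler`."""
    slot = rf"(?:{filler}<word>({filler})</word>)?"
    return re.compile(r"<NameParts>" + slot * slots + r"[\s\S]*?</NameParts>")

loose = build(r"[\s\S]*?")   # filler may cross any tag
strict = build(r"[^<]*?")    # filler stops at the next "<"

loose_groups = loose.search(sample).groups()
strict_groups = strict.search(sample).groups()

print(loose_groups)   # third slot is filled from the NEXT NameParts block
print(strict_groups)  # third slot stays unmatched (None)
```

With `[\s\S]*?` the third slot skips over `</NameParts><NameParts>` to find another `<word>`, whereas `[^<]*?` cannot step over the `<` of `</NameParts>`, so the slot simply matches nothing.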

So for my XML File find:

(<FullNameLikeX>")[\s\S]*?("<\/FullNameLikeX>)([\s\S]*?<FavoriteFood>([\s\S]*?)<\/FavoriteFood>[\s\S]*?<NameParts>(?:[^<]*?<word>([^<]*?)<\/word>)?(?:[^<]*?<word>([^<]*?)<\/word>)?(?:[^<]*?<word>([^<]*?)<\/word>)?(?:[^<]*?<word>([^<]*?)<\/word>)?[\s\S]*?<\/NameParts>)

replace by:

$1$5 $6 $7 $8 like $4$2$3

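This gives the expected output. Where a person has fewer than four words, the unmatched groups are empty, so extra spaces are left before "like". For the example, it produces the following XML file: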

<Table>
  <Persons>
    <Person>
      <ID>71</ID>
      <FullNameLikeX>"Jhon Henry Abbot  like Banana"</FullNameLikeX>
      <Age>49</Age>
      <FavoriteFood>Banana</FavoriteFood>
      <NameParts>
        <word>Jhon</word>
        <word>Henry</word>
        <word>Abbot</word>
      </NameParts>
    </Person>
    <Person>
      <ID>72</ID>
      <FullNameLikeX>"Cecilia Elisabeth Maria Smith like Cake"</FullNameLikeX>
      <Age>26</Age>
      <FavoriteFood>Cake</FavoriteFood>
      <NameParts>
        <word>Cecilia</word>
        <word>Elisabeth</word>
        <word>Maria</word>
        <word>Smith</word>
      </NameParts>
    </Person>
    <Person>
      <ID>73</ID>
      <FullNameLikeX>"Luc Hernandez   like Lasagna"</FullNameLikeX>
      <Age>17</Age>
      <FavoriteFood>Lasagna</FavoriteFood>
      <NameParts>
        <word>Luc</word>
        <word>Hernandez</word>
      </NameParts>
    </Person>
  </Persons>
</Table>
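The find/replace above can also be reproduced in a script as a sanity check. This is a sketch under stated assumptions: it uses Python's `re` module rather than Notepad++, `SAMPLE` is a trimmed copy of the example with only two persons, the `\/` escapes are dropped because Python does not need them, and `$n` is written as `\g<n>`. The names `WORD`, `FIND` and `REPLACE` are made up for the illustration.

```python
import re

SAMPLE = """<Person>
  <FullNameLikeX>"sentence expected"</FullNameLikeX>
  <FavoriteFood>Cake</FavoriteFood>
  <NameParts>
    <word>Cecilia</word>
    <word>Elisabeth</word>
    <word>Maria</word>
    <word>Smith</word>
  </NameParts>
</Person>
<Person>
  <FullNameLikeX>"sentence expected"</FullNameLikeX>
  <FavoriteFood>Lasagna</FavoriteFood>
  <NameParts>
    <word>Luc</word>
    <word>Hernandez</word>
  </NameParts>
</Person>"""

# One optional <word> slot; the filler [^<]*? cannot cross a tag boundary.
WORD = r"(?:[^<]*?<word>([^<]*?)</word>)?"

# Groups: 1 = opening tag and quote, 2 = quote and closing tag,
# 3 = everything up to </NameParts>, 4 = favorite food, 5-8 = words.
FIND = re.compile(
    r'(<FullNameLikeX>")[\s\S]*?("</FullNameLikeX>)'
    r"([\s\S]*?<FavoriteFood>([\s\S]*?)</FavoriteFood>"
    r"[\s\S]*?<NameParts>" + WORD * 4 + r"[\s\S]*?</NameParts>)"
)

# Same order as "$1$5 $6 $7 $8 like $4$2$3"; unmatched groups become empty
# strings in Python 3.5 and later.
REPLACE = r"\g<1>\g<5> \g<6> \g<7> \g<8> like \g<4>\g<2>\g<3>"

result = FIND.sub(REPLACE, SAMPLE)
print(result)
```

The person with only two words ends up with three spaces before "like", the same artifact visible in the output above, because the two unmatched slots contribute nothing between the literal spaces of the replacement.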

Still, if someone knows a better way to achieve the same result (for example, without having to duplicate the (?:[^<]*?<word>([^<]*?)<\/word>)? part to match the maximum possible count), it would be appreciated.
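One possible way to avoid the duplication, sketched here as an assumption rather than as a Notepad++ feature, is to leave the plain search/replace dialog and use a replacement callback in a script. The regex then captures the whole "NameParts" body once, and a second, simpler regex collects however many "word" values it contains. The names `BLOCK_RE`, `WORD_RE`, `fill_sentence` and `fill_all` are hypothetical.

```python
import re

SAMPLE = """<Person>
  <FullNameLikeX>"sentence expected"</FullNameLikeX>
  <FavoriteFood>Cake</FavoriteFood>
  <NameParts>
    <word>Cecilia</word>
    <word>Elisabeth</word>
    <word>Maria</word>
    <word>Smith</word>
  </NameParts>
</Person>
<Person>
  <FullNameLikeX>"sentence expected"</FullNameLikeX>
  <FavoriteFood>Lasagna</FavoriteFood>
  <NameParts>
    <word>Luc</word>
    <word>Hernandez</word>
  </NameParts>
</Person>"""

# Groups: 1 = opening tag and quote, 2 = quote and closing tag,
# 3 = everything up to </NameParts>, 4 = food, 5 = raw NameParts body.
BLOCK_RE = re.compile(
    r'(<FullNameLikeX>")[^<]*("</FullNameLikeX>)'
    r"([\s\S]*?<FavoriteFood>([^<]*)</FavoriteFood>"
    r"[\s\S]*?<NameParts>([\s\S]*?)</NameParts>)"
)
WORD_RE = re.compile(r"<word>([^<]*)</word>")

def fill_sentence(match):
    """Build the sentence from however many <word> tags the block holds."""
    open_tag, close_tag, rest, food, name_parts = match.groups()
    name = " ".join(WORD_RE.findall(name_parts))
    return f"{open_tag}{name} like {food}{close_tag}{rest}"

def fill_all(xml_text):
    return BLOCK_RE.sub(fill_sentence, xml_text)

result = fill_all(SAMPLE)
print(result)
```

Because the words are joined after being collected, this variant has no fixed upper limit on the number of "word" tags and does not leave extra spaces when a person has fewer words.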

That has nothing to do with obsession; it has to do with using the right tool for the job. Yes, you may eventually be able to put a small nail into the wall with a screwdriver, but using a hammer makes much more sense. And at the latest, once you have to put a big nail in somewhere, you are lost with the screwdriver. But as you refused to learn how to use a hammer on easy tasks, you won't be able to achieve complicated tasks with it. — derpirscher, May 07 '23 at 21:14
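For comparison, the parser-based route recommended in the comments could look roughly like the sketch below. It assumes Python's standard `xml.etree.ElementTree` module rather than the `xidel`, `xmlstarlet` or `xmllint` tools named above, and it assumes the real file uses the same tag names as the example. The function name `fill_sentences` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Table>
  <Persons>
    <Person>
      <ID>73</ID>
      <FullNameLikeX>"sentence expected"</FullNameLikeX>
      <Age>17</Age>
      <FavoriteFood>Lasagna</FavoriteFood>
      <NameParts>
        <word>Luc</word>
        <word>Hernandez</word>
      </NameParts>
    </Person>
  </Persons>
</Table>"""

def fill_sentences(xml_text):
    """Rewrite each FullNameLikeX from its sibling NameParts and FavoriteFood."""
    root = ET.fromstring(xml_text)
    for person in root.iter("Person"):
        # Only <word> children of this person's own <NameParts> are visited,
        # so there is nothing to overreach into.
        words = [w.text or "" for w in person.findall("./NameParts/word")]
        food = person.findtext("FavoriteFood", default="")
        target = person.find("FullNameLikeX")
        if target is not None:
            name = " ".join(words)
            target.text = f'"{name} like {food}"'
    return ET.tostring(root, encoding="unicode")

result = fill_sentences(SAMPLE)
print(result)
```

The number of "word" tags does not matter here, since the lookup is scoped to each "Person" element instead of to a span of text.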
