conditional remove with variable string regex

Question

I have searched across a number of Q&As and can't find a solution specific enough to help.

I have a large xml file and need to do a conditional 'remove' in one field depending on the value in another field.

For example:

<vehicle>...<manufacturer>JCB</manufacturer>....<item_category>JCB Tractors</item_category>...</vehicle>
<vehicle>...<manufacturer>Caterpillar</manufacturer>....<item_category>Digger</item_category>...</vehicle>
<vehicle>...<manufacturer>Caterpillar</manufacturer>....<item_category>Caterpillar Digger</item_category>...</vehicle>

needs to become

<vehicle>...<manufacturer>JCB</manufacturer>...<item_category>Tractors</item_category>...</vehicle>
<vehicle>...<manufacturer>Caterpillar</manufacturer>...<item_category>Digger</item_category>...</vehicle>
<vehicle>...<manufacturer>Caterpillar</manufacturer>....<item_category>Digger</item_category>...</vehicle>

Ideally the solution would be something I could apply with the find-and-replace function in TextPad, set to POSIX extended regex.

Really appreciate help on this one as I have been banging my head against it for a while!

If I use a parser, I can isolate the variable string I want to 'remove' using

(?<=<manufacturer>)(.*?)(?=<\/manufacturer>)

Is it possible to use that pattern to isolate the string I actually want to remove? For example:

(?<=<item_category>)(?<=<manufacturer>)(.*?)(?=<\/manufacturer>)(\s)

My advice is to use an XML parser instead of regex. Could you please explain in more detail what you want to achieve? — Amen Jlili, Apr 23 '15 at 13:43
Thanks for replying - I am trying to keep large (1GB) xml files intact, and fix 'corrupted' category fields that have occasionally been populated with the manufacturer string as well as the category — Eric, Apr 23 '15 at 13:53
Hi - I have adjusted the examples. I have just shown the relevant fields, the 'dots' represent a number of other fields within the record. — Eric, Apr 23 '15 at 14:23

Regular Jo · Accepted Answer · 2015-04-26T08:20:45.973

Suggestions that you use a parser are spot on.

Dealing with tags in regex can be a nightmare. Some programs choke on regex patterns in large text files and can corrupt the data. Make sure you back up your work first.
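To make the parser suggestion concrete, here is a minimal Python sketch that streams the file with the standard library's `xml.etree.ElementTree.iterparse`, so a 1GB file is never held in memory at once. The function name `fix_categories`, the root element name `vehicles` in the usage lines, and the assumption that every `<vehicle>` is a direct child of the root are mine, not from the question. The sketch also does not copy attributes on the root element or an XML declaration.

```python
import io
import xml.etree.ElementTree as ET


def fix_categories(source, sink):
    """Stream <vehicle> records from source, fix each one, write bytes to sink."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    sink.write(f"<{root.tag}>".encode())
    for event, elem in context:
        if event == "end" and elem.tag == "vehicle":
            maker = elem.findtext("manufacturer", default="")
            category = elem.find("item_category")
            if maker and category is not None and category.text:
                # Drop the manufacturer only when it leads the category text.
                if category.text.startswith(maker):
                    category.text = category.text[len(maker):].lstrip()
            sink.write(ET.tostring(elem))
            root.clear()  # release records already written to keep memory flat
    sink.write(f"</{root.tag}>".encode())


# Usage on an in-memory sample; real use would pass open binary files instead.
sample = io.BytesIO(
    b"<vehicles>"
    b"<vehicle><manufacturer>JCB</manufacturer>"
    b"<item_category>JCB Tractors</item_category></vehicle>"
    b"<vehicle><manufacturer>Caterpillar</manufacturer>"
    b"<item_category>Digger</item_category></vehicle>"
    b"</vehicles>"
)
output = io.BytesIO()
fix_categories(sample, output)
print(output.getvalue().decode())
```

A record with a missing field is simply written back unchanged here, which is the main advantage over matching across tags with a regex.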

That said, I also saw an opportunity to have some fun with this. It is only possible because the manufacturer name, when it appears in the item_category, is always the first part of it.

DEMO: https://regex101.com/r/rO7pM0/1

Pattern:

(\<manufacturer>([^<]*)<\/manufacturer>)(\s*)(\<item_category>)(?:\2\s*)?([^<]*)(<\/item_category>)

Explanation:

 (                            # Opens CG1
     \<manufacturer>          # Literal <manufacturer>
     (                        # Opens CG2
         [^<]*                # Negated Character class (excludes the characters within)
                                # None of: <
                                # * repeats zero or more times
     )                        # Closes CG2
     <                        # Literal <
     \/                       # Literal /
     manufacturer             # Literal manufacturer
     >                        # Literal >
 )                            # Closes CG1
 (                            # Opens CG3
     \s*                      # Token: \s (white space)
                                # * repeats zero or more times
 )                            # Closes CG3
 (                            # Opens CG4
     \<item_category>         # Literal <item_category>
 )                            # Closes CG4
 (?:                          # Opens NCG
     \2                       # A backreference to CG2
     \s*                      # Token: \s (white space)
                                # * repeats zero or more times
 )?                           # Closes NCG
                                # ? repeats zero or one times
 (                            # Opens CG5
     [^<]*                    # Negated Character class (excludes the characters within)
                                # None of: <
                                # * repeats zero or more times
 )                            # Closes CG5
 (                            # Opens CG6
     <                        # Literal <
     \/                       # Literal /
     item_category            # Literal item_category
     >                        # Literal >
 )                            # Closes CG6
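
The answer gives the search pattern but not the replacement. As a rough sketch, here is the same pattern applied in Python. The replacement string `\1\3\4\5\6` and the function name `strip_manufacturer_prefix` are my assumptions: the idea is to write back every capture group and leave out the optional non-capturing group that holds the repeated manufacturer name.

```python
import re

# The pattern from the answer, unchanged. In Python's re module, \< and \/
# are simply the literal characters '<' and '/'.
PATTERN = re.compile(
    r"(\<manufacturer>([^<]*)<\/manufacturer>)(\s*)(\<item_category>)"
    r"(?:\2\s*)?([^<]*)(<\/item_category>)"
)


def strip_manufacturer_prefix(xml_text: str) -> str:
    """Re-emit all six capture groups; the optional prefix group is dropped."""
    return PATTERN.sub(r"\1\3\4\5\6", xml_text)


before = (
    "<manufacturer>JCB</manufacturer> "
    "<item_category>JCB Tractors</item_category>"
)
print(strip_manufacturer_prefix(before))
```

In Python, `\<` is a literal `<`. Some other engines (GNU grep, for one) read `\<` as a start-of-word anchor, so a plain `<` may be the safer spelling when porting the pattern to another tool.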

Changing (\s*), which in the demo matches the whitespace between the two tags, to ([\s\S]*?) should handle the other tags that sit between them, which your question doesn't show. That requires every vehicle tag to contain both a manufacturer and an item_category tag, though. If one is missing, the match can run on into the next record and you end up with corrupted data, which is one reason why the parser is the better solution.
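
A small Python sketch of that variant, showing both the intended behaviour and the failure mode. The `<model>` field, the sample records, and the names `LAZY_PATTERN` and `fix` are made up for illustration; only the swap of (\s*) for ([\s\S]*?) comes from the answer.

```python
import re

# Same pattern as above, with (\s*) replaced by the lazy ([\s\S]*?).
LAZY_PATTERN = re.compile(
    r"(\<manufacturer>([^<]*)<\/manufacturer>)([\s\S]*?)(\<item_category>)"
    r"(?:\2\s*)?([^<]*)(<\/item_category>)"
)


def fix(xml_text: str) -> str:
    return LAZY_PATTERN.sub(r"\1\3\4\5\6", xml_text)


# Well-formed record: another field sits between the two tags.
good = (
    "<vehicle><manufacturer>JCB</manufacturer><model>3CX</model>"
    "<item_category>JCB Tractors</item_category></vehicle>"
)
fixed_good = fix(good)

# The first record has no item_category, so the lazy group keeps expanding
# until it reaches the item_category of the NEXT record and edits that one.
bad = (
    "<vehicle><manufacturer>JCB</manufacturer><model>3CX</model></vehicle>"
    "<vehicle><manufacturer>Caterpillar</manufacturer><model>D6</model>"
    "<item_category>JCB Loader</item_category></vehicle>"
)
fixed_bad = fix(bad)

print(fixed_good)
print(fixed_bad)
```

In the second case the Caterpillar record loses "JCB" from its category even though its own manufacturer is Caterpillar, which is the kind of silent corruption the warning above is about.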

Thanks so much for this. My XML has no nesting and the fields are in a consistent order and always present so this approach should be ok. — Eric, Apr 26 '15 at 10:11
I can get the regex working well in tools such as https://regex101.com/ however in TextPad it doesn't seem to work. Breaking down the regex to just look at (\<manufacturer>([^<]*)<\/manufacturer>)([\s\S]*?)(\<item_category>) it works fine to find capture groups 1 and 2, however it doesn't cope with CG 3 and 4 - any thoughts? — Eric, Apr 26 '15 at 10:19
