
I have searched across a number of Q&As and can't find a solution specific enough to help.

I have a large XML file and need to conditionally remove text from one field depending on the value in another field.

For example:

<vehicle>...<manufacturer>JCB</manufacturer>....<item_category>JCB Tractors</item_category>...</vehicle>
<vehicle>...<manufacturer>Caterpillar</manufacturer>....<item_category>Digger</item_category>...</vehicle>
<vehicle>...<manufacturer>Caterpillar</manufacturer>....<item_category>Caterpillar Digger</item_category>...</vehicle>

needs to become

<vehicle>...<manufacturer>JCB</manufacturer>...<item_category>Tractors</item_category>...</vehicle>
<vehicle>...<manufacturer>Caterpillar</manufacturer>...<item_category>Digger</item_category>...</vehicle>
<vehicle>...<manufacturer>Caterpillar</manufacturer>....<item_category>Digger</item_category>...</vehicle>

Ideally the solution would be something I could apply using the find and replace functionality in TextPad, set to POSIX extended regex.

Really appreciate help on this one as I have been banging my head against it for a while!

If I use a parser, I can isolate the variable string I want to 'remove' using

(?<=<manufacturer>)(.*?)(?=<\/manufacturer>)

Is it possible to use that pattern to isolate the string I actually want to remove?

e.g.,

(?<=<item_category>)(?<=<manufacturer>)(.*?)(?=<\/manufacturer>)(\s)
Eric
  • My advice is to use an XML parser instead of regex. Could you explain in more detail what you want to achieve? – Amen Jlili Apr 23 '15 at 13:43
  • Thanks for replying - I am trying to keep large (1GB) xml files intact, and fix 'corrupted' category fields that have occasionally been populated with the manufacturer string as well as the category – Eric Apr 23 '15 at 13:53
  • Yes. Your example is not clear though. – Amen Jlili Apr 23 '15 at 13:56
  • Hi - I have adjusted the examples. I have just shown the relevant fields, the 'dots' represent a number of other fields within the record. – Eric Apr 23 '15 at 14:23

1 Answer


Suggestions that you use a parser are spot on.

Dealing with tags in regex can be a nightmare. Some programs choke on regex patterns in large text files and can corrupt the data, so make sure you back up your work first.

But I simultaneously saw an opportunity to have some fun with this. This is only possible because the Manufacturer name is the same as the first part of the item_category.

DEMO: https://regex101.com/r/rO7pM0/1

Pattern:

(\<manufacturer>([^<]*)<\/manufacturer>)(\s*)(\<item_category>)(?:\2\s*)?([^<]*)(<\/item_category>)

Explanation:

 (                            # Opens CG1
     \<manufacturer>          # Literal <manufacturer>
     (                        # Opens CG2
         [^<]*                # Negated Character class (excludes the characters within)
                                # None of: <
                                # * repeats zero or more times
     )                        # Closes CG2
     <                        # Literal <
     \/                       # Literal /
     manufacturer             # Literal manufacturer
     >                        # Literal >
 )                            # Closes CG1
 (                            # Opens CG3
     \s*                      # Token: \s (white space)
                                # * repeats zero or more times
 )                            # Closes CG3
 (                            # Opens CG4
     \<item_category>         # Literal <item_category>
 )                            # Closes CG4
 (?:                          # Opens NCG
     \2                       # A backreference to CG2
     \s*                      # Token: \s (white space)
                                # * repeats zero or more times
 )?                           # Closes NCG
                                # ? repeats zero or one times
 (                            # Opens CG5
     [^<]*                    # Negated Character class (excludes the characters within)
                                # None of: <
                                # * repeats zero or more times
 )                            # Closes CG5
 (                            # Opens CG6
     <                        # Literal <
     \/                       # Literal /
     item_category            # Literal item_category
     >                        # Literal >
 )                            # Closes CG6
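
For anyone who wants to try the pattern outside a text editor, here is a minimal sketch in Python. The answer does not state a replacement string, so `\1\3\4\5\6` (every capture group except the optional manufacturer prefix) is my assumption, and `strip_manufacturer_prefix` and the sample data are made up for illustration. The `<` characters are left unescaped here, which Python's `re` module treats as plain literals.

```python
import re

# Same structure as the pattern explained above, with < left unescaped.
PATTERN = re.compile(
    r"(<manufacturer>([^<]*)</manufacturer>)"  # CG1, containing CG2 (the name)
    r"(\s*)"                                   # CG3: whitespace between the tags
    r"(<item_category>)"                       # CG4
    r"(?:\2\s*)?"                              # optional manufacturer prefix, not kept
    r"([^<]*)"                                 # CG5: the rest of the category
    r"(</item_category>)"                      # CG6
)


def strip_manufacturer_prefix(xml_text: str) -> str:
    """Hypothetical helper: rebuild each match without the prefix."""
    # Assumed replacement: all capture groups, skipping the non-capturing one.
    return PATTERN.sub(r"\1\3\4\5\6", xml_text)


sample = (
    "<vehicle><manufacturer>JCB</manufacturer> "
    "<item_category>JCB Tractors</item_category></vehicle>"
    "<vehicle><manufacturer>Caterpillar</manufacturer> "
    "<item_category>Digger</item_category></vehicle>"
)
cleaned = strip_manufacturer_prefix(sample)
print(cleaned)
```

The second record is left untouched because its category does not start with the manufacturer name, so the optional group simply matches nothing.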

Changing (\s*), which in the demo matches the space between the two tags, to ([\s\S]*?) should handle the other tags in between that your question leaves out. That only works if every vehicle tag has both a manufacturer and an item_category tag. If one is missing, the lazy match can run on into the next record and you end up with corrupted data, which is one reason why the parser is the better solution.
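
To make the parser suggestion concrete, here is a rough sketch using `iterparse` from Python's standard `xml.etree.ElementTree`, which reads the file as a stream rather than loading it all at once. The function name `fix_categories` and the `<vehicles>` root in the sample are hypothetical, since the question does not show the real root element. It assumes `manufacturer` and `item_category` are direct children of each `vehicle`, and it only writes out the root tag and the vehicle records, so root attributes and whitespace between records are not preserved.

```python
import io
import xml.etree.ElementTree as ET


def fix_categories(source, sink):
    """Hypothetical helper: stream vehicle records, dropping the
    manufacturer name from the start of item_category when present."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # first event is the start of the root
    sink.write(f"<{root.tag}>".encode())
    for event, elem in context:
        if event != "end" or elem.tag != "vehicle":
            continue
        maker = elem.findtext("manufacturer", default="").strip()
        category = elem.find("item_category")
        if maker and category is not None and category.text:
            if category.text.startswith(maker + " "):
                category.text = category.text[len(maker):].lstrip()
        elem.tail = None  # trailing whitespace is not guaranteed to be read yet
        sink.write(ET.tostring(elem))
        root.clear()  # drop finished records so memory stays flat
    sink.write(f"</{root.tag}>".encode())


sample = (
    b"<vehicles>"
    b"<vehicle><manufacturer>JCB</manufacturer>"
    b"<item_category>JCB Tractors</item_category></vehicle>"
    b"<vehicle><manufacturer>Caterpillar</manufacturer>"
    b"<item_category>Digger</item_category></vehicle>"
    b"</vehicles>"
)
out = io.BytesIO()
fix_categories(io.BytesIO(sample), out)
result = out.getvalue().decode()
print(result)
```

Because each record is handled as an element, a vehicle with a missing field is just skipped rather than causing a match to spill into its neighbour.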

Regular Jo
  • Thanks so much for this. My XML has no nesting and the fields are in a consistent order and always present so this approach should be ok. – Eric Apr 26 '15 at 10:11
  • I can get the regex working well in tools such as https://regex101.com/, but in TextPad it doesn't seem to work. Breaking the regex down to just (\<manufacturer>([^<]*)<\/manufacturer>)([\s\S]*?)(\<item_category>), it finds capture groups 1 and 2 fine but doesn't cope with CG 3 and 4. Any thoughts? – Eric Apr 26 '15 at 10:19