-1

Not 100% sure what I am doing wrong. Unfortunately I need to parse XML with regex and not Beautiful soup or other. This is supposed to replace Match with comment

CODE:

import re, shutil

TAG_NAME = 'ruleDefinition'
CASE_LABEL  = 'case'

word_file = 'MetaData.txt'
xml_file  = 'ProcessAll.xml'
extension = '.bak'
backup    = xml_file + extension

#  Create backup
shutil.copy2(xml_file, backup)

with open(word_file) as words:
    regex = r'<[^>]+ field=$"({})"[^>]+>'.format(
        '|'.join(
            sorted((word.rstrip('\r\n') for word in words), key=len, reverse=True)
        )
    )

with open(xml_file, 'w') as new_xml:
    with open(backup) as xml:
        names = []
        start = False
        entry = ''
        for line in xml:
            # start tag
            if re.findall(r'<{}[^>]*>'.format(TAG_NAME), line):
                start = True
            # end tag
            if '</{}'.format(TAG_NAME) in line:
                start = False
                if names:
                    new_xml.write('<!-- Removed ' + ','.join(names) + ' -->\n')
                    names = []
            # inside tag
            if start:
                if len(entry):
                    entry += line
                if '<{}'.format(CASE_LABEL) in line:
                    entry += line
                if '</{}'.format(CASE_LABEL) in line:
                    match = re.search(regex, entry)
                    if match:
                        name = match.group(1)
                        names.append(name)
                    else:
                        new_xml.write(entry)
                    entry = ''
                    continue
                if len(entry):
                   continue
            new_xml.write(line)

File called in script:

cat MetaData.txt
NORMALIZED_PRICE_REALTIME
STAMP_DUTY_FLAG_REALTIME

XML FILE:

         <ruleDefinition name="ProcessAllFields" category="subrule" defaultContext="Security">
               <case label="0xFBDE">
<!-- dec=64478, NORMALIZED_PRICE_REALTIME -->
<if>
 <or>
 <equal op1="$temp.updateAlways" op2="true"/>
 <equal op1="#NORMALIZED_PRICE_REALTIME" op2="0"/>
</or>
<then>
    <multiply op1="$inField.data" op2="$temp.pScale"      
    store="$NORMALIZED_PRICE_REALTIME" round="-3"/>
<appendField field="$NORMALIZED_PRICE_REALTIME"/>
   </then>
          </if>
  </case>
  <case label="0xFBDF">
 <!-- dec=64479, STAMP_DUTY_FLAG_REALTIME -->
  <if>
 <or>
         <equal op1="$temp.updateAlways" op2="true"/>
         <equal op1="#STAMP_DUTY_FLAG_REALTIME" op2="0"/>
 </or>
 <then>
         <assign to="$STAMP_DUTY_FLAG_REALTIME" from="$inField.data"/>
         <appendField field="$STAMP_DUTY_FLAG_REALTIME"/>
 </then>
           </if>
   </case>
    </ruleDefinition>

Basically it is deleted all data with the string 'case'

DESIRED RESULT
<--Removed NORMALIZED_PRICE_REALTIME, STAMP_DUTY_FLAG_REALTIME --/>

Super_Py_Me
  • 169
  • 1
  • 4
  • 14

1 Answers1

0

Unfortunately I need to parse XML with regex

That's a bit like saying that unfortunately you need to turn iron into gold. It can't be done, so don't waste your time. The reason it can't be done is that regular expressions are only powerful enough to parse regular languages (that's a technical term, not a value judgement), and XML is not a regular language. If you attempt it, then your code will have bugs: that's an inevitable consequence of computer science theory.

Michael Kay
  • 156,231
  • 11
  • 92
  • 164