2

I often find myself needing a tool that would allow me to:

search for multiple multi-line regex patterns in a large file and replace them using back-referencing.

Should I:

  1. take the 2 hours it'll require to build myself such a tool
  2. use something someone has already built (please suggest)
  3. learn to use a language that's particularly good at this type of thing (Perl?)

Example
I have an XML document containing thousands of entries. About 100 of those entries have a known value field and need to be removed. I can build a regular expression for each entry; the expression is the same for all 100 entries except for the value string. The tool would either need to loop through the file once per value, or make a single pass with 100 OR terms (|) in the expression (it would be huge). In this case I'm replacing the matches with a blank, but in other cases I'd reformat the text and re-insert the value field.

Chad Birch
srmark

4 Answers

2

I reckon you should write the thing in Python. The Python re library is great:

# get the re library
import re

# this is the line to process
xml_line = '<stuff><bad i_am_naughty="True"></bad></stuff>'
# compile a regex that captures the text before, inside, and after the <bad> element
exp = re.compile(r"(.*)(<bad.*bad>)(.*)")
# run the regex on the line
match = exp.search(xml_line)
# print out the groups the regex found
print(match.groups())
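
For the case in the question (many known values, entries spanning several lines, optional back-references), a rough sketch might look like the following. The `<entry>`/`<value>` layout and the function names `remove_entries` and `reformat_entries` are assumptions made for illustration, since the real document structure isn't shown. The values are escaped and joined into a single alternation, and `\s*` lets a match span line breaks.

```python
import re


def _alternation(values):
    # Escape each value so characters like "." or "+" are matched literally,
    # then join them into one "a|b|c" alternation.
    return "|".join(re.escape(v) for v in values)


def remove_entries(text, unwanted_values):
    """Delete every (hypothetical) <entry> whose <value> is in unwanted_values."""
    pattern = re.compile(
        r"<entry>\s*<value>(?:%s)</value>\s*</entry>\s*" % _alternation(unwanted_values)
    )
    return pattern.sub("", text)


def reformat_entries(text, values):
    """Rewrite matching entries, re-inserting the value via the \\1 back-reference."""
    pattern = re.compile(
        r"<entry>\s*<value>(%s)</value>\s*</entry>" % _alternation(values)
    )
    return pattern.sub(r'<entry value="\1"/>', text)


if __name__ == "__main__":
    sample = (
        "<entries>\n"
        "<entry>\n  <value>bad1</value>\n</entry>\n"
        "<entry>\n  <value>keep</value>\n</entry>\n"
        "</entries>"
    )
    print(remove_entries(sample, ["bad1", "bad2"]))
    print(reformat_entries(sample, ["keep"]))
```

With 100 values the compiled pattern is long, but it is built programmatically and the file is still scanned only once.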

N.B. You could also use Python's XML parsing libraries to strip out the elements you don't want. Using an XML parser handles some of the complexity that I have ignored in my example (multiple lines, etc.). In lieu of a Python XML parsing example, this question has some good answers about parsing XML in Python.
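
As a minimal sketch of the parser-based route, assuming the same hypothetical `<entry>`/`<value>` layout (the question doesn't show the real one), the standard library's `xml.etree.ElementTree` can drop the unwanted entries. The name `strip_entries` is made up for this example.

```python
import xml.etree.ElementTree as ET


def strip_entries(xml_text, unwanted_values):
    """Remove each <entry> whose <value> child text is in unwanted_values."""
    root = ET.fromstring(xml_text)
    unwanted = set(unwanted_values)
    # Snapshot the elements first so removing children doesn't disturb iteration.
    for parent in list(root.iter()):
        for entry in parent.findall("entry"):
            if entry.findtext("value") in unwanted:
                parent.remove(entry)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    doc = (
        "<entries>"
        "<entry><value>bad1</value></entry>"
        "<entry><value>keep</value></entry>"
        "</entries>"
    )
    print(strip_entries(doc, ["bad1", "bad2"]))
```

Because the lookup is a set membership test rather than a pattern, the list of 100 values needs no escaping and line breaks inside an entry don't matter.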

Community
RedBlueThing
1

I am not quite sure what your data looks like, but I would consider writing the tool in Python in three passes:

  1. convert the file of XML path plus variable = value to lines of XML.path.variable=value
  2. apply the massive regex to each line, possibly deleting the line from the output
  3. convert the shortened list of XML.path.variable=value lines back to XML
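
The three passes above could be sketched roughly as follows. This is only one interpretation: the `tag[index]` path format, the `<entry>`/`<value>` layout, and the helper names `flatten`, `drop_entries`, and `rebuild` are all assumptions, and the sketch ignores attributes, mixed content, and values containing newlines.

```python
import re
import xml.etree.ElementTree as ET


def flatten(elem, path=None):
    """Pass 1: yield one 'path=value' line per leaf element."""
    path = path or elem.tag
    children = list(elem)
    if not children:
        yield f"{path}={elem.text or ''}"
    for i, child in enumerate(children):
        yield from flatten(child, f"{path}.{child.tag}[{i}]")


def drop_entries(lines, unwanted_values):
    """Pass 2: one big alternation finds bad values; drop every line of those entries."""
    alternation = "|".join(re.escape(v) for v in unwanted_values)
    bad = re.compile(r"^(.*\.entry\[\d+\])\.value\[\d+\]=(?:%s)$" % alternation)
    lines = list(lines)
    bad_prefixes = {m.group(1) + "." for m in map(bad.match, lines) if m}
    return [ln for ln in lines if not any(ln.startswith(p) for p in bad_prefixes)]


def rebuild(lines):
    """Pass 3: turn the remaining 'path=value' lines back into an element tree."""
    root, nodes = None, {}
    for line in lines:
        path, _, value = line.partition("=")
        parts = path.split(".")
        if root is None:
            root = ET.Element(parts[0])
            nodes[parts[0]] = root
        prefix, parent = parts[0], root
        for part in parts[1:]:
            prefix = f"{prefix}.{part}"
            if prefix not in nodes:
                # Strip the "[index]" suffix to recover the tag name.
                nodes[prefix] = ET.SubElement(parent, part.split("[", 1)[0])
            parent = nodes[prefix]
        parent.text = value
    return root


if __name__ == "__main__":
    doc = (
        "<entries>"
        "<entry><name>a</name><value>keep</value></entry>"
        "<entry><name>b</name><value>bad1</value></entry>"
        "</entries>"
    )
    kept = drop_entries(flatten(ET.fromstring(doc)), ["bad1", "bad2"])
    print(ET.tostring(rebuild(kept), encoding="unicode"))
```

Flattening first means each regex only ever sees a single line, which sidesteps the multi-line matching problem at the cost of two conversion passes.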
hughdbrown
0

I would suggest not using regular expressions. XML should usually be handled with XML tools. Why not just use XSLT?

Daniel Brückner
  • I think he wants to use regexes to make the expression for choosing one of the values easy to construct. The alternative in XSLT would have about 100 matching templates, right? – hughdbrown Apr 02 '09 at 01:33
0

There are a large number of modules that could be used to handle XML.

http://www.crummy.com/software/BeautifulSoup/
http://codespeak.net/lxml/