Help with Regexp from a xml (Tcl)

Question

I have an XML file.

 <?xml version="1.0"?>
 <catalog>
    <book id="bk101">
    </book>
 </catalog>

I read the file and store it in file_data

 set data [split $file_data "\n"]
 foreach line $data {
 regexp { book id=\"(.*)\" } $line all dummy
 puts $all
 puts $dummy
 }

So here, as you can see, I am trying to read the book id and print it out, but I get an error saying that dummy is not found. Am I doing it wrong?

Edit

Weirdly when I try this :

set mydata {<book id="bk101"> testing the code }
puts $mydata

regexp {book id="(.*)"} $mydata all part
puts $all
puts $part

Output

<book id="bk101"> testing the code
book id="bk101"
bk101

I have no idea why the code at the top still gives an error.

Your problem is using regular expressions to try and parse XML. Instead of, you know, an XML parser. — Anon., Dec 08 '10 at 02:45
@Sii he's telling you this is not the right way of doing it, instead look around for an XML parser — Andreas Wong, Dec 08 '10 at 02:52
yeah I did go through the parser(TCLxml and tdom) but since this is just one xml file I feel that is an overkill at the moment and as I am also trying to get my head around "regexp" so thought why not try out a few things. — Sii, Dec 08 '10 at 02:55
note that the `-expanded` switch to the `regexp` command will ignore whitespace in the expression. good for writing readable regular expressions. http://www.tcl.tk/man/tcl8.5/TclCmd/regexp.htm — glenn jackman, Dec 08 '10 at 16:36
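As a rough illustration of the `-expanded` switch mentioned in the last comment, here is a minimal sketch. The sample line and the variable names (`line`, `all`, `id`) are made up for this example. One thing to watch: in expanded syntax the space between `book` and `id` is ignored as well, so it is written as `\s+` here.

```tcl
# Sample input line (assumed for illustration)
set line {   <book id="bk101">}

# With -expanded, whitespace in the pattern is ignored, so the
# pattern can be spaced out for readability. A real space in the
# input must be matched explicitly, e.g. with \s+
if {[regexp -expanded { book \s+ id = " ( [^"]* ) " } $line all id]} {
    puts $all
    puts $id
}
```

The capture uses `[^"]*` rather than `.*` so that it stops at the first closing quote; that is a choice made for this sketch, not something taken from the comment.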

score 3 · Answer 1 · edited May 23 '17 at 12:04

Don't do that (though that question is about XHTML, it is no worse than any other XML dialect in this respect; plain HTML is if anything worse). In short, XML belongs to a class of languages that REs cannot fully parse.

Instead, use tDOM to parse the XML, and XPath (supported by tDOM) to pick out the interesting parts of the document.

package require tdom

# Get the XML here by whatever method, and parse it here...
set doc [dom parse $file_data]

# Iterate over the books in the document and print their IDs
foreach book [$doc selectNodes "//book"] {
    puts "book with id=[$book @id]"
}

# Tidy up at the end...
$doc delete

Using tDOM to do XML handling is easy. It's actually easier than using REs, and it's correct too. Double win!

answered Dec 08 '10 at 11:02

Donal Fellows

While I generally agree with Donal's sentiment, it may be worth noting that the text being "searched" for here (and the input document) might be simple enough that a regular expression could handle. If all he's doing is pulling out that small snippet from the document, and willing to agree that pulling out that text ignores any context, then it might be good enough. – RHSeeger Dec 09 '10 at 08:27
@RHSeeger: Yes, except that I find that it's still easier to use tDOM in that situation. It's that good. – Donal Fellows Dec 09 '10 at 10:33
Also, it's typical that, as soon as a decision-maker receives a working application built on a RE which parses the one simple XML fragment originally requested, he's only about eighteen seconds away from asking why it doesn't work in this other case, and, before long, the coder has found out he's trying to write a full-blown XML parser, all without ever intending to do so. And, yes, tDOM *is* that good. – Cameron Laird Dec 18 '10 at 15:07
More on tDOM: http://web.archive.org/web/20021003214308/http://www-106.ibm.com/developerworks/xml/library/x-tdom.html – Cameron Laird Dec 19 '10 at 12:37
@DonalFellows, is there a way I can replace that id value with something else? I could not find tDOM helpers to write/replace data – SandBag_1996 Sep 29 '17 at 16:01

score 2 · Accepted Answer · answered Dec 08 '10 at 04:50

Spaces in an RE are significant, and you have put one at the start and one at the end of your RE, where the input has no space to match. If you want to parse XML, though, it is best to use tdom or TclXML.

You should also check that regexp returns a non-zero result (meaning it found a match); otherwise 'dummy' won't be set, or it will keep whatever value it had before.
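To illustrate the check described above, here is a small sketch (the input strings and the variable names `found`, `all` and `id` are made up for illustration). It shows regexp returning 0 on a line without a book tag and leaving the match variables untouched, so a value from an earlier match can linger.

```tcl
# Start from a clean slate so the example is self-contained
unset -nocomplain all id

# No match: regexp returns 0 and does not create the variables
set found [regexp {book id="(.*)"} {<catalog>} all id]
puts "found=$found exists=[info exists id]"

# Match: regexp returns 1 and sets both variables
set found [regexp {book id="(.*)"} {<book id="bk101">} all id]
puts "found=$found id=$id"

# No match again: id still holds the value from the previous match
set found [regexp {book id="(.*)"} {</book>} all id]
puts "found=$found id=$id"
```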

Bryan Oakley · Answer 3 · 2010-12-08T22:27:03.757

To answer your specific question, you have extra spaces in your regular expression. Look closely at this line of code:

regexp { book id=\"(.*)\" }

Notice the space before the word book. That is significant. You are asking regexp to find a sequence of characters that begins with a space, the literal word 'book', another space, etc. Your pattern doesn't match, in part because ' book' does not appear in the data.
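A quick way to see the difference, sketched with a sample line that is assumed to look like the one in the question (the variable names `withSpaces` and `withoutSpaces` are made up for this example):

```tcl
# A line as it might appear in the file (assumed sample data)
set line {    <book id="bk101">}

# Pattern from the question: the leading space must match a space
# immediately before "book", but the data has "<" there
set withSpaces [regexp { book id=\"(.*)\" } $line all id]

# Same pattern without the surrounding spaces
set withoutSpaces [regexp {book id="(.*)"} $line all id]

puts "with spaces: $withSpaces"
puts "without spaces: $withoutSpaces, id=$id"
```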

edited Dec 08 '10 at 22:27

answered Dec 08 '10 at 20:53

Bryan Oakley

score 0 · Answer 4 · answered Dec 08 '10 at 10:08

Two points:

1. If you are reading the data line by line, you need to check that regexp actually made a match before reading the variables.
2. Jeff is right: you have extra whitespace at the beginning and end of your regexp.


  set data [split $file_data "\n"] 
  foreach line $data {   
    if { [regexp {book id=\"(.*)\"} $line all dummy] } {
       puts $all
       puts $dummy   
    } 
  }

Another option to consider: if you can do without XML and you control the data file format, you can easily create a format that is readable both by humans and by Tcl, which makes your life much easier.

catalog {
  book {
    { id "bk101" }
  }
}

and so on. This is very easy to read as a Tcl list and to interpret in the program.
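As a minimal sketch of how such a file could be read, assuming its contents are already in a variable (called `config` here, a name made up for this example) and that the nesting is exactly as shown above:

```tcl
# Contents of the data file (assumed to have been read already)
set config {
catalog {
  book {
    { id "bk101" }
  }
}
}

# Walk the nested lists: tag/body pairs, then attribute pairs
foreach {tag books} $config {
    foreach {kind attrs} $books {
        foreach attr $attrs {
            lassign $attr name value
            puts "${tag}/${kind}: ${name}=${value}"
        }
    }
}
```

`lassign` needs Tcl 8.5 or later; on older versions `foreach {name value} $attr break` would be one way to do the same thing.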
