4

I am using Ruby's REXML parser to parse an XML file. On a 64-bit AIX box with 64-bit Ruby, I get the following error:

REXML::ParseException: #<REXML::ParseException: #<RegexpError: Stack overflow in 
regexp matcher: 
/^<((?>(?:[\w:][\-\w\d.]*:)?[\w:][\-\w\d.]*))\s*((?>\s+(?:[\w:][\-\w\d.]*:)?[\w:][\-\w\d.]*\s*=\s*(["']).*?\3)*)\s*(\/)?>/mu>

The call that triggers it looks something like this:

REXML::Document.new(File.open(actual_file_name, "r"))

Does anyone know how to solve this issue?

Niklas B.
Ricketyship

2 Answers

12

I've had several issues with REXML; it doesn't seem to be the most mature library. I usually use Nokogiri for XML parsing in Ruby, and it should be faster and more stable than REXML. After installing it with sudo gem install nokogiri, you can use something like this to get a DOM instance:

require 'nokogiri'

doc = Nokogiri.XML(File.open(actual_file_name, 'rb'))
# => #<Nokogiri::XML::Document:0xf1de34 name="document" [...] >

The documentation on the official webpage is also much better than that of REXML, IMHO.

Niklas B.
  • I want to know whether this is specific to 64-bit Ruby; the same issue is not reproducible on a 32-bit box. I would also prefer a workaround over installing another library. – Ricketyship Jan 10 '12 at 04:36
  • @Bharath: The better place to report this would be the Ruby bugtracker, then. – Niklas B. Jan 10 '12 at 04:42
  • @Bharath: It could also well be that your XML file is simply too large or too deeply nested. In that case, maybe you should try an XML stream parser. – Niklas B. Jan 10 '12 at 04:45
  • Then why does the same file parse fine in 32-bit Ruby? Ideally, 64-bit Ruby should be able to do everything 32-bit Ruby can. Does 64-bit Ruby have some limitations? If so, is there any documentation for them? – Ricketyship Jan 10 '12 at 04:54
  • @Bharath: No, I rather think it's a bug. As I said, this wouldn't surprise me; most people have been using Nokogiri instead of REXML for some time now. – Niklas B. Jan 10 '12 at 04:58
  • The file has only two levels of nesting and is only around 54 KB, so I'm not really sure what is causing this issue. – Ricketyship Jan 10 '12 at 04:59
  • But I have a sneaking suspicion that the problem is specific to the AIX box. On the AIX box, I found the following line of code causing an issue: word_wrap(decode_field(column[:column_name], @name), line_width = 800) – Ricketyship Jan 10 '12 at 05:11
  • The following error was being thrown: ActionView::TemplateError: too big quantifier in {,}: /(.{1,800})(\s+|$)/ ... When I dug into the Ruby code, I found that word_wrap uses a regex match, and that match was failing. The same issue persisted even in irb. Finally I had to reduce the repetition count in the {,} quantifier, and then it worked. Most importantly, this was happening on a 32-bit AIX box!!! So obviously there is some restriction somewhere. – Ricketyship Jan 10 '12 at 05:15
6

I almost immediately found the answer.

The first thing I did was search the Ruby source code for the error being thrown. I found that regex.h was responsible for this.

regex.h contains something like this:

/* Maximum number of duplicates an interval can allow.  */
#ifndef RE_DUP_MAX
#define RE_DUP_MAX  ((1 << 15) - 1)
#endif

Now the problem here is RE_DUP_MAX. On the AIX box, the same constant is already defined somewhere under /usr/include. I searched for it and found it in:

/usr/include/NLregexp.h
/usr/include/sys/limits.h
/usr/include/unistd.h

I am not sure which of the three is being used (most probably NLregexp.h). In these headers, the value of RE_DUP_MAX is set to 255! So there is a cap on the number of repetitions a regex interval can specify!
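A quick way to check whether a given Ruby build is affected is to compile a pattern with a large interval and see whether the regex engine rejects it. The following is only a sketch: the helper name interval_supported? is made up for illustration, and the pattern is the word_wrap one from the error quoted in the comments. On a build limited to 255, I would expect the larger values to report false, but I have not confirmed the exact boundary.

```ruby
# Hypothetical helper (not part of Ruby or Rails): returns true if this
# Ruby build can compile a bounded repetition with the given upper limit,
# and false if the regex engine raises RegexpError while compiling it
# (for example "too big quantifier in {,}").
def interval_supported?(max)
  Regexp.new("(.{1,#{max}})(\\s+|$)")
  true
rescue RegexpError
  false
end

# Probe a few limits around the suspected cap of 255.
[255, 256, 800].each do |n|
  puts "{1,#{n}} supported: #{interval_supported?(n)}"
end
```

Running this in irb on both the 32-bit and the AIX build should show whether the two interpreters were compiled with different limits.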

In short, because of the #ifndef guard, the compilation picks up the system-defined value rather than the one we define in regex.h!
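If rebuilding Ruby with the larger limit is not an option, one possible workaround is to rewrite a large interval so that no single quantifier exceeds the cap. The sketch below is an assumption on my part, not something taken from the Ruby source: chunked_interval is a hypothetical helper, it defaults the cap to 255, and it is untested on an affected AIX build. It emits alternatives ordered longest first, so the preference for the longest match should stay the same as with the original greedy quantifier.

```ruby
# Hypothetical helper: rewrites atom{min,max} so that no single interval
# exceeds `cap`. `atom` must be a single regex atom such as "." or a group.
# Each alternative is a run of full-size chunks followed by one bounded
# tail, so every total length has exactly one way to match.
def chunked_interval(atom, min, max, cap = 255)
  unless min.between?(0, cap) && min <= max
    raise ArgumentError, "need 0 <= min <= max and min <= cap"
  end

  alternatives = []
  full = 0 # number of full-size chunks in front of the tail
  while full * cap < max
    lo = full.zero? ? min : 1
    hi = [cap, max - full * cap].min
    alternatives << "#{atom}{#{cap}}" * full + "#{atom}{#{lo},#{hi}}"
    full += 1
  end

  # Longest alternatives first, mirroring greedy matching.
  "(?:#{alternatives.reverse.join('|')})"
end

# Build a replacement for the word_wrap pattern /(.{1,800})(\s+|$)/.
pattern = Regexp.new("(#{chunked_interval('.', 1, 800)})(\\s+|$)")
puts pattern.source
```

The alternation looks clumsier than simply concatenating several optional chunks, but concatenated chunks can split the same text in many different ways, which risks heavy backtracking on long runs without whitespace.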

This also answers a question I asked recently: Regex limit in ruby 64 bit aix compilation

I was not able to answer it immediately, as I need a minimum of 100 reputation :D :D Cheers!

Ricketyship