I'm trying to use this regex in Ruby: `<div class="ms3">(\n.*?)+<`. As soon as I add the last character, `<`, it stops matching altogether. I've tested it in Rubular and the regex works perfectly fine there. I'm writing my code in RubyMine, but I also ran it from PowerShell and got the same result, with no error message. When I run `<div class="ms3">(\n.*?)+` it prints `<div class="ms3">`, which is exactly what I'm looking for, but as soon as I add the `<` it prints nothing.

My code:

#!/usr/bin/ruby
# encoding: utf-8

File.open('ms3.txt', 'w') do |fo|
  fo.puts File.foreach('input.txt').grep(/<div class="ms3">(\n.*?)+/)
end

Some of what I'm searching through:

  <div class="ms3">
    <span xml:lang="zxx"><span xml:lang="zxx">Still the tone of the remainder of the chapter is bleak. The</span> <span class="See_In_Glossary" xml:lang="zxx">DAY OF THE <span class="Name_Of_God" xml:lang="zxx">LORD</span></span> <span xml:lang="zxx">holds no hope for deliverance (5.16–18); the futility of offering sacrifices unmatched by common justice is once more underlined, and exile seems certain (5.21–27).</span></span>
  </div>

  <div class="Paragraph">
    <span class="Verse_Number" id="idAMO_5_1" xml:lang="zxx">1</span><span class="scrText">Listen, people of Israel, to this funeral song which I sing over you:</span>
  </div>

  <div class="Stanza_Break"></div>

The full regex I need is `<div class="ms3">(\n.*?)+<\/div>`. It picks up the first section and nothing else.

Rebs
  • 3
    [Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) – Biffen Nov 25 '14 at 12:44
  • 1
    Besides [Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066), you omit the **multiline** modifier: `...grep(/.../m)`. – Aleksei Matiushkin Nov 25 '14 at 12:52
  • I enjoyed that rant and shall keep it in mind. However, I'm not 100% sure that what I'm doing is parsing (if I understand what parsing is correctly). All I want to do is extract certain bits of text containing HTML from one txt file to another, that's it. Unless that's exactly what parsing is? I'm not trying to process the HTML or modify it; the end result will work without regex in another program altogether. – Rebs Nov 25 '14 at 13:25
  • @RebekahParsons - if you know _exactly_ what you have, you can get away with Regex when extracting parts of an HTML, but it is very easy to break - imagine for example: `
    This is some text with
    sub div
    in the middle
    `
    – Uri Agassi Nov 25 '14 at 13:36
  • @UriAgassi Ah, I see what you mean; yes, that could be very dodgy. Thankfully I know exactly what I need and have different versions of the search to cover the bits that don't match. But that's a good point, so what would you use to do this as part of a program otherwise? – Rebs Nov 25 '14 at 13:39
  • When dealing with HTML, you should consider using an HTML parser, like http://nokogiri.org – Uri Agassi Nov 25 '14 at 13:44
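
The fragility described in the comments can be sketched in plain Ruby. The markup of the original example did not survive, so the `nested` string below is a hypothetical input made up for illustration, and the simplified lazy pattern is an assumption rather than the asker's exact regex:

```ruby
# Hypothetical input: an ms3 div that contains another div.
nested = '<div class="ms3">This is some text with <div>sub div</div> in the middle</div>'

# A lazy match stops at the FIRST closing tag it meets,
# which here belongs to the inner div, not to the ms3 div.
found = nested[/<div class="ms3">.*?<\/div>/m]

puts found
# The tail " in the middle</div>" is left out of the match.
puts found == nested   # prints false
```

A regex has no notion of which closing tag pairs with which opening tag, which is the reason an HTML parser is suggested for anything beyond tightly controlled input.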

1 Answer


Your problem starts with File.foreach('input.txt'), which splits the input into lines. The pattern is then matched against each line separately, and none of them can match: every line ends at its \n, so nothing ever follows the newline for the rest of the pattern to consume.
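
A minimal sketch of that difference, using an in-memory string in place of input.txt so it runs on its own. The `html` sample is a shortened, hypothetical stand-in for the real file, and `each_line` plays the role of `File.foreach`:

```ruby
# Hypothetical stand-in for the contents of input.txt.
html = <<~HTML
  <div class="ms3">
    <span>Still the tone of the remainder of the chapter is bleak.</span>
  </div>
HTML

pattern = /<div class="ms3">(\n.*?)+<\/div>/

# Line by line, as File.foreach would yield it: each chunk ends at its
# own "\n", so the closing tag is never in the same chunk as the opening tag.
line_matches = html.each_line.grep(pattern)

# Whole text at once, as File.read would return it.
block_match = html.match(pattern)

puts line_matches.length   # prints 0
puts block_match[0]        # prints the three-line ms3 block
```

This also explains why the shorter pattern without the trailing `<` appeared to work: the opening tag and its newline sit on a single line, so that line alone satisfies it.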

You should have better luck reading the whole text as a block, and using match on it:

File.read('input.txt').match(/<div class="ms3">(\n.*?)+<\/div>/)
# => #<MatchData "<div class=\"ms3\">\n    <span xml:lang=\"zxx\">
# => <span xml:lang=\"zxx\">Still the tone of the remainder of the chapter is bleak. The</span> 
# => <span class=\"See_In_Glossary\" xml:lang=\"zxx\">DAY OF THE 
# => <span class=\"Name_Of_God\" xml:lang=\"zxx\">LORD</span></span> 
# => <span xml:lang=\"zxx\">holds no hope for deliverance (5.16–18); 
# => the futility of offering sacrifices unmatched by common justice is once more 
# => underlined, and exile seems certain (5.21–27).</span></span>\n  </div>" 1:"\n  ">
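
Note that `match` returns only the first occurrence. If every `ms3` block in the file is wanted, `String#scan` is one option. The sketch below assumes that is the goal, uses a made-up sample in place of input.txt, and swaps the original pattern for a simpler lazy one with the `m` modifier mentioned in the comments:

```ruby
# Hypothetical stand-in for the contents of input.txt.
html = <<~HTML
  <div class="ms3">
    <span>first intro</span>
  </div>

  <div class="Paragraph">
    <span>verse text</span>
  </div>

  <div class="ms3">
    <span>second intro</span>
  </div>
HTML

# With /m the dot also matches newlines; the lazy .*? stops each match
# at the first </div>. The pattern has no capture groups, so scan
# returns the full matched strings.
blocks = html.scan(/<div class="ms3">.*?<\/div>/m)

puts blocks.length   # prints 2
puts blocks
```

Writing the result out could then be done with something like `File.write('ms3.txt', blocks.join("\n"))`. This still assumes the `ms3` divs never contain another `div`.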
Uri Agassi