I asked a similar question earlier for which Nokogiri was recommended as a solution. I've used Nokogiri and it certainly works fine.

But for certain reasons, I must use a regex to extract a keyword from an HTTP response body.

Format of the keyword is as follows:

<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>

Here, Date is a dynamic variable, and I need to extract 'TestExample [Date]' from the HTTP response body. Also, <title> can be lower or upper case.

Assuming 'response' holds the HTTP response, I have tried the following:

>> response
=> "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

Then make a regex to search:

>> regex
=> /<title>TestExample (.*?)<\/title>/mi

When I do a response[regex] there are no results. No results with response.match(regex) and response.scan(regex).

How can I do this task using regex?


Update:

For this task, this regex works fine:

response.match(/<title>(.*)<\/title>/mi).captures.first
the Tin Man
Sunshine
  • I guess it is a typo, update "/title/mi" to "/title>TestExample (.*?)<\/title>/mi" – Thiago Lewin Jun 10 '13 at 19:38
  • I'm lost. Why can't you use Nokogiri to get the contents of `<title>`, then regex search the contents? – 000 Jun 10 '13 at 19:39
  • @tlewin Yes, that was a typo. Thanks for noticing. I've been staring at the screen for too long. :) – Sunshine Jun 10 '13 at 19:58
  • @JoeFrambach I know that's pretty straightforward. I have used Nokogiri in other tasks. But here I must use regex only. It's an ask from certain folks. – Sunshine Jun 10 '13 at 20:01
  • do you have a *real* reason? – 000 Jun 10 '13 at 20:02
  • Those "certain folks" shouldn't tell you how to write code then, because their method is wrong. – the Tin Man Jun 10 '13 at 20:02
  • Guys I know you are correct. Just don't shoot the messenger. Their reasons are security related, but remember I am not an expert like you to have a debate over their call. – Sunshine Jun 10 '13 at 20:08
  • `security related`!? Just show them this page: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 – the Tin Man Jun 10 '13 at 20:09

3 Answers

As other people said, regex is not the way to go. If you're really bound to using a regex (and not just too lazy to refactor?), this should do the trick:

response.match(/<title>(.*)<\/title>/mi).captures.first
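One caveat with the line above: match returns nil when the body contains no title, so calling captures on the result raises NoMethodError. The following is a minimal sketch of a guarded version, not a definitive implementation. The helper name extract_title is hypothetical, and the sample body is copied from the question. It also swaps in a non-greedy (.*?) so the match stops at the first closing tag rather than the last one.

```ruby
# Sample body copied from the question.
response = "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

# extract_title is a hypothetical helper name used only for illustration.
def extract_title(body)
  # /i makes the tag name case-insensitive; /m lets the dot span newlines.
  # (.*?) is non-greedy, so it stops at the first </title>.
  match = body.match(/<title>(.*?)<\/title>/mi)
  # match is nil when nothing matched, so guard before asking for captures.
  match && match.captures.first
end

title   = extract_title(response)
missing = extract_title("<html><body>no title here</body></html>")

# String#[] with a regex and a group index does the same thing in one step
# and also returns nil when there is no match.
shortcut = response[/<title>(.*?)<\/title>/mi, 1]
```

Here title and shortcut both hold the captured text, and missing is nil instead of an exception.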
Patrick Oscity
  • Thanks for the answer. It sure works. :) ... I am noobie but not too lazy. :) ... as I mentioned earlier, I have used Nokogiri for other tasks but for this one, I must use regex only. Could you please tell about captures.first? – Sunshine Jun 10 '13 at 20:03
  • OK then, no offense, I was just trying to find out ;-) I'm just curious: why can't you use Nokogiri? – Patrick Oscity Jun 10 '13 at 20:07
  • `captures` gives you the capture groups, i.e. the parts of the regex enclosed in parentheses `(...)` as an array. `first` will give you the first element of the array. – Patrick Oscity Jun 10 '13 at 20:08
  • I've seen HTML with duplicated and multiple `<title>` tags, which would cause this to behave badly, especially when the `<head>` block was repeated after the body. – the Tin Man Jun 10 '13 at 20:30
  • @theTinMan I hear ya. Let me try to share some more info on this regex use. There is a web app running on a device. It's got a welcome page where user first lands in post log in. This url is pretty minimal in content & has certain standard keywords. The regex is to extract & match these keyword. It is not a frequently updated app, the target page has a pre-set content & most (dynamic) functionality lies on separate pages. Do you still see any issue with using regex? – Sunshine Jun 10 '13 at 20:39
  • @Sunshine the problem is not how complex your specific HTML is, regular expressions just aren't the right tool for this because in general **they cannot parse HTML**. Regular expressions can only parse regular languages and HTML is not such a language. [Here's a fun read about this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Patrick Oscity Jun 10 '13 at 20:58

The correct way to handle this IS using a parser. Nokogiri will handle every requirement you stated, without breaking because of case differences or a difference in date.

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>
EOT
doc.at('title').text
=> "TestExample [Date]"

doc = Nokogiri::HTML(<<EOT)
<HTML>
<HEAD> <TITLE>TestExample [1/1/2000]</TITLE></HEAD>
</HTML>
EOT
doc.at('title').text
=> "TestExample [1/1/2000]"

doc = Nokogiri::HTML(<<EOT)
<HTML>
<HEAD> <TiTlE>TestExample [Jan. 1, 2000]</tItLe></HEAD>
</HTML>
EOT
doc.at('title').text
=> "TestExample [Jan. 1, 2000]"

doc.title
=> "TestExample [Jan. 1, 2000]"
the Tin Man
  • just curious, can I locate the keyword if it is *anywhere* in the http response? – Sunshine Jun 10 '13 at 20:42
  • The keyword? You mean tag? If it's in the parsed HTML body, yes. More importantly, it won't be fooled if `"<title>"` is in text somewhere, unlike a regex which would have a very hard time telling. – the Tin Man Jun 10 '13 at 21:47

You can try with this pattern too:

/(?<=<title>)[^<]++/i

[^<] means all characters but < (character class)
[^<]+ means 1 or more characters from this class
[^<]++ means 1 or more characters from this class, and be possessive

A possessive quantifier tells the regex engine that it doesn't need to backtrack, so performance is better.

example:

response.match(/(?<=<title>)[^<]++/i)

The idea is to avoid the dot and replace it with a character class that excludes <.

Note that the result is the whole match, so there is no need for a capture group here and no need to test what comes after. I removed the m modifier (which stands for DOTALL) because I don't use the dot.

I just check with a lookbehind that <title> comes before.
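As a rough sketch of how this pattern can be used, assuming Ruby 1.9 or later (earlier versions do not support lookbehind) and the sample body from the question. The variable names are only illustrative. Since match returns a MatchData object rather than a string, the sketch reads the whole match with [0], or uses String#[] to get the substring directly.

```ruby
# Sample body copied from the question.
response = "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

# (?<=<title>) is a lookbehind: "<title>" must precede the match but is not part of it.
# [^<]++ possessively consumes every character up to the next "<".
# /i makes the tag name case-insensitive.
title_regex = /(?<=<title>)[^<]++/i

# String#[] returns the matched substring, or nil when there is no match.
title = response[title_regex]

# match returns a MatchData (or nil); index 0 is the whole match.
match_data  = response.match(title_regex)
whole_match = match_data && match_data[0]

# A body without a title tag yields nil rather than raising.
no_title = "<html><body>plain</body></html>"[title_regex]
```

Both title and whole_match hold the text between the tags, without needing a capture group.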

Casimir et Hippolyte
  • What is `[^<]++`? That's not part of Ruby's Regexp engine. – the Tin Man Jun 10 '13 at 20:28
  • >> response.match(/(?<=<title>)[^<]++/i) SyntaxError: compile error (irb):3: undefined (?...) sequence: /(?<=<title>)[^<]++/ from (irb):3 – Sunshine Jun 10 '13 at 20:28
  • @Sunshine: Lookbehinds are only available in Ruby 1.9 and later. – Casimir et Hippolyte Jun 10 '13 at 20:36
  • @theTinMan: The Regexp engine supports possessive quantifiers. – Casimir et Hippolyte Jun 10 '13 at 21:20
  • Ah, you're right. I searched the doc for `++` but, of course, they don't have an example, and the text describing it only shows a single `+`: `A quantifier followed by + matches possessively: once it has matched it does not backtrack. They behave like greedy quantifiers, but having matched they refuse to “give up” their match even if this jeopardises the overall match.` – the Tin Man Jun 10 '13 at 21:50