Using regex to find keyword in http response

Question

I asked a similar question earlier for which Nokogiri was recommended as a solution. I've used Nokogiri and it certainly works fine.

But for certain reasons, I must use regex to extract a keyword from an HTTP response body.

Format of the keyword is as follows:

<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>

Here, Date is a dynamic variable, and I need to extract 'TestExample [Date]' from the HTTP response body. Also, <title> can be lower or upper case.

Assuming 'response' holds the HTTP response, I have tried the following:

>> response
=> "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

Then make a regex to search:

>> regex
=> /<title>TestExample (.*?)<\/title>/mi

response[regex] returns no results, and neither do response.match(regex) or response.scan(regex).

How can I do this task using regex?

Update:

For this task, this regex works fine:

response.match(/<title>(.*)<\/title>/mi).captures.first

I guess it is a typo, update "/title/mi" to "/title>TestExample (.*?)<\/title>/mi" — Thiago Lewin, Jun 10 '13 at 19:38
I'm lost. Why can't you use nokogiri to get the contents of `<title>`, then regex search the contents? — 000, Jun 10 '13 at 19:39
@tlewin Yes, that was a typo. Thanks for noticing. I've been staring at the screen for too long. :) — Sunshine, Jun 10 '13 at 19:58
@JoeFrambach I know that's pretty straight forward. I have used Nokogiri in other tasks. But here I must use regex only. It's an ask from certain folks. — Sunshine, Jun 10 '13 at 20:01
Those "certain folks" shouldn't tell you how to write code then, because their method is wrong. — the Tin Man, Jun 10 '13 at 20:02
Guys I know you are correct. Just don't shoot the messenger. Their reasons are security related, but remember I am not an expert like you to have a debate over their call. — Sunshine, Jun 10 '13 at 20:08
`security related`!? Just show them this page: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1 — the Tin Man, Jun 10 '13 at 20:09

score 3 · Accepted Answer · answered Jun 10 '13 at 19:44

As other people said, Regex is not the way to go. If you're really bound to using Regexes (not just being too lazy to refactor?), this should do the trick:

response.match(/<title>(.*)<\/title>/mi).captures.first
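A minimal sketch of how the one-liner above might be guarded, since `match` returns nil when nothing matches and calling `captures` on nil raises NoMethodError. The helper name `extract_title` is hypothetical and not part of the original answer; the sample body is the one from the question.

```ruby
# Sample body copied from the question.
response = "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

# Hypothetical helper (an assumption for illustration): returns the text
# between <title> and </title>, or nil when the body has no title element.
def extract_title(body)
  match = body.match(/<title>(.*)<\/title>/mi)
  # captures is the array of parenthesized groups; first is group 1.
  match && match.captures.first
end

title   = extract_title(response)
missing = extract_title("<html></html>")

p title   # => "TestExample [Date]"
p missing # => nil
```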

Patrick Oscity

Thanks for the answer. It sure works. :) ... I am noobie but not too lazy. :) ... as I mentioned earlier, I have used Nokogiri for other tasks but for this one, I must use regex only. Could you please tell about captures.first? – Sunshine Jun 10 '13 at 20:03
Ok then, no offense, I was just trying to find out ;-) I'm just curious: Why can't you use Nokogiri? – Patrick Oscity Jun 10 '13 at 20:07
`captures` gives you the capture groups, i.e. the parts of the regex enclosed in parentheses `(...)` as an array. `first` will give you the first element of the array. – Patrick Oscity Jun 10 '13 at 20:08
I've seen HTML with duplicated and multiple `<title>` tags, which would cause this to behave badly, especially when the `<head>` block was repeated after the body. – the Tin Man Jun 10 '13 at 20:30
@theTinMan I hear ya. Let me try to share some more info on this regex use. There is a web app running on a device. It has a welcome page where the user first lands after logging in. This page is pretty minimal in content and has certain standard keywords. The regex is meant to extract and match these keywords. It is not a frequently updated app, the target page has preset content, and most (dynamic) functionality lies on separate pages. Do you still see any issue with using regex? – Sunshine Jun 10 '13 at 20:39
@Sunshine the problem is not how complex your specific HTML is, regular expressions just aren't the right tool for this because in general **they cannot parse HTML**. Regular expressions can only parse regular languages and HTML is not such a language. [Here's a fun read about this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Patrick Oscity Jun 10 '13 at 20:58
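The duplicated-tag concern raised in the comments above can be reproduced with a small sketch. The `doubled` body below is a made-up example (an assumption, not taken from the thread) in which a second head block follows the body. It compares the greedy `(.*)` from the answer with a lazy `(.*?)`, and uses `scan` to list every title found.

```ruby
# Hypothetical body: a second <head>/<title> appears after the body.
doubled = "<html><head><title>First</title></head>" \
          "<body>text</body><head><title>Second</title></head></html>"

# Greedy: .* runs to the LAST </title>, swallowing the markup in between.
greedy = doubled.match(/<title>(.*)<\/title>/mi).captures.first

# Lazy: .*? stops at the first </title> it can reach.
lazy = doubled.match(/<title>(.*?)<\/title>/mi).captures.first

# scan returns one array per match; flatten yields the captured strings.
all_titles = doubled.scan(/<title>(.*?)<\/title>/mi).flatten

p greedy     # => "First</title></head><body>text</body><head><title>Second"
p lazy       # => "First"
p all_titles # => ["First", "Second"]
```

Even the lazy version only picks the first candidate; deciding which title is the real one still needs knowledge of the document structure, which is the parser's job.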

score 2 · Answer 2 · answered Jun 10 '13 at 20:07

The correct way to handle this IS using a parser. Nokogiri will handle every requirement you stated, without breaking because of case differences or a difference in date.

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>
EOT
doc.at('title').text
=> "TestExample [Date]"

doc = Nokogiri::HTML(<<EOT)
<HTML>
<HEAD> <TITLE>TestExample [1/1/2000]</TITLE></HEAD>
</HTML>
EOT
doc.at('title').text
=> "TestExample [1/1/2000]"

doc = Nokogiri::HTML(<<EOT)
<HTML>
<HEAD> <TiTlE>TestExample [Jan. 1, 2000]</tItLe></HEAD>
</HTML>
EOT
doc.at('title').text
=> "TestExample [Jan. 1, 2000]"

doc.title
=> "TestExample [Jan. 1, 2000]"

just curious, can I locate the keyword if it is *anywhere* in the http response? — Sunshine, Jun 10 '13 at 20:42
The keyword? You mean tag? If it's in the parsed HTML body, yes. More importantly, it won't be fooled if `"<title>"` is in text somewhere, unlike a regex which would have a very hard time telling. — the Tin Man, Jun 10 '13 at 21:47

Casimir et Hippolyte · Answer 3 · 2013-06-10T21:21:44.960

You can try with this pattern too:

/(?<=<title>)[^<]++/i

[^<] means all characters but < (character class)
[^<]+ means 1 or more characters from this class
[^<]++ means 1 or more characters from this class, and be possessive

A possessive quantifier tells the regex engine not to backtrack into what it has already matched, which can improve performance.

example:

response.match(/(?<=<title>)[^<]++/i)

The idea is to avoid the dot and replace it with a character class that excludes <.

Note that the result is the whole match, so there is no need for a capture group here and no need to test what comes after. I removed the m modifier (which makes the dot match newlines, i.e. DOTALL) because I don't use the dot.

I only check with a lookbehind that <title> comes immediately before.
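A short sketch of this pattern applied to the sample body from the question, assuming Ruby 1.9 or later (the comments below note that lookbehind is not available in 1.8). Because the match itself is the title text, `String#[]` with a regex is enough; the nil guard on `match` is an addition for illustration.

```ruby
# Sample body copied from the question.
response = "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

# Lookbehind for <title>, then one or more non-< characters, possessively.
pattern = /(?<=<title>)[^<]++/i

# String#[] with a regex returns the whole match, or nil if there is none.
title = response[pattern]

# Equivalent with match: index 0 of the MatchData is the whole match.
md    = response.match(pattern)
whole = md && md[0]

# No <title> in the input, so the lookbehind never succeeds.
none = "<html></html>"[pattern]

p title # => "TestExample [Date]"
p whole # => "TestExample [Date]"
p none  # => nil
```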

edited Jun 10 '13 at 21:21

answered Jun 10 '13 at 20:09

Casimir et Hippolyte

What is `[^<]++`? That's not part of Ruby's Regexp engine. – the Tin Man Jun 10 '13 at 20:28
>> response.match(/(?<=<title>)[^<]++/i) SyntaxError: compile error (irb):3: undefined (?...) sequence: /(?<=<title>)[^<]++/ from (irb):3 – Sunshine Jun 10 '13 at 20:28
@Sunshine: Lookbehinds are only available from Ruby 1.9 onward – Casimir et Hippolyte Jun 10 '13 at 20:36
@theTinMan: The Regexp engine supports possessive quantifiers. – Casimir et Hippolyte Jun 10 '13 at 21:20
Ah, you're right. I searched the doc for `++` but, of course, they don't have an example, and the text describing it only shows a single `+`: `A quantifier followed by + matches possessively: once it has matched it does not backtrack. They behave like greedy quantifiers, but having matched they refuse to “give up” their match even if this jeopardises the overall match.` – the Tin Man Jun 10 '13 at 21:50