0

I am trying to search for a phrase like this in an HTTP response body:

>> myvar1
<HTML>
<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>
</HTML>

When I do this, I do not get any result:

>> myvar1.scan(/<HEAD> <TITLE>TestExample [Date]<\/TITLE><\/HEAD>/)
[]

Here, [Date] is a dynamic variable that gets its value via loop iteration.

What should I add/change in the regex?


I am using Nokogiri to scan for a keyword in the HTTP response body.

the Tin Man
Sunshine
  • **Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See the nokogiri response below. – Andy Lester May 21 '13 at 04:51
  • @Andy Lester Thnx for the heads up. – Sunshine May 21 '13 at 05:52

3 Answers

5

Please do not parse markup like HTML with regular expressions. It is much more maintainable to feed it into a proper SAX or DOM parser and extract what you want that way. No matter how cleverly you formulate your regex, there will always be corner cases you forgot.

require 'nokogiri'

response = "<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>"
doc = Nokogiri::HTML( response )


doc.css( "title" ).text
HamZa
Bjoern Rennhak
  • Be careful using `css('title')`. `css` returns a NodeSet, which acts like an Array. Instead, because you are searching for `title`, use `at` or one of its aliases to return the first Node that matches. – the Tin Man May 21 '13 at 04:34
  • Thanks @Bjoern. I tried using Nokogiri and getting error message now. Please see my update to question. – Sunshine May 21 '13 at 07:05
  • Looks like the mod removed my update. Basically, I added the Nokogiri check, and once my code reaches doc = Nokogiri::HTML (response), it fails with the error: NoMethodError undefined method `empty?' ... Any suggestions? – Sunshine May 21 '13 at 14:49
  • You need to add `require 'nokogiri'`; it seems the library is not found. – Bjoern Rennhak May 21 '13 at 14:52
  • You can use irb to verify the code in the answer works. The error is somewhere else in your code. Please open a new question. – Bjoern Rennhak May 21 '13 at 14:58
  • Thanks @BjoernRennhak. You are correct. I fixed the code & used Nokogiri. It works fine. I just need one clarification on nokogiri. I know I can search specific page elements using it. But how to look for specific phrase / keyword. For example: in a – Sunshine May 22 '13 at 09:50
  • Nokogiri parses HTML/XML but unfortunately does not parse the JavaScript inside it. For that you would need to select each script node and use a regex to find what you are looking for. Here is an SO question that discussed something similar: http://stackoverflow.com/questions/14461931/ruby-nokogiri-javascript-parsing . – Bjoern Rennhak May 22 '13 at 11:36
  • Many Thanks @BjoernRennhak and everyone who responded to help. – Sunshine May 22 '13 at 12:25
0

This will work

<HEAD> <TITLE>TestExample (.*?)<\/TITLE><\/HEAD>

http://rubular.com/r/latepMqrjx

You probably don't need something as specific as <HEAD> <TITLE> as I doubt that there will be more than one title. Case sensitivity and newlines may also be an issue. I'd probably use

/<title>TestExample (.*?)<\//im
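
As a side note on why the pattern in the question returned `[]`: inside a regex, `[Date]` is a character class that matches a single `D`, `a`, `t`, or `e`, not the literal text. If the goal is to find the literal phrase, including a value filled in at run time, one option is `Regexp.escape`. This is only a sketch; the `body` and `phrase` variables are placeholders standing in for the real response and loop value.

```ruby
# Placeholder response body, copied from the question's sample.
body = "<HTML>\n<HEAD> <TITLE>TestExample [Date]</TITLE></HEAD>\n</HTML>"

# Unescaped: [Date] is a character class, so this looks for
# "TestExample " followed by one of D, a, t, e and finds nothing.
unescaped = body.scan(/TestExample [Date]/)

# Escaped: Regexp.escape turns the brackets into literal characters,
# so an interpolated value is matched exactly as written.
phrase = "TestExample [Date]"
escaped = body.scan(/#{Regexp.escape(phrase)}/)

# For a plain yes/no check, String#include? needs no regex at all.
found = body.include?(phrase)

p unescaped
p escaped
p found
```

For an `if` check, note that `scan` returns an empty array when nothing matches, and an empty array is still truthy in Ruby, so testing `.empty?` or using `include?` is the safer form.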
Explosion Pills
  • What is the *actual* input? – Explosion Pills May 20 '13 at 22:34
  • Thanks n sorry. I msg too soon. Both of above return [["[Date]"]]. I am however trying to locate - TestExample [Date] - in the response body. It is part of an 'if' check - if (not res or not res.scan(TestExample [Date])) -> then fail action, else pass action. One thing here to notice besides regex is this 'Date' is actually a parameter which is coming from the loop in the beginning & assigning values with each pass. – Sunshine May 20 '13 at 22:44
  • Just don't use regex. While it'll work for simple tasks, it's too fragile for anything of moderate complexity and very likely to break if the page changes. A DOM parser is much more robust and more easily maintained. – the Tin Man Mar 19 '20 at 04:36
0

You're making it much too hard. Using Nokogiri, you can easily parse and search HTML and/or XML.

To get the <title> text simply use Nokogiri's HTML::Document#title method:

require 'nokogiri'

doc = Nokogiri::HTML('<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>')
doc.title # => "TestExample [Date]"

There's no regex to write or maintain, and this will work as long as the HTML is reasonably valid.

Since you're trying to get what looks like a template for a date, you'll probably want to rewrite that string, which Nokogiri also makes easy using `title=`:

require 'date'
require 'nokogiri'

doc = Nokogiri::HTML('<HTML> <HEAD> <TITLE>TestExample [Date]</TITLE></HEAD> </HTML>')
title = doc.title
title['[Date]'] = Date.today.to_s
doc.title = title
puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <title>TestExample 2020-03-18</title>
# >> </head> </html>
the Tin Man