I'm not sure how I'd select a title with regex. I've tried
match(/<title>(.*) .*<\/title>/)[1]
but that doesn't match anything.
This is the response body I'm trying to select from.
Trying to select "title I need to select."
The reason it doesn't work is the itemprop=\"name\" attribute:
your pattern only matches a bare <title> tag. To fix this, you can match the attribute as well:
# copy-paste from the page you provided
html = '<!doctype html>\n<html lang=\"en\" itemscope itemtype=\"https://schema.org/WebPage\">\n<head>\n<meta charset=\"utf-8\"><meta name=\"referrer\" content=\"always\" />\n<title itemprop=\"name\">title I need to select.</title>\n<meta itemprop=\"description\" name=\"description\" content=\\'
html.match(/<title.*?>(.*)<\/title>/)[1] # => "title I need to select."
.*? is a lazy (non-greedy) quantifier. It basically means "match as many characters as needed, but no more".
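To make the difference concrete, here is a small sketch using only Ruby's built-in regex support. The first string is the tag from the question; the second string, with two title tags on one line, is a made-up input (not from the page) chosen only to show how greedy and lazy captures diverge.

```ruby
# Tag taken from the question's response body.
tag = '<title itemprop="name">title I need to select.</title>'

# The original pattern needs a literal "<title>", so the attribute blocks the match.
original = tag.match(/<title>(.*) .*<\/title>/)

# Lazy ".*?" consumes the attribute and stops at the first ">".
fixed = tag[/<title.*?>(.*)<\/title>/, 1]

# Hypothetical input with two title tags on one line.
two = '<title>one</title><title>two</title>'
greedy = two[/<title.*?>(.*)<\/title>/, 1]   # runs on to the last closing tag
lazy   = two[/<title.*?>(.*?)<\/title>/, 1]  # stops at the first closing tag

p original  # nil
p fixed     # "title I need to select."
p greedy    # "one</title><title>two"
p lazy      # "one"
```

On a page with a single title tag both captures give the same result, so the lazy capture only matters when more than one closing tag can appear on the same line.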
However, as others have pointed out, regexes are not ideal for HTML parsing. Instead, you could use a popular Ruby gem for that purpose, Nokogiri:
require 'nokogiri'
page = Nokogiri.parse(html)
page.css('title').text # => "title I need to select."
Note that it can handle even malformed HTML, as is the case here.
If you're looking for a much more robust XML/HTML parser, try Nokogiri, which supports XPath.
This post explains why: "Use xPath or Regex?"
require "nokogiri"
string = "<title itemprop=\"name\">title I need to select.</title>"
html_doc = Nokogiri::HTML(string)
html_doc.xpath("//title").first.text
Here's the regexp that will give you what you want:
<title.*>(.*)<\/title>
As was mentioned, there are better ways to parse HTML. You might want to check out something like Nokogiri.
When I have to get elements from XML I like to convert it to a hash
from_xml(xml, disallowed_types = nil) public
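Returns a Hash containing a collection of pairs where the key is the node name and the value is its content
Now you can do something like:
# Hash.from_xml comes from ActiveSupport (Rails), not from plain Ruby
hash = Hash.from_xml('XML') # 'XML' stands in for your XML string
hash['title'] # my favorite book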
One solution would be to use the following pattern:
<title.*?>(.*?)<\/title>
Use an HTML/XML parser when dealing with XML or HTML data, except for extremely simple cases. HTML and XML are too complicated for normal regular expressions.
Using Nokogiri I'd do:
require 'nokogiri'
some_html = '
<html>
<head>
<title>the title</title>
</head>
</html>
'
doc = Nokogiri::HTML(some_html)
doc.title # => "the title"
Nokogiri already has a method to return the title so you can take advantage of that. Or, you can do it the normal way:
doc.at('title').text # => "the title"
The problem with a regular expression is that HTML could be written in many ways:
<title>foo</title>
or:
<title>
foo
</title>
or even:
<title>foo
</head>
which, while not correct, will be accepted by browsers and fixed up by Nokogiri, so the lookup still works. Writing a pattern to handle those variants is a pain and error-prone. It only gets worse as the HTML gets more complex, especially when you don't control the generation of the content.
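As a rough illustration of that pain, the sketch below runs one of the patterns suggested in this thread against the three variants above, using only Ruby's built-in regex support. The variable names are made up for the example.

```ruby
# The three variants from above, as plain strings.
one_line   = "<title>foo</title>"
multi_line = "<title>\nfoo\n</title>"
unclosed   = "<title>foo\n</head>"

pattern = /<title.*?>(.*?)<\/title>/

a = one_line[pattern, 1]    # matches
b = multi_line[pattern, 1]  # no match: "." does not cross newlines by default
c = unclosed[pattern, 1]    # no match: there is no closing tag to anchor on

# The /m flag lets "." match newlines, but the capture then needs trimming.
d = multi_line[/<title.*?>(.*?)<\/title>/m, 1]

p a        # "foo"
p b        # nil
p c        # nil
p d        # "\nfoo\n"
p d.strip  # "foo"
```

Each variant needs its own workaround, and the unclosed tag has no clean regex answer at all, which is where a parser earns its keep.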