Using regex to get title

Question

I'm not sure how I'd select a title with regex. I've tried

match(/<title>(.*) .*<\/title>/)[1]

but that doesn't match anything.

This is the response body I'm trying to select from.

Trying to select "title I need to select."

Parsing HTML with regexes only leads to unfortunate effects for the developer: http://stackoverflow.com/a/1732454/67392 — Richard, Feb 01 '17 at 15:56
"If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine." That seems like what I'm trying to do. So it might be fine? — user3579614, Feb 01 '17 at 15:59
@user3579614 do you have any familiarity with JavaScript? It's similar in syntax and more suited for scraping HTML pages. — OneNeptune, Feb 01 '17 at 16:03
Yeah a little bit. I've found nokogiri, that seems to do the job correctly? — user3579614, Feb 01 '17 at 16:12
@user3579614, Nokogiri is indeed the right tool for the job. Wrote an answer to explain how you can use it for your specific case. — ndnenkov, Feb 01 '17 at 16:18
@OneNeptune, I wouldn't say Ruby is unfit for html parsing. Nokogiri works as well as any other industry standard parser. — ndnenkov, Feb 01 '17 at 16:19
@ndn fair, I think highly of Ruby and just meant to suggest if it was a light weight utility not to reinvent the wheel. — OneNeptune, Feb 01 '17 at 16:21
JavaScript is hardly more suitable for scraping, it's just different. A well implemented parser, like Nokogiri, is extremely powerful and convenient because it's designed for use with Ruby. — the Tin Man, Feb 01 '17 at 20:56
Please read "[mcve]". When asking a question like this, you should supply the minimal HTML necessary to demonstrate the problem. While this particular problem results in a small amount of HTML, future problems you ask about probably won't be as simple and that supplied data will be more important. — the Tin Man, Feb 01 '17 at 20:58

ndnenkov · Answer 1 · 2017-02-01T16:23:17.537

The reason it doesn't work is the itemprop="name" attribute on the title tag: your pattern expects a literal <title>. To fix this, you can match the attribute as well:

# copy-paste from the page you provided
html = '<!doctype html>\n<html lang=\"en\" itemscope itemtype=\"https://schema.org/WebPage\">\n<head>\n<meta charset=\"utf-8\"><meta name=\"referrer\" content=\"always\" />\n<title itemprop=\"name\">title I need to select.</title>\n<meta itemprop=\"description\" name=\"description\" content=\\'

html.match(/<title.*?>(.*)<\/title>/)[1] # => "title I need to select."

.*? basically means "match as many characters as are needed, but no more".
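To make the lazy quantifier concrete, here is a minimal sketch in plain Ruby. The `sample` string is a shortened, hypothetical stand-in for the response body, and the variable names are only for illustration.

```ruby
# Hypothetical one-line sample standing in for the real response body.
sample = '<title itemprop="name">title I need to select.</title>'

# The pattern from the question needs a literal "<title>", so the attribute breaks it.
original = sample.match(/<title>(.*) .*<\/title>/)

# ".*?" after "<title" consumes the attribute lazily, up to the first ">".
fixed = sample.match(/<title.*?>(.*)<\/title>/)

# Capturing what the quantifier itself swallows shows lazy vs. greedy.
lazy_attr   = sample[/<title(.*?)>/, 1]
greedy_attr = sample[/<title(.*)>/, 1]

p original     # => nil
p fixed[1]     # => "title I need to select."
p lazy_attr    # => " itemprop=\"name\""
p greedy_attr  # => " itemprop=\"name\">title I need to select.</title"
```

The greedy version runs on to the last ">" in the string, which is why the lazy form is the safer choice for skipping attributes.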

However, as others have pointed out, regexes are not ideal for HTML parsing. Instead, you could use a popular Ruby gem for that purpose - Nokogiri:

require 'nokogiri'

page = Nokogiri.parse(html)
page.css('title').text # => "title I need to select."

Note that it can handle even malformed HTML, as is the case here.

score 2 · Answer 2 · edited May 23 '17 at 11:53

If you're looking for a much more robust XML/HTML parser, try using Nokogiri which supports XPath.

This post explains why: "Use XPath or Regex?"

require "nokogiri"
string = "<title itemprop=\"name\">title I need to select.</title>"
html_doc = Nokogiri::HTML(string)
html_doc.xpath("//title").first.text

answered Feb 01 '17 at 16:17

jeremy04

Better use CSS rather than Xpath, CSS is less error-prone. – akuhn Feb 01 '17 at 18:00
I'm interested in an example of 'error-prone'. The syntax for XPath is designed mostly for XML, but it works with XHTML. CSS selectors can break just as easily when a CSS class/id changes, and XPath can break easily when the structure of the HTML changes. Pick your poison in that regard. What makes CSS better other than 'the syntax is easier'? – jeremy04 Feb 01 '17 at 18:51
`node.xpath("//foo")` does not select all `foo` descendants of `node`. About every other nokogiri question is someone tripping over that. I highly recommend CSS with its predictable behavior. – akuhn Feb 02 '17 at 05:51

score 1 · Answer 3 · answered Feb 01 '17 at 16:12

Here's the regexp that will give you what you want: <title.*>(.*)<\/title>

As was mentioned, there are better ways to parse HTML. You might want to check out something like Nokogiri.

Kapitol

This will select a whole lot more if there are two `title` tags on that page. – ndnenkov Feb 01 '17 at 16:13
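A minimal sketch of that situation, assuming a hypothetical page where an inline SVG carries a second title element. It compares the greedy pattern from this answer with a fully lazy one.

```ruby
# Hypothetical page with two <title> elements (the second inside an inline SVG).
page = '<head><title>Page title</title></head>' \
       '<body><svg><title>Icon</title></svg></body>'

greedy = /<title.*>(.*)<\/title>/
lazy   = /<title.*?>(.*?)<\/title>/

greedy_whole   = page[greedy]     # whole match spans both elements
greedy_capture = page[greedy, 1]  # capture group ends up in the last <title>
lazy_capture   = page[lazy, 1]    # capture group stays in the first <title>

p greedy_whole    # => "<title>Page title</title></head><body><svg><title>Icon</title>"
p greedy_capture  # => "Icon"
p lazy_capture    # => "Page title"
```

With the greedy pattern the overall match stretches from the first opening tag to the last closing tag, so the captured text is no longer the page title.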

score 0 · Answer 4 · answered Feb 01 '17 at 16:09

When I have to get elements from XML I like to convert it to a hash

from_xml(xml, disallowed_types = nil) public

Returns a Hash containing a collection of pairs when the key is the node name and the value is its content

# http://apidock.com/rails/Hash/from_xml/class

now you can do something like

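require 'active_support/all' # Hash.from_xml comes from ActiveSupport, not core Ruby

hash = Hash.from_xml('<title>my favorite book</title>')
hash['title'] # => "my favorite book"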

Jose Perez

doesn't work for the OP's html, also you need to use rails or at least require 'active_support/all' – peter Feb 01 '17 at 16:44

score 0 · Answer 5 · answered Feb 01 '17 at 17:30

One solution would be to use the following pattern:

<title.*?>(.*?)<\/title>

https://regex101.com/r/piwm5H/1

spencer.sm

the Tin Man · Answer 6 · 2017-02-01T20:54:23.533

Use an HTML/XML parser when dealing with XML or HTML data, except for extremely simple cases. HTML and XML are too complicated for normal regular expressions.

Using Nokogiri I'd do:

require 'nokogiri'

some_html = '
<html>
  <head>
    <title>the title</title>
  </head>
</html>
'

doc = Nokogiri::HTML(some_html)
doc.title # => "the title"

Nokogiri already has a method to return the title so you can take advantage of that. Or, you can do it the normal way:

doc.at('title').text  # => "the title"

The problem with a regular expression is that HTML could be written in many ways:

<title>foo</title>

or:

<title>
  foo
</title>

or even:

<title>foo
</head>

which, while not correct, will be accepted by browsers and fixed up by Nokogiri, so the parser-based code still works. Writing a pattern to handle those variants is painful and error-prone. It only gets worse as the HTML gets more complex, especially when you don't control the generation of the content.
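As a rough illustration of how quickly a pattern falls over, here is a sketch that runs one lazy regex, with and without the /m flag, against the three variants shown above. The hash keys and variable names are made up for this example.

```ruby
# The three ways of writing the title shown above.
variants = {
  one_line:  "<title>foo</title>",
  multiline: "<title>\n  foo\n</title>",
  unclosed:  "<title>foo\n</head>"
}

plain_pattern     = /<title.*?>(.*?)<\/title>/   # "." does not match newlines by default
multiline_pattern = /<title.*?>(.*?)<\/title>/m  # /m lets "." match newlines too

# For each variant, collect [capture without /m, stripped capture with /m].
results = variants.transform_values do |html|
  [html[plain_pattern, 1], html[multiline_pattern, 1]&.strip]
end

results.each do |name, (plain, multi)|
  puts "#{name}: plain=#{plain.inspect} multiline=#{multi.inspect}"
end
# one_line: plain="foo" multiline="foo"
# multiline: plain=nil multiline="foo"
# unclosed: plain=nil multiline=nil
```

The /m flag rescues the multi-line case, but nothing short of more special-casing recovers the unclosed tag, which a parser repairs for you.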
