3

I'm not sure how I'd select a title with a regex. I've tried

match(/<title>(.*) .*<\/title>/)[1]

but that doesn't match anything.

This is the response body I'm trying to select from.

Trying to select "title I need to select."

ndnenkov
user3579614
  • Parsing HTML with regexes only leads to unfortunate effects for the developer: http://stackoverflow.com/a/1732454/67392 – Richard Feb 01 '17 at 15:56
  • "If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine." That seems like what I'm trying to do. So it might be fine? – user3579614 Feb 01 '17 at 15:59
  • Also what would you recommend doing instead? – user3579614 Feb 01 '17 at 16:01
  • @user3579614 do you have any familiarity with JavaScript? It's similar in syntax and more suited for scraping HTML pages. – OneNeptune Feb 01 '17 at 16:03
  • Yeah a little bit. I've found nokogiri, that seems to do the job correctly? – user3579614 Feb 01 '17 at 16:12
  • @user3579614, Nokogiri is indeed the right tool for the job. Wrote an answer to explain how you can use it for your specific case. – ndnenkov Feb 01 '17 at 16:18
  • @OneNeptune, I wouldn't say Ruby is unfit for html parsing. Nokogiri works as well as any other industry standard parser. – ndnenkov Feb 01 '17 at 16:19
  • @ndn fair, I think highly of Ruby and just meant to suggest if it was a light weight utility not to reinvent the wheel. – OneNeptune Feb 01 '17 at 16:21
  • JavaScript is hardly more suitable for scraping, it's just different. A well implemented parser, like Nokogiri, is extremely powerful and convenient because it's designed for use with Ruby. – the Tin Man Feb 01 '17 at 20:56
  • Please read "[mcve]". When asking a question like this, you should supply the minimal HTML necessary to demonstrate the problem. While this particular problem results in a small amount of HTML, future problems you ask about probably won't be as simple and that supplied data will be more important. – the Tin Man Feb 01 '17 at 20:58

6 Answers

2

It doesn't work because the title tag carries an itemprop=\"name\" attribute, so the literal <title> in your pattern never matches. To fix this, allow for attributes in the pattern:

# copy-paste from the page you provided
html = '<!doctype html>\n<html lang=\"en\" itemscope itemtype=\"https://schema.org/WebPage\">\n<head>\n<meta charset=\"utf-8\"><meta name=\"referrer\" content=\"always\" />\n<title itemprop=\"name\">title I need to select.</title>\n<meta itemprop=\"description\" name=\"description\" content=\\'

html.match(/<title.*?>(.*)<\/title>/)[1] # => "title I need to select."

.*? is a lazy (non-greedy) quantifier: it matches as few characters as needed for the rest of the pattern to succeed.
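To make the difference concrete, here is a small sketch comparing a greedy and a lazy quantifier. The sample string and variable names are made up for illustration and are not from the question.

```ruby
# Hypothetical sample text with two tag pairs on one line.
text = '<b>one</b> and <b>two</b>'

# Greedy: .* grabs as much as it can, so the capture runs to the LAST </b>.
greedy = text[/<b>(.*)<\/b>/, 1]

# Lazy: .*? stops at the FIRST place where the rest of the pattern can match.
lazy = text[/<b>(.*?)<\/b>/, 1]

puts greedy # prints: one</b> and <b>two
puts lazy   # prints: one
```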


However, as others have pointed out, regexes are not ideal for HTML parsing. Instead, you could use a popular Ruby gem for that purpose, Nokogiri:

require 'nokogiri'

page = Nokogiri.parse(html)
page.css('title').text # => "title I need to select."

Note that it can handle even malformed HTML, as is the case here.

ndnenkov
2

If you're looking for a much more robust XML/HTML parser, try Nokogiri, which supports XPath.

This post explains why: "Use XPath or Regex?"

require "nokogiri"
string = "<title itemprop=\"name\">title I need to select.</title>"
html_doc = Nokogiri::HTML(string)
html_doc.xpath("//title").first.text
Community
jeremy04
  • Better to use CSS rather than XPath; CSS is less error-prone. – akuhn Feb 01 '17 at 18:00
  • I'm interested in an example of 'error prone'? The syntax for XPath is designed for XML mostly, but works with XHTML. CSS selectors can break just as easily with a CSS class/id change, XPath can break easily with the structure of the HTML breaking. Pick your poison in that regard. What makes CSS better other than the 'syntax is easier'? – jeremy04 Feb 01 '17 at 18:51
  • `node.xpath("//foo")` does not select all `foo` descendants of `node`. About every other nokogiri question is someone tripping over that. I highly recommend CSS with its predictable behavior. – akuhn Feb 02 '17 at 05:51
1

Here's the regexp that will give you what you want: <title.*>(.*)<\/title>

As was mentioned, there are better ways to parse HTML. You might want to check out something like Nokogiri.
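One caveat about the greedy `<title.*>` part of the pattern above, shown as a hedged sketch: if the title text itself contains a `>`, the greedy match can swallow part of the title. The sample HTML below is invented for illustration and is not the OP's page.

```ruby
# Hypothetical title whose text contains a ">" character.
html = '<title itemprop="name">Ruby > Perl</title>'

# Greedy tag match: .* runs up to the last ">" that still lets the rest
# of the pattern match, which here is the ">" inside the title text.
greedy_tag = html[/<title.*>(.*)<\/title>/, 1]

# Lazy tag match: .*? stops at the first ">", the end of the opening tag.
lazy_tag = html[/<title.*?>(.*)<\/title>/, 1]

puts greedy_tag.inspect # prints: " Perl"
puts lazy_tag.inspect   # prints: "Ruby > Perl"
```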

Kapitol
0

When I have to get elements from XML I like to convert it to a hash

from_xml(xml, disallowed_types = nil) public

Returns a Hash containing a collection of pairs where the key is the node name and the value is its content

# http://apidock.com/rails/Hash/from_xml/class

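Now you can do something like:

require 'active_support/all' # Hash.from_xml comes from ActiveSupport

hash = Hash.from_xml('<title>my favorite book</title>')
hash['title'] # => "my favorite book"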
Jose Perez
  • doesn't work for the OP's html, also you need to use rails or at least require 'active_support/all' – peter Feb 01 '17 at 16:44
0

One solution would be to use the following pattern:

<title.*?>(.*?)<\/title>

https://regex101.com/r/piwm5H/1
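In Ruby, this pattern could be applied with `String#[]`, which returns `nil` when nothing matches, whereas `match(...)[1]` raises `NoMethodError` on a failed match. A minimal sketch; the constant name and the sample strings are assumptions made for illustration:

```ruby
# The pattern from this answer, stored in a hypothetical constant.
TITLE_PATTERN = /<title.*?>(.*?)<\/title>/

with_title = '<head><title itemprop="name">title I need to select.</title></head>'
without_title = '<head></head>'

# String#[] with a regexp and a group index returns the capture or nil.
found = with_title[TITLE_PATTERN, 1]
missing = without_title[TITLE_PATTERN, 1]

puts found           # prints: title I need to select.
puts missing.inspect # prints: nil
```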

spencer.sm
0

Use an HTML/XML parser when dealing with XML or HTML data, except for extremely simple cases. HTML and XML are too complicated for normal regular expressions.

Using Nokogiri I'd do:

require 'nokogiri'

some_html = '
<html>
  <head>
    <title>the title</title>
  </head>
</html>
'

doc = Nokogiri::HTML(some_html)
doc.title # => "the title"

Nokogiri already has a method to return the title so you can take advantage of that. Or, you can do it the normal way:

doc.at('title').text  # => "the title"

The problem with a regular expression is that HTML could be written in many ways:

<title>foo</title>

or:

<title>
  foo
</title>

or even:

<title>foo
</head>

which, while not correct, will be accepted by browsers and fixed up by Nokogiri, so the code above still works. Writing a pattern to handle those variants is painful and error-prone. It only gets worse as the HTML gets more complex, especially when you don't control the generation of the content.
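As a rough illustration of that point, the sketch below runs one plausible title pattern against the three variants. The pattern and variable names are assumptions chosen for the example, not a recommended approach.

```ruby
# A plausible single-line pattern (an assumption for this sketch).
pattern = /<title.*?>(.*?)<\/title>/

one_line   = "<title>foo</title>"
multi_line = "<title>\n  foo\n</title>"
unclosed   = "<title>foo\n</head>"

# Works for the simple case.
simple = one_line[pattern, 1]

# Fails for the multi-line case: "." does not match newlines by default.
broken = multi_line[pattern, 1]

# The /m flag lets "." match newlines, but the result then needs trimming.
multiline_pattern = /<title.*?>(.*?)<\/title>/m
patched = multi_line[multiline_pattern, 1].strip

# No pattern that requires </title> can handle the unclosed variant.
unclosed_result = unclosed[multiline_pattern, 1]

puts simple.inspect          # prints: "foo"
puts broken.inspect          # prints: nil
puts patched.inspect         # prints: "foo"
puts unclosed_result.inspect # prints: nil
```

Each variant needs its own adjustment, which is the maintenance burden a parser avoids.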

the Tin Man