How to use Nokogiri to change the HTML meta data?

Question

I have been asked to append a company name to the meta description in all of our .asp files. With a bit of knowledge of Nokogiri and Ruby, I think I should be able to automate this "human-intensive" task. I can easily scrape the .asp files and get a list of descriptions to change. But how can I change the value and write it back to the file?

I am trying to do this with Nokogiri, but it seems Nokogiri was designed to scrape data and write XML, not HTML. (The .asp files are fairly simple: they just include some duplicated code and no logic at all, so they can be treated as HTML/text.) Does Nokogiri provide this feature? If not, what else can I do? Thanks!

Post an example of one of your files: you can obfuscate the meta description and make the file 5-10 lines long for your question. Using your own regexes to parse html can quickly get very tricky, so Nokogiri is a fine option. — 7stud, Aug 01 '14 at 19:47
Nokogiri wasn't designed to scrap(e) data, it's an XML parser. HTML is XML with relaxed parsing rules so it does that too. It writes XML, XHTML and HTML, depending on what you ask for. — the Tin Man, Aug 01 '14 at 20:15
@7stud Thanks for your suggestions. I stayed on the Nokogiri path and got it nailed (thanks to the answer below, though...). Thanks! — Quin, Aug 01 '14 at 21:38

the Tin Man · Accepted Answer · 2014-08-01T20:27:58.517

Nokogiri is excellent for this:

require 'nokogiri'

doc = Nokogiri::HTML.parse(<<EOT)
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
    <meta name="description" content="Free Web tutorials">
  </head>
  <body></body>
</html>
EOT

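# Select the description tag specifically, not just the first <meta> that has a name attribute
meta = doc.at('meta[name="description"]')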
meta['content'] = 'foo'

puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <head>
# >>     <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
# >>     <meta name="description" content="foo">
# >>   </head>
# >>   <body></body>
# >> </html>

If you want to append something to the description's content:

meta['content'] = meta['content'] + ' by foobar'

Which results in:

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <head>
# >>     <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
# >>     <meta name="description" content="Free Web tutorials by foobar">
# >>   </head>
# >>   <body></body>
# >> </html>

HTML that you don't control can change in wild and wonderful ways if the creators change to different HTML generators. That can break your application unless you use something robust, and regular expressions for HTML are not robust enough.

It's easy to write a pattern to match

<meta name="description" content="Free Web tutorials">

It's not so easy to write one that matches that one day, and then

<meta 
name="description"

content="Free Web tutorials"
>

the next.

It's easy to imagine seeing various HTML output styles because the site's content people used different tools, along with some automation. A parser can handle it nicely.

Most detailed answer! I got stuck because I originally thought I had to write the content back by calling some Nokogiri method. I was so focused on that that I forgot I am just changing the content in memory and can simply write it to the file. — Quin, Aug 01 '14 at 21:41

score 0 · Answer 2 · edited May 23 '17 at 12:11

Open the file, use a regex or String to identify the text to replace, gsub it appropriately, then write the result back to the file.

There are lots of solutions on SO for this. Here's just one, with a brief example:

File.write("hello.txt", File.read("hello.txt").gsub("install", "upgrade"))

This will replace every instance of the word "install" in "hello.txt" with "upgrade".
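To see what "every instance" means for an HTML page, here is a small sketch that runs the same `gsub` call on an in-memory string instead of a file. The sample markup is made up for illustration.

```ruby
# Made-up sample: the word appears in the meta description and in the body.
html = <<~HTML
  <meta name="description" content="How to install widgets">
  <p>Click install to begin.</p>
HTML

# String#gsub with a string pattern replaces all occurrences, wherever they are.
updated = html.gsub("install", "upgrade")

puts updated
```

Both occurrences are replaced, so this approach fits best when the search string is known to appear only where the change is wanted.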

answered Aug 01 '14 at 19:31 by engineersmnky · edited May 23 '17 at 12:11 by Community

That will indiscriminately change it throughout the document, not just in the meta-description. – the Tin Man Aug 01 '14 at 20:16
@theTinMan It was not meant to be an exact answer, since his question had no specifics when I answered. It was directional, since he asked "how can I do this". – engineersmnky Aug 04 '14 at 12:57
