EDIT: I know how to do this. I'm not looking for a solution, I'm looking for a process or existing program recommendation before I take the time to write something myself in some scripting language.

I have some HTML files in various directories which all have a similar structure:

<html>
    <head>...</head>
    <body>
        <nav>...</nav>
        <section>...</section>
    </body>
</html>

I'd like to programmatically replace HTML sections with other sections (e.g. replace the <nav> block with a different nav block [specified in a file of my choosing]) for all the files I specify.

I think the ideal solution would be some sort of tool using lxml or something similar in Python, but if there were an easy way to do it with *nixy tools, or an existing program to do this, I'd be happy to do that instead of putting together a script.

Isaac
  • If you're comfortable with regular expressions, then something using [`sed`](http://en.wikipedia.org/wiki/Sed) might work. – Aya Apr 25 '13 at 13:12
  • post your attempt thus far, or you most likely won't be getting other peoples' attempts in each of those languages – jamylak Apr 25 '13 at 13:12
  • Your best bet is to do it with Python's HTML library. It depends on your HTML, but in general HTML is not parseable with regex (though your particular input might be). For the invocation you can use `find . -name \*.html -exec your_python_script {} \;` – clt60 Apr 25 '13 at 13:21
  • I'm not looking for a solution, I'm looking for process recommendation. I edited my question to make that clear. – Isaac Apr 25 '13 at 14:02
  • Please do not [use regular expressions to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). This is a very bad idea and anyone who suggests it is creating a world of hurt for someone. – tadman Apr 25 '13 at 14:35
  • "I know how to do this. I'm not looking for a solution, I'm looking for a process or existing program recommendation..." Then, you don't know how to do it really, otherwise you would understand that the problem is much too big and varied for a `sed` or any other string processing tool or "existing script", and, even if you got something to work it would be fragile. Even a `lxml`-type command-line tool won't do it because specifying the process for the tags you want to target, especially if there are multiple similar tags, would put you squarely into scripting, so you might as well begin there. – the Tin Man Apr 25 '13 at 14:52
  • Try typing `nokogiri -h` at the command prompt. I think you'll find the command-line interface of any tool like this is not flexible enough to do anything beyond a simple search and replace. – the Tin Man Apr 25 '13 at 17:35

3 Answers

You might be able to use BeautifulSoup for Python like so.

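from bs4 import BeautifulSoup  # BeautifulSoup 4; the bare `import BeautifulSoup` is the obsolete version 3

# html_data is assumed to hold the page's HTML as a string
soup = BeautifulSoup(html_data, "html.parser")
nav = soup.find("nav")  # first <nav> element, or None if there is none
nav.name = "newname"  # placeholder; renames the tag itself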

For example:

from bs4 import BeautifulSoup

html_data = "<nav>Some text</nav>"
soup = BeautifulSoup(html_data, "html.parser")
nav = soup.find("nav")
nav.name = "nav2"  # rename the tag; its contents are kept
print(soup)

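This changes <nav>Some text</nav> to <nav2>Some text</nav2>. Note that it renames the tag and keeps its contents; it does not swap in a different block.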
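Since the question asks about replacing a whole block rather than renaming a tag, here is a minimal sketch of that using BeautifulSoup 4's `replace_with`. The helper names `replace_block` and `rewrite_files`, the sample markup, and the assumption that the snippet file contains exactly one element of the target tag are all illustrative, not something from the question. Serializing the parsed tree can also change the file cosmetically (attribute quoting, entities), so treat this as a starting point.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def replace_block(html_text, tag_name, snippet_html):
    """Return html_text with its first <tag_name> element swapped for the one in snippet_html."""
    soup = BeautifulSoup(html_text, "html.parser")
    target = soup.find(tag_name)
    if target is None:
        return html_text  # nothing to replace; leave the document untouched

    # Parse the snippet separately so it is inserted as markup, not escaped text.
    # Assumption: the snippet holds one element with the same tag name.
    replacement = BeautifulSoup(snippet_html, "html.parser").find(tag_name)
    if replacement is None:
        raise ValueError(f"snippet contains no <{tag_name}> element")

    target.replace_with(replacement)
    return str(soup)


def rewrite_files(paths, tag_name, snippet_path):
    """Apply replace_block in place to every file in paths (hypothetical helper)."""
    snippet_html = Path(snippet_path).read_text(encoding="utf-8")
    for path in map(Path, paths):
        original = path.read_text(encoding="utf-8")
        path.write_text(replace_block(original, tag_name, snippet_html), encoding="utf-8")


page = "<html><body><nav>old</nav><section>keep me</section></body></html>"
new_nav = '<nav><a href="/">Home</a></nav>'
result = replace_block(page, "nav", new_nav)
print(result)
```

To run it over many files, something like `rewrite_files(Path(".").rglob("*.html"), "nav", "new_nav.html")` would work, where `new_nav.html` is a made-up file name; backing the files up first is a good idea because they are overwritten in place.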

Shannon Rothe
  • Thank you. I should have made myself clearer—I'm not looking for code to do this, I'm just trying to find the best tool or an existing script that does this. – Isaac Apr 25 '13 at 14:03
  • No worries—I gave you a +1 anyways! It could be helpful to someone coming along looking for something similar, and now we've got a Python and Ruby library listed. – Isaac Apr 25 '13 at 16:24

Don't use regex or string parsing. Those will only make your head hurt. Use a parser.

In Ruby I'd use Nokogiri:

require 'nokogiri'

html = '
<html>
  <body>
    <nav>...</nav>
    <section>...</section>
  </body>
</html>
'
doc = Nokogiri::HTML(html)

doc.at('nav').content = "this is a new block"
puts doc.to_html

Which outputs:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
    <nav>this is a new block</nav><section>...</section>
</body></html>

Of course you'd want to replace "this is a new block" with something like `File.read('snippet.html')`.

If your file of substitutions contains HTML snippets instead of the nav content, use this instead:

nav = doc.at('nav').replace('<nav>this is a new block</nav>')

The output would be the same. (And, again, use `File.read` to grab that from a file if that's how you lean.)

In Nokogiri, `at` finds the first instance of the tag specified by a CSS or XPath accessor and returns the Node. I used CSS above, but `//nav` would have worked also. `at` guesses at the type of accessor. You can use `at_css` or `at_xpath` if you want to be specific, because it's possible to have ambiguous accessors. Also, Nokogiri has `search`, which returns a NodeSet, which acts like an array. You can iterate over the results doing what you want. And, like `at`, there are CSS and XPath specific versions, `css` and `xpath` respectively.
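For readers following along in Python rather than Ruby, the same first-match versus all-matches split exists in BeautifulSoup 4: `select_one` returns the first element matching a CSS selector (or `None`), and `select` returns a list of every match. This is only a rough analogue sketched on made-up markup; BeautifulSoup has no XPath support, so for XPath you would need lxml instead.

```python
from bs4 import BeautifulSoup

html = (
    "<html><body><nav>...</nav>"
    "<section>one</section><section>two</section>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# First match only, similar in spirit to Nokogiri's at_css.
first_nav = soup.select_one("nav")

# Every match as a list, similar in spirit to Nokogiri's css.
sections = soup.select("section")

# Iterate over the matches and rewrite each one's text content.
for index, section in enumerate(sections, start=1):
    section.string = f"section {index}"

print(first_nav.name, len(sections))
print(soup)
```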

Nokogiri has a command-line interface, and for something as simple as this example it would work, but I could also do it in `sed` or a Ruby/Perl/Python one-liner.

curl -s http://nokogiri.org | nokogiri -e'p $_.css("h1").length'

HTML is seldom this simple though, especially anything that is found roaming the wilds, and a CLI or one-liner solution will rapidly grow out of control, or simply die. I say that based on years of writing many spiders and RSS aggregators -- what starts out simple grows a lot more complex when you introduce an additional HTML or XML source, and it never gets easier. Using parsers taught me to go to them first.

the Tin Man

I ended up writing my own little command-line tool to do what I wanted. It works fairly well for my use cases, and I intend to improve on it over time. It's on GitHub: trufflepig.

I hope it can be of use to others as well.

Isaac