
I am attempting to replace:

   <td id="logo_divider"><a href="http://www.the-site.com"><img src=
   "/ART/logo.140.gif" width="140" height="84" alt="logo" border=
   "0" id="logo" name="logo" /></a></td>

with:

   <td id="logo_divider"><span itemscope itemtype="http://schema.org/Organization"><a itemprop="url" href="http://www.the-site.com"><img itemprop="logo" src=
   "/ART/logo.140.gif" width="140" height="84" alt="logo" border=
   "0" id="logo" name="logo" /></a></span></td>

The sed command I've written:

sed -E s#\(\<td id=\"logo_divider\"\>\)\(\<a \)\(href=\"http://www\.the-site\.com\"\>\<img \)\(src=\n\"/ART/logo\.140\.gif\".*?\n.*?\>\)#\1\<span itemscope itemtype=\"http://schema\.org/Organization\"\>\2itemprop=\"url\"\3itemprop=\"logo\"\4\</span\>\5#g default.ctp

There are two problems. The first is that the command fails with:

sed: 1: "s#(<td": unterminated substitute pattern

The second is that, even if it were to succeed, matching needs to be robust to line breaks. A more robust solution would first remove any line breaks between:

<td id="logo_divider">

and:

</td>

Then execute the replacement against the cleaned file. Something like:

sed -E s#\n##g | ...
kayaker243
  • The time it will take to craft a regular expression that exactly matches the text as it appears in your file would be better spent learning how to do this using a proper HTML parser in the language of your choice. `sed` is not the right tool for editing such markup languages. – chepner Oct 11 '13 at 19:14
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Gilles Quénot Oct 11 '13 at 19:35
  • Ahahah. This is great. I consider myself enlightened! – kayaker243 Oct 11 '13 at 19:47

1 Answer


As chepner says, use the right tool for the right job.

If you have Python available, I'd recommend Beautiful Soup. It makes this relatively simple (the following is rough, but you get the idea, assuming the source above is saved in somefile.html):

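from bs4 import BeautifulSoup

# Name the parser explicitly and close the file once parsing is done.
with open("./somefile.html") as fh:
    soup = BeautifulSoup(fh, "html.parser")

td = soup.find('td', id='logo_divider')
anchor = td.find('a')
anchor['itemprop'] = 'url'
anchor.find('img')['itemprop'] = 'logo'  # the target markup tags the image too

# Build the wrapping <span>, swap it in for the anchor, then put the anchor inside it.
span = soup.new_tag('span')
span['itemscope'] = True  # serializes as itemscope="True"
span['itemtype'] = 'http://schema.org/Organization'
spanchild = anchor.replace_with(span)  # replace_with returns the removed anchor
span.append(spanchild)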
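
Since the question asks for something robust to line breaks, here is a minimal sketch of the same idea wrapped in a function and run against two variants of the markup, one on a single line and one with a break inside the `img` tag. The helper name `add_org_microdata` and the shortened sample strings are my own, for illustration only; it assumes the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup


def add_org_microdata(html):
    """Wrap the logo link in a schema.org Organization span (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    td = soup.find("td", id="logo_divider")
    anchor = td.find("a") if td is not None else None
    if anchor is None:
        return html  # no logo cell with a link: hand the input back unchanged

    anchor["itemprop"] = "url"
    img = anchor.find("img")
    if img is not None:
        img["itemprop"] = "logo"

    span = soup.new_tag("span")
    span["itemscope"] = True  # same value as in the snippet above
    span["itemtype"] = "http://schema.org/Organization"
    anchor.replace_with(span)  # put the span where the anchor was
    span.append(anchor)        # then move the anchor inside the span
    return str(soup)           # serialize without re-indenting


# Shortened sample markup (an assumption, not the asker's full file).
one_line = ('<td id="logo_divider"><a href="http://www.the-site.com">'
            '<img src="/ART/logo.140.gif" alt="logo" /></a></td>')
wrapped = ('<td id="logo_divider"><a href="http://www.the-site.com">'
           '<img src=\n"/ART/logo.140.gif" alt="logo" /></a></td>')

for source in (one_line, wrapped):
    print(add_org_microdata(source))
```

Because the parser works on elements rather than on lines of text, where the line breaks fall inside a tag should not matter, which is the part that is awkward to express in `sed`.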
jstevenco
  • HTML5 treats a boolean attribute as true whenever it is present, so a bare `itemscope` is valid. `itemscope=True` (the result of your script) obviously works, but isn't how this value is typically set. Is there a way to make BeautifulSoup add itemscope without a value? – kayaker243 Oct 13 '13 at 22:53
  • And thanks a ton for the pointer. Super useful and intuitive. – kayaker243 Oct 13 '13 at 22:57
  • Hmmm... don't think so; at least, I don't see that sort of option available in the documentation. If you *really* want that, you *could* go the `sed` route or equivalent to post-process the `itemscope="True"` items to `itemscope`. – jstevenco Oct 13 '13 at 23:10
  • After performing the necessary operations, I save out the file. The only way I can identify to do this is by calling prettify. Unfortunately, Beautiful Soup wants to close all my meta tags in the head, which isn't actually required. Is there any way to prevent this behavior? I'd prefer if the only changes introduced were those explicitly requested... is this an unavoidable consequence of the advantages provided by an html parser? – kayaker243 Oct 14 '13 at 16:48
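
On the two follow-up questions in the comments, here is a small sketch that may help. As far as I can tell, Beautiful Soup writes an attribute whose value is `None` as a bare name, and `str(soup)` serializes the tree without the re-indentation that `prettify` applies. Both points are worth confirming against your installed version. The one-line sample markup is a made-up stand-in for the real file.

```python
from bs4 import BeautifulSoup

# Stand-in markup (an assumption); the real file would be read from disk.
html = '<td id="logo_divider"><a href="http://www.the-site.com">logo</a></td>'
soup = BeautifulSoup(html, "html.parser")

span = soup.new_tag("span")
span["itemscope"] = None  # None, rather than True, to get a valueless attribute
span["itemtype"] = "http://schema.org/Organization"

# wrap() replaces the anchor with the span and moves the anchor inside it.
soup.find("a").wrap(span)

result = str(soup)  # plain serialization, no prettify
print(result)
```

How void tags such as `meta` are written out depends on the parser and on the output formatter. The Beautiful Soup documentation describes a `formatter="html5"` option that omits the closing slash on void tags, but whether it is available depends on the release you have, so some normalization of the saved file may still be unavoidable.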