2

The problem I'm facing is badly named links. There are a few hundred bad links across different files.

So I'm writing a Bash script to replace links like these:
<a href="../../../external.html?link=http://www.twitter.com">
<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">
<a href="../external.html?link=http://www.tumblr.com/">
with direct links like <a href="http://www.twitter.com">

I know the pattern ../ repeats one or more times. The external.html?link= part should also be removed.

How would you recommend doing this? awk, sed, maybe Python? Will I need a regex?

Thanks for opinions...

mrancys

2 Answers

2

This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.

The following python regular expression would locate these links for you:

r'href="((?:\.\./)+external\.html\?link=)([^"]+)"'

The pattern we look for is something inside an href="" chunk of text, where that 'something' starts with one or more instances of ../, followed by external.html?link=, then followed by any text that does not contain a " quote.

The matched text after the equals sign is captured in group 2 for easy retrieval; group 1 holds the ../../external.html?link= part.
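To make the two groups concrete, here is a minimal sketch that runs the pattern against one of the sample links from the question. The names `link_pattern` and `sample` are just illustrative, not part of the original answer.

```python
import re

# Same pattern as above: group 1 is the redirect prefix, group 2 the real target.
link_pattern = re.compile(r'href="((?:\.\./)+external\.html\?link=)([^"]+)"')

sample = '<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">'

match = link_pattern.search(sample)
prefix, target = match.group(1), match.group(2)

print(prefix)  # ../../external.html?link=
print(target)  # http://www.facebook.com/pages/somepage/
```

Because `(?:\.\./)+` is a non-capturing group, it does not shift the group numbering, however many `../` segments the link contains.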

If all you want to do is remove the ../../external.html?link= part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub() on your HTML files:

import re
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')

# ...
# .sub() returns a new string; it does not modify somehtmlstring in place
somehtmlstring = redirects.sub(r'href="\1"', somehtmlstring)
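Since the question mentions a few hundred links spread over different files, the substitution probably needs to run over a whole directory tree. The following is one possible sketch, not the answer's own code: the helper name `strip_redirects`, the `*.html` glob, and the choice to work on raw bytes (so file encodings and line endings are left untouched) are all assumptions.

```python
import re
from pathlib import Path

# Bytes version of the same pattern, so no file encoding has to be guessed.
REDIRECT = re.compile(rb'href="(?:\.\./)+external\.html\?link=([^"]+)"')


def strip_redirects(root, glob="*.html"):
    """Rewrite redirect links in every matching file under root.

    Returns the number of files that were actually changed.
    """
    changed = 0
    for path in Path(root).rglob(glob):
        original = path.read_bytes()
        # subn() returns the new content plus the number of replacements made.
        fixed, count = REDIRECT.subn(rb'href="\1"', original)
        if count:
            path.write_bytes(fixed)
            changed += 1
    return changed
```

It would be called as `strip_redirects("path/to/static/site")`. Files without a match are never rewritten, so their timestamps stay intact. Running it on a copy of the site first is a sensible precaution, because the files are overwritten in place.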

Note that this could also match body text (text outside HTML tags); this is not an HTML-aware solution. Chances are there is no such body text. If there is, you'll need a full-blown HTML parser like BeautifulSoup or lxml instead.
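If body text is a concern, the substitution can be limited to real `<a>` start tags. The sketch below uses the standard library's `html.parser` in place of BeautifulSoup or lxml, purely to avoid a third-party dependency. The names `AnchorRewriter` and `rewrite_links` are hypothetical, and the sketch assumes `getpos()` reports the position of the tag currently being handled.

```python
import re
from html.parser import HTMLParser

HREF = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')


class AnchorRewriter(HTMLParser):
    """Applies the href substitution only inside <a ...> start tags."""

    def __init__(self, source):
        super().__init__(convert_charrefs=False)
        self.source = source
        # getpos() reports (line, column); precompute where each line starts
        # so that pair can be mapped back to an absolute offset in source.
        self.line_starts = [0] + [m.end() for m in re.finditer("\n", source)]
        self.pieces = []
        self.last = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        line, column = self.getpos()
        start = self.line_starts[line - 1] + column
        raw = self.get_starttag_text()  # the tag exactly as written
        self.pieces.append(self.source[self.last:start])
        self.pieces.append(HREF.sub(r'href="\1"', raw))
        self.last = start + len(raw)

    def result(self):
        # Everything after the last rewritten tag is copied through unchanged.
        return "".join(self.pieces) + self.source[self.last:]


def rewrite_links(source):
    parser = AnchorRewriter(source)
    parser.feed(source)
    parser.close()
    return parser.result()
```

Everything outside `<a>` start tags is copied through byte for byte, so a literal `href="../external.html?link=..."` appearing in a paragraph is left alone. With BeautifulSoup or lxml the same idea would be expressed by looping over the anchor elements and updating their `href` attribute.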

Martijn Pieters
  • Working with an HTML parser is less painful and less error-prone than using any kind of regular expression, which I would not recommend. –  Aug 25 '12 at 10:55
  • Nope, don't agree. The knee-jerk reaction of 'lol, the OP is trying to parse HTML with regular expressions' does not always apply. Sometimes a simple job can be done with RE *provided you understand the tool*. – Martijn Pieters Aug 25 '12 at 10:56
  • I think this is a good use for regexes. The OP isn't constructing or manipulating HTML, just matching a single fixed pattern. – Tim Hoffman Aug 25 '12 at 11:18
  • I really appreciate it, thanks; I just got it working. One more thing for my to-do list: regex :) – mrancys Aug 25 '12 at 13:34
0

Use an HTML parser like BeautifulSoup or lxml.html.

  • +1. Bash, sed, awk and regular expressions are [not fit to parse HTML](http://stackoverflow.com/a/1732454/1524545) sanely. – geirha Aug 25 '12 at 10:52
  • No, but bash, sed, awk and re are fit to parse simple patterns. It may be that a full-blown HTML parser is not needed here; the OP is not trying to match whole tags. – Martijn Pieters Aug 25 '12 at 10:54
  • Yup, I'm not parsing the whole document. I just want to track down this specific pattern and delete it. That's all I want to do, so the simpler the better... – mrancys Aug 25 '12 at 11:21
  • I might suggest learning a proper HTML parser now, on a simple problem, so you understand how it works in the future for more complicated problems. – chepner Aug 25 '12 at 13:32
  • We are scaling the website by converting WordPress to a static site. Everything works OK, but the tool that generates the static content produces those links. – mrancys Aug 25 '12 at 13:49