Bad named links search and replace

Question

The problem I'm facing is badly named links. There are a few hundred bad links spread across different files.

So I want to write a bash script to replace links like
<a href="../../../external.html?link=http://www.twitter.com">
<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">
<a href="../external.html?link=http://www.tumblr.com/">
with direct links like <a href="http://www.twitter.com">

I know the pattern ../ repeats one or more times. The external.html?link= part should also be removed.

How would you recommend doing this? awk, sed, maybe Python? Will I need a regex?

Thanks for your opinions.

What do you want to do with the links once you find them? – Martijn Pieters Aug 25 '12 at 10:47
Just delete them and leave the direct link. – mrancys Aug 25 '12 at 11:30

Martijn Pieters · Accepted Answer · 2012-08-25T13:03:54.213

This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.

The following Python regular expression would locate these links for you:

r'href="((?:\.\./)+external\.html\?link=)([^"]+)"'

The pattern we look for is something inside a href="" chunk of text, where that 'something' starts with one or more instances of ../, followed by external.html?link=, then followed with any text that does not contain a " quote.

The matched text after the equals sign is captured in group 2 for easy retrieval; group 1 holds the ../../external.html?link= part.
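As a quick illustration of the two groups, here is a minimal sketch that runs the pattern against one of the links from the question. The names sample, prefix and target are only placeholders for this example.

```python
import re

# Pattern from the answer: group 1 is the redirect prefix, group 2 the target URL.
pattern = re.compile(r'href="((?:\.\./)+external\.html\?link=)([^"]+)"')

# One of the links from the question, used here as sample input.
sample = '<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">'

match = pattern.search(sample)
prefix = match.group(1)  # the "../" repetitions plus "external.html?link="
target = match.group(2)  # everything up to the closing double quote

print(prefix)
print(target)
```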

If all you want to do is remove the ../../external.html?link= part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub() on your HTML files:

import re
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')

# ...
redirects.sub(r'href="\1"', somehtmlstring)
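Since the links are spread over many files, the substitution could be wrapped in a small script along these lines. This is only a sketch: the function names fix_links and fix_tree, the *.html glob and the UTF-8 encoding are assumptions, so adjust them to your site, and try it on a copy of the files first because it rewrites them in place.

```python
import re
from pathlib import Path

# Same pattern as above: only the target URL is captured.
REDIRECTS = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')


def fix_links(html: str) -> str:
    """Return html with redirect links replaced by direct links."""
    return REDIRECTS.sub(r'href="\1"', html)


def fix_tree(root: str, encoding: str = "utf-8") -> int:
    """Rewrite every *.html file under root in place; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*.html"):
        original = path.read_text(encoding=encoding)
        fixed = fix_links(original)
        if fixed != original:
            # Only touch files that actually contained a redirect link.
            path.write_text(fixed, encoding=encoding)
            changed += 1
    return changed
```

Calling fix_tree("path/to/static/site") would then process the whole export and report the number of files it modified.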

Note that this could also match body text outside HTML tags; this is not an HTML-aware solution. Chances are there is no such body text, though. If there is, you'll need a full-blown HTML parser like BeautifulSoup or lxml instead.
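To make the caveat concrete, here is a small sketch with a made-up snippet in which the pattern appears in paragraph text rather than in a tag attribute. The html string is invented for the example; the substitution rewrites it all the same.

```python
import re

redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')

# Made-up input: the pattern appears in body text, not inside an <a> tag.
html = '<p>Write href="../external.html?link=http://example.com" in your page.</p>'

result = redirects.sub(r'href="\1"', html)

# The body text is changed too, because the regex has no notion of tags.
print(result)
```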

Working with a HTML parser is less pain and less error-prone than using any kind of regular expressions - not recommendable. — , Aug 25 '12 at 10:55
Nope, don't agree. The knee-jerk reaction of 'lol, the OP is trying to parse HTML with regular expressions' does not always apply. Sometimes a simple job can be done with RE *provided you understand the tool*. — Martijn Pieters, Aug 25 '12 at 10:56
I think this is a good use for regexes. The OP isn't constructing or manipulating HTML, just a single fixed pattern. — Tim Hoffman, Aug 25 '12 at 11:18
I really appreciate it, thanks, just got it working. One more thing for my to-do list: regex :) — mrancys, Aug 25 '12 at 13:34

score 0 · Answer 2 · answered Aug 25 '12 at 10:47

Use a HTML parser like BeautifulSoup or lxml.html.

+1. Bash, sed, awk and regular expressions are [not fit to parse html](http://stackoverflow.com/a/1732454/1524545) sanely. – geirha Aug 25 '12 at 10:52
No, but bash, sed, awk and re are fit to parse simple patterns. It may be that a full-blown HTML parser is not needed here; the OP is not trying to match whole tags. – Martijn Pieters Aug 25 '12 at 10:54
Yup, I'm not parsing the whole document. I just want to track this specific pattern and delete it. That's all I want to do, so the simpler the better. – mrancys Aug 25 '12 at 11:21
I might suggest learning a proper HTML parser now, on a simple problem, so you understand how it works in the future for more complicated problems. – chepner Aug 25 '12 at 13:32
We are scaling a website by converting WordPress to a static site. Everything works OK, but the tool that generates the static content produces those links. – mrancys Aug 25 '12 at 13:49
