1

I have a file, index.html, containing data like this:

<li><a href="/battered-fried-chicken-breast-no-skin.html">battered fried chicken breast, no skin</a></li>
<li><a href="/bbq-short-ribs-with-sauce.html">bbq short ribs with sauce</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-&-fat.html">bbq spareribs & sauce (eat lean & fat)</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-only.html">bbq spareribs & sauce (eat lean only)</a></li>

I need to strip the & symbols from the URLs, such that "/bbq-spareribs-&-sauce-eat-lean-&-fat.html" becomes "/bbq-spareribs--sauce-eat-lean--fat.html". However, I do not wish to remove the & symbol from the parts of the file which are not URLs, such as the text of the link, bbq spareribs & sauce (eat lean & fat).

How would I accomplish this on a standard Linux install? It doesn't matter to me what specific tool/language is used to achieve the result so long as it works.

rps

5 Answers

2

If you're happy to install BeautifulSoup, this simple Python script may do what you want:

#!/usr/bin/env python
import sys
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(sys.stdin.read())
for a in soup.findAll("a"):
    a["href"] = a["href"].replace("&", "")

print soup

Example usage:

[me@home]$ cat your.html | python amp_remover.py
<li><a href="/battered-fried-chicken-breast-no-skin.html">battered fried chicken breast, no skin</a></li>
<li><a href="/bbq-short-ribs-with-sauce.html">bbq short ribs with sauce</a></li>
<li><a href="/bbq-spareribs--sauce-eat-lean--fat.html">bbq spareribs & sauce (eat lean & fat)</a></li>
<li><a href="/bbq-spareribs--sauce-eat-lean-only.html">bbq spareribs & sauce (eat lean only)</a></li>

Caveat: Since we're regenerating the output HTML based on a parsed representation of it, the formatting may change. Other possible changes include the explicit closing of tags if your markup is not well formed.

I may be wrong, but I suspect most solutions that use a proper XML/HTML parser will result in similar issues. To keep the file exactly as it is and remove only the offending characters, you will end up using a regex-based search and replace. Many will advise against parsing XML/HTML with regex except for really trivial patterns. In your case, that may be true, but I'm yet to be convinced.
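As a rough sketch of that regex-based route (this is not the BeautifulSoup script above, and the name strip_href_ampersands is made up for illustration), something like the following Python 3 snippet should change only the href values and leave everything else in the text as it was. It assumes every href value is wrapped in double quotes, as in the sample data; single-quoted or unquoted attributes are not handled.

```python
import re

# Matches href="..." with a double-quoted value. Single-quoted or unquoted
# attribute values are NOT matched (an assumption that fits the sample data).
HREF_RE = re.compile(r'(href=")([^"]*)(")')

def strip_href_ampersands(html):
    """Remove '&' from href values only; all other text is left untouched."""
    return HREF_RE.sub(
        lambda m: m.group(1) + m.group(2).replace("&", "") + m.group(3),
        html,
    )

# One line taken from the question's sample data.
sample = ('<li><a href="/bbq-spareribs-&-sauce-eat-lean-&-fat.html">'
          'bbq spareribs & sauce (eat lean & fat)</a></li>')
print(strip_href_ampersands(sample))
```

To use it on a real file you would read the file's contents into a string, pass it through the function, and write the result back out.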

Shawn Chin
2

If you are determined to use a simple command-line regex tool, and you know that your URLs are well behaved and that the ampersands are used consistently in the text, you could try something like:

$ sed 's/\([^ \t]\)&\([^ \t]\)/\1\2/g' file.html > out.html

This presumes that no ampersand in a URL has whitespace next to it, and that the ampersands outside the URLs are always surrounded by spaces. So this is by no means robust, but it might be simpler than installing Beautiful Soup if you just need this once and your HTML is predictable.

Eric Wilson
1

Just for completeness' sake, here is an awk solution. It should be sufficiently stable for simple tasks.

File:

$ cat file
<li><a href="/battered-fried-chicken-breast-no-skin.html">battered fried chicken breast, no skin</a></li>
<li><a href="/bbq-short-ribs-with-sauce.html">bbq short ribs with sauce</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-&-fat.html">bbq spareribs & sauce (eat lean & fat)</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-only.html">bbq spareribs & sauce (eat lean only)</a></li>

Output:

$ awk 'BEGIN{FS=OFS=">"}{for (i=1;i<=NF;i++){if ($i ~ "a href")gsub(/\&/,"",$i)}}1' file
<li><a href="/battered-fried-chicken-breast-no-skin.html">battered fried chicken breast, no skin</a></li>
<li><a href="/bbq-short-ribs-with-sauce.html">bbq short ribs with sauce</a></li>
<li><a href="/bbq-spareribs--sauce-eat-lean--fat.html">bbq spareribs & sauce (eat lean & fat)</a></li>
<li><a href="/bbq-spareribs--sauce-eat-lean-only.html">bbq spareribs & sauce (eat lean only)</a></li>

HTH Chris

Chris
0

One route is to use a tool/language that has an XML package. That package would support easy access to the anchor element's href attribute in a programmatic fashion. So, you might have something like:

aElements = doc.getElementsByTagName('a')

foreach aElement in aElements {
 string url = aElement.getHref()
 aElement.setHref( removeAmpersands( url ) )
}

I'm sure that almost all languages have packages for this. If you are open to a heavier tool like a programming language, this will be easy for you. If you just want lower-level Linux tools, that's beyond my expertise.
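To make the idea a little more concrete, here is a minimal sketch in Python 3 using only the standard library's html.parser module. The class and function names (AnchorHrefCollector, remove_href_ampersands) are invented for this example. It lets the parser find the href of each anchor, then swaps the quoted value in the original text, so the rest of the file is not reformatted. It assumes double-quoted hrefs written without entity escapes such as &amp;, since the parser reports decoded values.

```python
from html.parser import HTMLParser

class AnchorHrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)

def remove_href_ampersands(html):
    """Strip '&' from anchor hrefs while leaving the rest of the text alone."""
    collector = AnchorHrefCollector()
    collector.feed(html)
    collector.close()
    for href in collector.hrefs:
        if "&" in href:
            # Replace the value together with its surrounding double quotes,
            # so the same words in the link text are not affected.
            html = html.replace('"' + href + '"',
                                '"' + href.replace("&", "") + '"')
    return html

# One line taken from the question's sample data.
sample = ('<li><a href="/bbq-spareribs-&-sauce-eat-lean-only.html">'
          'bbq spareribs & sauce (eat lean only)</a></li>')
print(remove_href_ampersands(sample))
```

The parser is used only to locate the attribute values, which avoids the re-serialisation step that tends to change a file's formatting.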

chad
0

You could use JavaScript for this. Note that it rewrites the links in the browser when the page loads, rather than changing the file itself:

<head>

<script type="text/javascript">
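  // Use window.onload: the load event is delivered to window, not document
  window.onload = function () {
    var links = document.getElementsByTagName('a');
    for (var i = 0; i < links.length; i++) {
      // getAttribute returns the href as written, not the resolved absolute URL
      var href = links[i].getAttribute('href');
      // The g flag removes every "&"; the result is written back to the link
      if (href) links[i].setAttribute('href', href.replace(/&/g, ''));
    }
  };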
</script>

</head>
Adam Eberlin