How to strip the & symbol from only the URLs in a file?

Question

I have a file, index.html, containing data like this:

<li><a href="/battered-fried-chicken-breast-no-skin.html">battered fried chicken breast, no skin</a></li>
<li><a href="/bbq-short-ribs-with-sauce.html">bbq short ribs with sauce</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-&-fat.html">bbq spareribs & sauce (eat lean & fat)</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-only.html">bbq spareribs & sauce (eat lean only)</a></li>

I need to strip the & symbols from the URLs, such that "/bbq-spareribs-&-sauce-eat-lean-&-fat.html" becomes "/bbq-spareribs--sauce-eat-lean--fat.html". However, I do not wish to remove the & symbol from the parts of the file which are not URLs, such as the text of the link, bbq spareribs & sauce (eat lean & fat).

How would I accomplish this on a standard Linux install? It doesn't matter to me what specific tool/language is used to achieve the result so long as it works.

score 2 · Accepted Answer · edited May 23 '17 at 11:55

If you're happy to install BeautifulSoup, this simple Python script may do what you want:

#!/usr/bin/env python
import sys
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(sys.stdin.read())
for a in soup.findAll("a"):
    a["href"] = a["href"].replace("&", "")

print soup

Example usage:

[me@home]$ cat your.html | python amp_remover.py
<li><a href="/battered-fried-chicken-breast-no-skin.html">battered fried chicken breast, no skin</a></li>
<li><a href="/bbq-short-ribs-with-sauce.html">bbq short ribs with sauce</a></li>
<li><a href="/bbq-spareribs--sauce-eat-lean--fat.html">bbq spareribs & sauce (eat lean & fat)</a></li>
<li><a href="/bbq-spareribs--sauce-eat-lean-only.html">bbq spareribs & sauce (eat lean only)</a></li>

Caveat: Since we're regenerating the output HTML based on a parsed representation of it, the formatting may change. Other possible changes include the explicit closing of tags if your markup is not well formed.

I may be wrong, but I suspect most solutions that use a proper XML/HTML parser will result in similar issues. To maintain the file exactly as it is and only remove the offending chars, you will end up having to use a regex-based search and remove/replace. Many will advise against parsing XML/HTML with regex except for really trivial patterns. In your case, that may be true, but I'm yet to be convinced.
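For reference, here is a minimal sketch of what such a regex-based replace could look like, using only the Python standard library. It assumes every href value is quoted and sits on a single line, as in the sample data; the names HREF_RE and strip_amp_in_hrefs are made up for illustration. It touches any href attribute (not just those on a tags) and leaves every other byte of the input alone.

```python
import re

# Group 1: 'href=' plus the opening quote, group 2: the quote character,
# group 3: the URL itself. Assumes quoted, single-line attribute values.
HREF_RE = re.compile(r'''(\bhref\s*=\s*(["']))(.*?)\2''', re.IGNORECASE)

def strip_amp_in_hrefs(html):
    """Remove '&' inside href values only; everything else is returned as-is."""
    def clean(match):
        url = match.group(3).replace("&", "")
        return match.group(1) + url + match.group(2)
    return HREF_RE.sub(clean, html)

sample = ('<li><a href="/bbq-spareribs-&-sauce-eat-lean-&-fat.html">'
          'bbq spareribs & sauce (eat lean & fat)</a></li>')
print(strip_amp_in_hrefs(sample))
```

To process a whole file, read it into a string, pass it through strip_amp_in_hrefs, and write the result back out. Unquoted attribute values, or text that merely looks like an href attribute, would need extra handling.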

score 2 · Answer 2 · answered Sep 30 '11 at 17:32

If you are determined to use a simple command-line regex tool, and you know that your URLs are well-behaved and that the ampersands are used consistently in the text, you could try something like:

$ sed 's/\([^ \t]\)&\([^ \t]\)/\1\2/g' file.html > out.html

This presumes that no ampersand in a URL has whitespace next to it, and that the ampersands not in a URL are always surrounded by spaces. So this is by no means robust, but it might be simpler than installing Beautiful Soup if you just need this once and your HTML is predictable.
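To make the assumption concrete, here is a rough Python sketch of the same rule, applied to one line at a time. It is not an exact port: it uses lookarounds so the neighbouring characters are checked but not consumed, and the names TIGHT_AMP and strip_tight_amps are invented for this example. The second sample shows how the rule damages link text when an ampersand there has no spaces around it.

```python
import re

# Drop an "&" only when the characters on both sides are neither a space
# nor a tab. Lookarounds check the neighbours without consuming them.
TIGHT_AMP = re.compile(r'(?<=[^ \t])&(?=[^ \t])')

def strip_tight_amps(line):
    """Apply the whitespace-based rule to a single line of text."""
    return TIGHT_AMP.sub("", line)

ok = '<a href="/bbq-spareribs-&-sauce.html">bbq spareribs & sauce</a>'
bad = '<a href="/r&b.html">R&B</a>'

print(strip_tight_amps(ok))   # URL cleaned, link text untouched
print(strip_tight_amps(bad))  # the link text loses its "&" as well
```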

Chris · Answer 3 · 2011-09-30T20:48:52.673

Just for completeness' sake, here is an awk solution. It should be sufficiently stable for simple tasks.

File:

$ cat file
<li><a href="/battered-fried-chicken-breast-no-skin.html">battered fried chicken breast, no skin</a></li>
<li><a href="/bbq-short-ribs-with-sauce.html">bbq short ribs with sauce</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-&-fat.html">bbq spareribs & sauce (eat lean & fat)</a></li>
<li><a href="/bbq-spareribs-&-sauce-eat-lean-only.html">bbq spareribs & sauce (eat lean only)</a></li>

Output:

$ awk 'BEGIN{FS=OFS=">"}{for (i=1;i<=NF;i++){if ($i ~ "a href")gsub(/\&/,"",$i)}}1' file
<li><a href="/battered-fried-chicken-breast-no-skin.html">battered fried chicken breast, no skin</a></li>
<li><a href="/bbq-short-ribs-with-sauce.html">bbq short ribs with sauce</a></li>
<li><a href="/bbq-spareribs--sauce-eat-lean--fat.html">bbq spareribs & sauce (eat lean & fat)</a></li>
<li><a href="/bbq-spareribs--sauce-eat-lean-only.html">bbq spareribs & sauce (eat lean only)</a></li>

HTH Chris

score 0 · Answer 4 · answered Sep 30 '11 at 16:21

One route is to use a tool/language that has an XML package. That package would support easy access to the anchor element's href attribute in a programmatic fashion. So, you might have something like:

aElements = doc.getElement('a')

foreach aElement in aElements {
 string url = aElement.getHref()
 aElement.setHref( removeAmpersand( url ) )
}

I'm sure that almost all languages have packages for this. If you are open to using a full programming language, this will be easy for you. If you just want lower-level Linux tools, that's beyond my expertise.
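As one possible concrete version of the idea above, the sketch below uses Python's standard-library html.parser rather than an XML package, since a bare & is not well-formed XML. The class HrefCollector and the function cleaned_hrefs are hypothetical names for this example. It only reports which href values would change; it does not rewrite the file. Note that the parser decodes character references in attribute values, so the reported strings are the decoded form.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect (original, cleaned) href pairs from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value and "&" in value:
                self.pairs.append((value, value.replace("&", "")))

def cleaned_hrefs(html):
    """Return a list of (old_href, new_href) for anchors containing '&'."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs

sample = ('<li><a href="/bbq-spareribs-&-sauce-eat-lean-only.html">'
          'bbq spareribs & sauce (eat lean only)</a></li>')
for old, new in cleaned_hrefs(sample):
    print(old, "->", new)
```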

score 0 · Answer 5 · answered Sep 30 '11 at 16:26

You could use JavaScript for this (it rewrites the links in the loaded page, not the file on disk):

<head>

<script type="text/javascript">
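  // Runs once the page has finished loading.
  window.onload = function () {
    var links = document.getElementsByTagName('a');
    for (var i = 0; i < links.length; i++) {
      // Read the attribute as written, not the resolved absolute URL.
      var href = links[i].getAttribute('href');
      if (href) {
        // The g flag removes every "&", not just the first one.
        links[i].setAttribute('href', href.replace(/&/g, ''));
      }
    }
  };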
</script>

</head>
