4

This is quite an annoying task, but it should be a fairly simple one. According to this guide, I wrote this:

#!/bin/bash

content=$(wget "https://example.com/" -O -)
ampersand=$(echo '\&')

xmllint --html --xpath '//*[@id="table"]/tbody' - <<<"$content" 2>/dev/null |
    xmlstarlet sel -t \
        -m "/tbody/tr/td" \
            -o "https://example.com" \
            -v "a//@href" \
            -o "/?A=1" \
            -o "$ampersand" \
            -o "B=2" -n \

I successfully extract each link from the table, and everything gets concatenated correctly. However, instead of the ampersand being reproduced as &, I receive this at the end of each link:

https://example.com/hello-world/?A=1\&amp;B=2

But actually, I was looking for something like:

https://example.com/hello-world/?A=1&B=2

The idea is to escape the character with a backslash, \&, so that it is left alone. Initially, I tried placing it directly into -o "\&" \ instead of -o "$ampersand" \ (removing ampersand=$(echo '\&') in that case). Still the same result.

Essentially, by removing the backslash it still outputs:

https://example.com/hello-world/?A=1&amp;B=2

The only difference is that the \ in front of &amp; is gone.

Why?

I'm sure it is something basic that is missing.

Ava Barbilla

3 Answers

6

&amp; is the correct way to print & in an XML document, but since you just want a plain URL your output should not be XML. Therefore you need to switch to text mode, by passing --text or -T to the sel command.

Your example input doesn't quite work because example.com doesn't have any table elements, but here is a working example building links from p elements instead.

content=$(wget 'https://example.com/' -O -)
xmlstarlet fo --html <<<"$content" |
    xmlstarlet sel -T -t \
        -m '//p[a]' \
            --if 'not(starts-with(a//@href,"http"))' \
              -o 'https://example.com/' \
            --break \
            -v 'a//@href' \
            -o '/?A=1' \
            -o '&' \
            -o 'B=2' -n

The output is

http://www.iana.org/domains/example/?A=1&B=2
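
To see the difference between the two output modes without touching the network, here is a minimal sketch that feeds an inline fragment to the same kind of template. The `sample` markup and the `build_links` helper are made up for illustration, and the sketch assumes the binary is installed under the name `xmlstarlet` (some distributions ship it as `xml`).

```shell
#!/bin/bash
# Hypothetical sample input standing in for the table from the question.
sample='<tbody><tr><td><a href="/hello-world">Hello</a></td></tr></tbody>'

# build_links [SEL_OPTIONS...]: run the same template, optionally with -T.
# (Illustrative helper, not part of xmlstarlet.)
build_links() {
    printf '%s\n' "$sample" |
        xmlstarlet sel "$@" -t \
            -m '/tbody/tr/td' \
                -o 'https://example.com' \
                -v 'a//@href' \
                -o '/?A=1' \
                -o '&' \
                -o 'B=2' -n
}

xml_out=$(build_links)       # default mode: output is serialized as XML
text_out=$(build_links -T)   # text mode: output is written verbatim

printf 'xml mode:  %s\n' "$xml_out"
printf 'text mode: %s\n' "$text_out"
```

In line with the behaviour described in this thread, the first line should end in `?A=1&amp;B=2` and the second in `?A=1&B=2`.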
npostavs
  • Hi @npostavs, it simplifies my script really well. The `--if` is redundant in my case because all the links to be extracted are missing the base URL. Otherwise, it works great. Cheers! – Ava Barbilla Sep 17 '17 at 11:02
1

As you have already seen, backslash-escaping isn't the solution here. I can think of two possible options:

Extract the hrefs (you probably don't need both xmllint and xmlstarlet for this), then use a standard text-processing tool such as sed to add the prefix and the suffix:

sed 's,^,https://example.com/,; s,$,/?A=1\&B=2,'

Alternatively, pipe the output of what you've currently got to xmlstarlet unesc, which will change &amp; into &.
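
As a rough illustration of the first option, the sketch below runs that sed command over a couple of made-up hrefs. The `hrefs` values are placeholders, and they are assumed not to start with a slash, since the prefix already ends in one.

```shell
#!/bin/bash
# Hypothetical hrefs, one per line, as they might come out of the extraction step.
hrefs='hello-world
another-page'

# Prepend the base URL and append the query string.
# In the replacement part, \& is a literal ampersand; a bare & would
# mean "the matched text".
urls=$(printf '%s\n' "$hrefs" |
    sed 's,^,https://example.com/,; s,$,/?A=1\&B=2,')

printf '%s\n' "$urls"
# https://example.com/hello-world/?A=1&B=2
# https://example.com/another-page/?A=1&B=2
```

Using a comma as the sed delimiter avoids having to escape the slashes in the URL.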

Tom Fenech
0

Sorry, I can't reproduce your result, but why not just make a substitution? Filter your results through

sed 's/\\&amp;/\&/g'

by adding it to the end of your pipe. It should replace every \&amp; (the form shown in your output) with &.
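
If both forms from the question can occur (with and without the leading backslash), a small variation makes the backslash optional. This is only a sketch: the two sample lines are copied from the question, and it assumes a sed that supports `-E` (extended regular expressions), as GNU and BSD sed do.

```shell
#!/bin/bash
# The two output variants reported in the question.
with_backslash='https://example.com/hello-world/?A=1\&amp;B=2'
without_backslash='https://example.com/hello-world/?A=1&amp;B=2'

# \\? matches an optional literal backslash; \& in the replacement
# is a literal ampersand.
fixed=$(printf '%s\n' "$with_backslash" "$without_backslash" |
    sed -E 's/\\?&amp;/\&/g')

printf '%s\n' "$fixed"
# https://example.com/hello-world/?A=1&B=2
# https://example.com/hello-world/?A=1&B=2
```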

vollitwr