Combine multiple sed commands

Question

I have the following file:

<tr class="in">
  <th scope="row">In</th>
  <td>1.2 kB/s (0.0%)</td>
  <td>8.3 kB/s (0.0%) </td>
  <td>3.2 kB/s (0.0%) </td>
</tr>
<tr class="out">
  <th scope="row">Out</th>
  <td>6.7 kB/s (0.6%) </td>
  <td>4.2 kB/s (0.1%) </td>
  <td>1.5 kB/s (0.6%) </td>
</tr>

I want to extract the value from the second <td></td> of each row (and save the results to a file), like this:

8.3
4.2

My code so far:

# get the lines with <td> tags
cat tmp.txt | grep '<td>[0-9]*.[0-9]' > tmp2.txt

# delete whitespaces
sed -i 's/[\t ]//g' tmp2.txt

# remove <td> tag
cat tmp2.txt | sed "s/<td>//g" > tmp3.txt

# remove "kB/s (0.0%)"
cat tmp3.txt | sed "s/kB\/s\((.*)\)//g" > tmp4.txt

# remove </td> tag and save to traffic.txt
cat tmp4.txt | sed "s/<\/td>//g" > traffic.txt

#rm -R -f tmp*

How can I do this the usual way, in a single command? This code is really noobish.

Thanks in advance, Marley

The preferred way to remove whitespace is either `tr -d [:blank:]` or `tr -d ' \t'` — William Pursell, May 31 '12 at 12:33
You can save a lot of forks by avoiding useless uses of cat: `sed "..." tmpY.txt` works! — Jens, May 31 '12 at 12:49
Don't parse HTML or XML with shell tools. Use XML technology, like Xpath, which is much better suited. — Jens, May 31 '12 at 12:52

score 17 · Accepted Answer · edited Aug 24 '21 at 10:29

Use the -e option. It is specified by POSIX, so it works with GNU sed and BSD sed alike. From the GNU sed manual:

-e script, --expression=script: Add the commands in script to the set of commands to be run while processing the input.

Each -e adds one more command to the same script, so a single sed process applies all of them, in order, to every input line. (The option is unrelated to GNU sed's e command, which executes a shell command.)

So in your case you could do:

cat tmp.txt | grep '<td>[0-9]*.[0-9]' \
| sed -e 's/[\t ]//g' \
-e "s/<td>//g" \
-e "s/kB\/s\((.*)\)//g" \
-e "s/<\/td>//g" > traffic.txt

You can also write it in another way as:

grep "<td>.*</td>" tmp.txt | sed 's/<td>\([0-9.]\+\).*/\1/g'

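The \+ matches one or more instances, but it is a GNU extension and does not work in other versions of sed (macOS ships BSD sed, for example). The portable spellings are \{1,\} in a basic regular expression, or + together with sed -E.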
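As a rough sketch of how the same match can be written without \+, the snippet below uses a POSIX interval in a basic regular expression and then the + operator with -E, which both GNU and BSD sed accept. The file names line.txt, bre.txt and ere.txt are made up for the demonstration, and the leading .* is my addition to strip the indentation.

```shell
# One sample line from the question's input (line.txt is a made-up name).
printf '  <td>8.3 kB/s (0.0%%) </td>\n' > line.txt

# Basic regular expression: \{1,\} means "one or more", like GNU's \+.
sed 's/.*<td>\([0-9.]\{1,\}\).*/\1/' line.txt > bre.txt

# Extended regular expression: + works, and groups use plain parentheses.
sed -E 's/.*<td>([0-9.]+).*/\1/' line.txt > ere.txt

# Both files should now hold the same extracted value.
cat bre.txt ere.txt
```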

With help from @tripleee's comment below, this is the most refined version I could get which will work on non-GNU versions of sed as well:

sed -n 's/.*<td>\([0-9]*\.[0-9]*\).*/\1/p' tmp.txt

As a side note, you could also simply pipe the outputs through each sed instead of saving each output, which is what I see people generally do for ad-hoc tasks:

  cat tmp.txt | grep '<td>[0-9]*.[0-9]' \
    | sed -e 's/[\t ]//g' \
    | sed "s/<td>//g" \
    | sed "s/kB\/s\((.*)\)//g" \
    | sed "s/<\/td>//g" > traffic.txt

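The -e form is more efficient, because it runs a single sed process instead of one per pipeline stage, but the piped form can be more convenient when you are building a command up step by step.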
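To check that the different ways of combining commands really are interchangeable, here is a small self-contained sketch. It recreates the question's tmp.txt, then runs a simplified set of commands (my own, not the ones from the question) three ways: with several -e options, with one script separated by semicolons, and as a pipeline. The output file names are hypothetical.

```shell
# Sample input copied from the question.
cat > tmp.txt <<'EOF'
<tr class="in">
  <th scope="row">In</th>
  <td>1.2 kB/s (0.0%)</td>
  <td>8.3 kB/s (0.0%) </td>
  <td>3.2 kB/s (0.0%) </td>
</tr>
<tr class="out">
  <th scope="row">Out</th>
  <td>6.7 kB/s (0.6%) </td>
  <td>4.2 kB/s (0.1%) </td>
  <td>1.5 kB/s (0.6%) </td>
</tr>
EOF

# 1. One sed process, one -e per command.
sed -e '/<td>/!d' -e 's/.*<td>//' -e 's/ .*//' tmp.txt > with_e.txt

# 2. One sed process, commands separated by semicolons.
sed '/<td>/!d; s/.*<td>//; s/ .*//' tmp.txt > with_semicolons.txt

# 3. One process per step, connected by pipes.
grep '<td>' tmp.txt | sed 's/.*<td>//' | sed 's/ .*//' > with_pipes.txt

# All three files should be identical.
cmp with_e.txt with_semicolons.txt && cmp with_e.txt with_pipes.txt && cat with_e.txt
```

Note that, like the commands above, this prints the value of every <td> cell, not only the second one in each row.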

You can do away with the `cat`, e.g. `grep '...' tmp.txt | ...` — Shawn Chin, May 31 '12 at 10:40
You could do away with the `grep` too; `sed -e '/[0-9]*.[0-9]/!d' -e ... tmp.txt >traffic.txt` — tripleee, May 31 '12 at 10:51
An alternative of multiple `-e` is a single one with successive commands separated by `;` — mouviciel, May 31 '12 at 11:03

score 3 · Answer 2 · answered May 31 '12 at 12:02

This might work for you (GNU sed):

 sed '/^<tr/,/^<\/tr>/!d;/<td/H;/^<\/tr/!d;x;s/\n//g;s/<td>/\n/2;s/.*\n\(\S*\).*/\1/' file

Explanation:

Focus on lines between the opening <tr> and closing </tr> tags, deleting everything else. /^<tr/,/^<\/tr>/!d
Append <td> lines to the hold space (HS). /<td/H
Delete every line in the range except the closing </tr>. /^<\/tr/!d
Swap the pattern space (PS) and the HS, so the collected <td> lines are now in the PS. x
Delete all newlines. s/\n//g
Replace the 2nd <td> with a newline. s/<td>/\n/2
Delete all text in the PS except the first non-space field following the inserted newline; the result is printed automatically. s/.*\n\(\S*\).*/\1/
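The steps above can also be laid out as a multi-line script with one comment per command, which may be easier to follow. This is the same script, only reformatted; it still relies on GNU sed for \S and for \n in the replacement text. The output file name second_cells.txt is an assumption of mine.

```shell
# Sample input copied from the question.
cat > tmp.txt <<'EOF'
<tr class="in">
  <th scope="row">In</th>
  <td>1.2 kB/s (0.0%)</td>
  <td>8.3 kB/s (0.0%) </td>
  <td>3.2 kB/s (0.0%) </td>
</tr>
<tr class="out">
  <th scope="row">Out</th>
  <td>6.7 kB/s (0.6%) </td>
  <td>4.2 kB/s (0.1%) </td>
  <td>1.5 kB/s (0.6%) </td>
</tr>
EOF

sed '
  # Keep only lines inside a <tr> ... </tr> block.
  /^<tr/,/^<\/tr>/!d
  # Append each <td> line to the hold space.
  /<td/H
  # Discard every line except the closing </tr>.
  /^<\/tr/!d
  # On </tr>: fetch the collected <td> lines into the pattern space.
  x
  # Join them into a single line.
  s/\n//g
  # Mark the start of the second cell with a newline.
  s/<td>/\n/2
  # Keep only the first non-space run after that marker.
  s/.*\n\(\S*\).*/\1/
' tmp.txt > second_cells.txt

cat second_cells.txt
```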

answered May 31 '12 at 12:02

potong


I don't see anything in your expression that's unique to GNU sed except for the use of `\S`, which could be expressed portably as `[^[:space:]]`. – Barton Chittenden May 31 '12 at 12:44
@BartonChittenden I'm erring on the side of conservative as I know that the above solution works using GNU sed on a Linux box whereas it might not work for a BSD or Mac or whatever. – potong May 31 '12 at 13:18
try running it with the `--posix` flag; this will disable all GNU extensions. – Barton Chittenden May 31 '12 at 13:33

score 2 · Answer 3 · answered May 31 '12 at 12:32

You can use braces to create a block which is operated on by an address or set of addresses:

sed -n '/<td>[0-9]*.[0-9]/ {s/[\t ]//g; s/<td>//g; s/kB\/s\((.*)\)<\/td>//g;p}' tmp.txt

I think you could probably do something tricky with sed's hold and pattern spaces to keep only the second and fifth of those output lines (I've seen solutions that undo double-spacing of files this way).
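One way to keep only the second value of each row without touching the hold space is sketched below. It assumes every row has exactly three <td> cells, uses a braces block with simplified commands of my own, and pipes the result into a second sed that prints the middle line of each group of three.

```shell
# Sample input copied from the question.
cat > tmp.txt <<'EOF'
<tr class="in">
  <th scope="row">In</th>
  <td>1.2 kB/s (0.0%)</td>
  <td>8.3 kB/s (0.0%) </td>
  <td>3.2 kB/s (0.0%) </td>
</tr>
<tr class="out">
  <th scope="row">Out</th>
  <td>6.7 kB/s (0.6%) </td>
  <td>4.2 kB/s (0.1%) </td>
  <td>1.5 kB/s (0.6%) </td>
</tr>
EOF

# First sed: for lines containing <td>, strip the tag and everything after
# the number, then print. Second sed: skip a line, print a line, skip a line.
sed -n '/<td>/{
  s/.*<td>//
  s/[[:space:]].*//
  p
}' tmp.txt | sed -n 'n;p;n' > traffic.txt

cat traffic.txt
```

The bracket expression [[:space:]] is used instead of [\t ] because \t inside brackets is only understood as a tab by GNU sed.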

jam · Answer 4 · 2012-05-31T15:02:23.303

[Edit] Thanks to Barton for pointing out the mistake. Corrected version:

cat tmp.txt | grep td | sed 's/<td>\([0-9]\.[0-9]\).*/\1/g' > newtmp.txt
sed -n '2,${p;n;n}' newtmp.txt > final.txt; rm newtmp.txt

The first line will pick out the digit.digit pattern after td on each line.

The second line prints every third line starting from the second line (which effectively gives you the second line out of every group of three in the file).
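To see the line-stepping on its own, here is a small sketch with numbered lines standing in for the extracted values. The data and the file names numbers.txt and every_third.txt are made up; the extra semicolon before the closing brace is there because some non-GNU seds require it.

```shell
# Nine numbered lines stand in for the extracted values (made-up data).
printf '%s\n' 1 2 3 4 5 6 7 8 9 > numbers.txt

# From line 2 on: print the current line, then skip the next two.
sed -n '2,${p;n;n;}' numbers.txt > every_third.txt

cat every_third.txt
```

With GNU sed the step address 2~3p is a shorter way to express the same selection, but it is an extension and not portable.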

edited May 31 '12 at 15:02

answered May 31 '12 at 10:42

jam


Hm. I just tried this, but the `sed 'd;n;d;' newtmp.txt` part doesn't give any output. – Barton Chittenden May 31 '12 at 13:03
Oops, the command I tried on my system was in fact 'n;d;n'. You are quite right that 'd;n;d' gives no output... Will see if I can edit my answer with something that works as requested, as 'n;d;n' gives every other line. Good spot! – jam May 31 '12 at 14:17

William Pursell · Answer 5 · 2012-05-31T12:55:00.067

Your question about running multiple sed commands appears to have been answered, but sed is the wrong tool for this. Assume the input format is rigid: <tr> is always at the start of a line, and the td tags you are looking for are always preceded by exactly 2 spaces on the line (this solution can easily be modified if that is not the case). Then you can do:

awk -F'</?td>' '/^<tr/{i=0} /^  <td/{i++} i==2{print $2}' input-file

The first argument tells awk to split each line on either <td> or </td>, so the data you are interested in becomes the 2nd field. The first clause of the 2nd argument resets the counter i to zero whenever <tr appears at the start of a line. The next increments i each time <td appears after 2 spaces. The last prints the 2nd field for the 2nd <td> line. And the last argument specifies your input file.

Of course, that gives you everything between the <td> tags, which I see is not what you want. To just get the chunk of text between <td> and the first whitespace, try:

awk '/^<tr/{i=0} /^  <td/{i++} i==2{gsub( "<td>", ""); print $1}' input-file
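If the indentation is not guaranteed, the same counting idea can be loosened. The following is only a sketch under the assumption that each <td> cell still sits on its own line; the variable name cell and the output file name are mine.

```shell
# Sample input copied from the question.
cat > tmp.txt <<'EOF'
<tr class="in">
  <th scope="row">In</th>
  <td>1.2 kB/s (0.0%)</td>
  <td>8.3 kB/s (0.0%) </td>
  <td>3.2 kB/s (0.0%) </td>
</tr>
<tr class="out">
  <th scope="row">Out</th>
  <td>6.7 kB/s (0.6%) </td>
  <td>4.2 kB/s (0.1%) </td>
  <td>1.5 kB/s (0.6%) </td>
</tr>
EOF

awk '
  /<tr/  { cell = 0 }                 # new row: restart the cell counter
  /<td>/ { cell++
           if (cell == 2) {           # second cell of the current row
             sub(/.*<td>/, "")        # drop everything up to and including <td>
             print $1                 # first whitespace-separated field
           }
         }
' tmp.txt > traffic.txt

cat traffic.txt
```

Because sub() modifies the whole record, awk re-splits the fields afterwards, so $1 is the number that followed the tag.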
