Extract text between two words and in a specific line

Question

I'm trying to write a Linux Bash script that downloads an HTML page, extracts numbers from it, and assigns them to variables.

The HTML page has several lines, but I'm interested in these:

<tr>
      <td width="16"><img src="img/ico_message.gif"></td>
      <td width="180"><strong> TIME 1</strong></td>
      <td width="132">
        <div align="right"><strong>61</strong></div></td>
    </tr>
    <tr>
      <td width="16"><img src="img/ico_message.gif"></td>
      <td width="180"><strong> TIME 2</strong></td>
      <td width="132">
        <div align="right"><strong>65</strong></div></td>
    </tr>
  </table></td>

Every time I download the page I have to read the two values in rows 5 and 11, between strong> and </strong (61 and 65 in this example, but they are different each time).

I need to assign the two values extracted from the HTML to two variables.

Thanks for any ideas.

Bash is not the right tool for the job. I'd use an HTML-aware tool ([xsh](http://metacpan.org/pod/distribution/XML-XSH2/xsh) in my case) if the markup isn't too broken, or [HTML::TableExtract](http://p3rl.org/HTML::TableExtract) in Perl. — choroba, Jun 28 '18 at 11:16
You should use an `xpath` utility to parse xml/html. There are command line `xpath` tools you can invoke from a bash script. — ccarton, Jun 28 '18 at 11:22
Welcome to Stack Overflow! Sorry, this is not the way StackOverflow works. Questions of the form "I want to do X, please give me tips and/or sample code" are considered off-topic. Please visit the [help] and read [ask], and especially read [Why is “Can someone help me?” not an actual question?](http://meta.stackoverflow.com/q/284236) — kvantour, Jun 28 '18 at 11:58
Have a look at [this](https://stackoverflow.com/a/50713910/8344060) answer which shows how to extract links from an html using Xpath. And look at [this](http://www.zvon.org/xxl/XPathTutorial/General/examples.html) page to understand Xpath. With these two, I am 100% convinced you can do it ;-). If you still don't manage, please post your efforts here and we gladly help you out. — kvantour, Jun 28 '18 at 12:02

score 0 · Answer 1 · answered Jun 29 '18 at 05:06

Let's assume we have a page called page.html. You can first select the matching text with grep, then strip the tags with sed, and finally pick each value by its position with awk:

# odd-numbered matches: the first value
$ var0=$(grep -Eo "<strong>[0-9]+</strong>" page.html |
    sed -E "s/<strong>([0-9]+)<\/strong>/\1/g" |
    awk 'NR%2==1')

# even-numbered matches: the second value
$ var1=$(grep -Eo "<strong>[0-9]+</strong>" page.html |
    sed -E "s/<strong>([0-9]+)<\/strong>/\1/g" |
    awk 'NR%2==0')

output:

$ echo $var0
61
$ echo $var1
65
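
A variation on the same idea, as a minimal sketch: the page is scanned once with sed and the two results are then split by position. The file name page.html and the variable names values, time1 and time2 are assumptions for illustration, and the sample input is copied from the question. It assumes the page contains exactly two `<strong>number</strong>` cells, in the order shown.

```shell
# Sample input standing in for the downloaded page (file name is an assumption).
cat > page.html <<'EOF'
<tr>
      <td width="16"><img src="img/ico_message.gif"></td>
      <td width="180"><strong> TIME 1</strong></td>
      <td width="132">
        <div align="right"><strong>61</strong></div></td>
    </tr>
    <tr>
      <td width="16"><img src="img/ico_message.gif"></td>
      <td width="180"><strong> TIME 2</strong></td>
      <td width="132">
        <div align="right"><strong>65</strong></div></td>
    </tr>
  </table></td>
EOF

# Keep only the digits of lines that contain <strong>digits</strong>.
# With -n and the p flag, sed prints nothing for non-matching lines.
values=$(sed -n 's:.*<strong>\([0-9][0-9]*\)</strong>.*:\1:p' page.html)

# Split the result by position: first match, then second match.
time1=$(printf '%s\n' "$values" | sed -n '1p')
time2=$(printf '%s\n' "$values" | sed -n '2p')

echo "TIME 1 = $time1, TIME 2 = $time2"
```

With the sample input this prints `TIME 1 = 61, TIME 2 = 65`. The "TIME 1" cells are skipped because their text does not consist of digits only.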

score 0 · Answer 2 · answered Jun 29 '18 at 07:59


This might work for you (GNU sed):

sed -rn '/TIME/{:a;N;5bb;11bb;ba;:b;s/.*TIME ([^<]*).*<strong>([^<]*).*/var\1=\2/p}' file

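The integer that follows TIME becomes part of the variable name, which differentiates the two variables. Assuming the snippet starts at line 1 of the file (as the row numbers 5 and 11 in the question imply), the command prints var1=61 and var2=65 for the sample above.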
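
If the line numbers cannot be relied on, a similar result can be sketched with awk by keying on the TIME label instead. This is only an illustration under assumptions, not the command above: the file name page.html and the variable name assignments are hypothetical, and it assumes each value appears in the first `<strong>number</strong>` cell after its "TIME n" label.

```shell
# Sample input standing in for the downloaded page (file name is an assumption).
cat > page.html <<'EOF'
<tr>
      <td width="16"><img src="img/ico_message.gif"></td>
      <td width="180"><strong> TIME 1</strong></td>
      <td width="132">
        <div align="right"><strong>61</strong></div></td>
    </tr>
    <tr>
      <td width="16"><img src="img/ico_message.gif"></td>
      <td width="180"><strong> TIME 2</strong></td>
      <td width="132">
        <div align="right"><strong>65</strong></div></td>
    </tr>
  </table></td>
EOF

assignments=$(awk '
  # Remember the label number when a "TIME n" cell is seen
  # ("TIME " is 5 characters long).
  match($0, /TIME [0-9]+/) {
      label = substr($0, RSTART + 5, RLENGTH - 5)
      next
  }
  # The next <strong>number</strong> cell belongs to that label
  # (the opening tag is 8 characters, both tags together are 17).
  label != "" && match($0, /<strong>[0-9]+<\/strong>/) {
      print "var" label "=" substr($0, RSTART + 8, RLENGTH - 17)
      label = ""
  }
' page.html)

printf '%s\n' "$assignments"
```

For the sample input this prints var1=61 and var2=65 on separate lines, the same name=value form as the sed command.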

Answered by potong.
