0
<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>

I want to extract the words "with 3km/h SSW winds" from the line above using the 'grep' command. Note that this string will change, so hardcoding it won't work. I have been trying for a long time and am completely lost. Any help would be appreciated.

noobcoder
  • Your input is XML, so it would be better to use an XML parser. But if you really want shell scripts, you can use `sed` or `awk`. – alvits Mar 29 '14 at 02:29
  • Thought so, but I don't know how else to do it. Are there any commands that could help me with this? I am new to bash – noobcoder Mar 29 '14 at 02:29
  • @alvits ahhhhh, can't use sed or awk unfortunately – noobcoder Mar 29 '14 at 02:30
  • You might want to look at `xmlstarlet` instead. – devnull Mar 29 '14 at 02:38
  • Don't parse HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. But if you so desire, I'll add an answer – zedfoxus Mar 29 '14 at 02:41

4 Answers

2

Here's a GNU grep solution that uses -P to activate support for PCREs (Perl-Compatible Regular Expressions):

grep -Po '"cur_wind">\K[^<]+' \
  <<<'<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'
  • -o specifies that only the matching string be output
  • \K is a PCRE feature that drops everything matched so far from the reported match; this lets you supply context for more specific matching without including that context in the output.
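In a script you would usually want the result in a variable rather than on the terminal. The following is a minimal sketch of that, assuming GNU grep built with PCRE support; the variable names `line` and `wind` are made up for illustration, and in practice the input would come from a file or a download rather than a literal string.

```shell
#!/usr/bin/env bash
# Sample line from the question (hypothetical variable name).
line='<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

# -P enables PCRE (GNU grep only), -o prints only the matched part,
# and \K drops the '"cur_wind">' prefix from the reported match.
wind=$(grep -Po '"cur_wind">\K[^<]+' <<<"$line")

printf '%s\n' "$wind"
```

With the sample line this prints with 3km/h SSW winds; if the pattern does not match, grep exits with status 1 and the variable stays empty.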

Another option is to use a look-behind assertion in lieu of \K:

 grep -Po '(?<="cur_wind">)[^<]+' \
  <<<'<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

Of course, this kind of matching relies on the specific formatting of the input string (whitespace, single- vs. double-quoting, ordering of attributes, ... - in addition to the fundamental problem of grep not understanding the structure of the data) and is thus fragile.
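To make the fragility concrete, here is a small sketch that feeds the same pattern a variant of the markup with a single-quoted attribute. The variant line is invented for illustration; it is not from the question's actual data.

```shell
#!/usr/bin/env bash
# The pattern from above, which expects a double-quoted attribute value.
pattern='"cur_wind">\K[^<]+'

# Hypothetical variant: same element, but the attribute uses single quotes.
variant="<span class='cur_wind'>with 3km/h SSW winds</span>"

# grep exits with a non-zero status when nothing matches.
if result=$(grep -Po "$pattern" <<<"$variant"); then
  status='matched'
else
  status='no match'
fi

printf '%s\n' "$status"
```

The markup is equivalent as far as an HTML parser is concerned, yet the regex finds nothing, so the script reports no match.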

Thus, in general, as others have noted, grep is the wrong tool for the job.

On OSX, assuming the input is XML (or XHTML), you can parse robustly with the stock xmllint utility and an XPath expression:

xmllint --xpath '//span[@class="cur_wind"]/text()' - <<<\
 '<td><span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

Here's a similar solution using a third-party utility, the multi-platform web-scraping utility xidel (which handles both HTML and XML):

xidel -q -e '//span[@class="cur_wind"]' - <<<\
 '<td><span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'
mklement0
  • This is slick and informative, despite knowing that OP's homework is forcing the OP to use the wrong tool for the job. Nice work +1 – zedfoxus Mar 29 '14 at 03:15
1

Try sed:

echo '<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>' | sed -e 's/<[^>]*>//g'

Output

with 3km/h SSW winds

Explanation

  • echo 'whatever' prints the word whatever to the screen (standard output, aka stdout)
  • The | symbol is a pipe. The command to its right takes the output of echo as its input and does something with it
  • sed is a stream editor. Its -e switch tells sed to evaluate the script or expression that follows
  • The s/xyz/abc/g format is simple. s/ means substitute and /g means globally, so it substitutes all occurrences of xyz with abc
  • s/<[^>]*>//g gets interesting. Let's focus on <[^>]*>. It matches anything that starts with <, continues with any number of characters other than >, and ends with >. Because the replacement part is empty, every such match is deleted
  • Check out your <span class="cur_wind"> for example. That tag starts with <, then contains characters other than >, and then has a >. sed says, when such text is found, chop it off (replace it with nothing)
  • The same technique is applied to </span>, <hr class="hr_sm" /> and </td>. What remains is the text you want

This is a somewhat simplified explanation.
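One consequence worth seeing: because the expression removes every tag, it keeps the text of every element on the line, not only the cur_wind span. A minimal sketch, where the cur_temp span and its content are invented purely for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical input with a second span next to the one we care about.
html='<td><span class="cur_temp">21 C</span><span class="cur_wind">with 3km/h SSW winds</span></td>'

# Delete every <...> tag; all remaining text is kept and run together.
stripped=$(printf '%s\n' "$html" | sed -e 's/<[^>]*>//g')

printf '%s\n' "$stripped"
```

This prints 21 Cwith 3km/h SSW winds, so the approach only gives the wanted string when the line contains no other text.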

zedfoxus
  • Thank you, but this is part of my homework and I am not allowed to use sed or awk – noobcoder Mar 29 '14 at 02:48
  • @noobcoder, the lecturer who assigns `grep` for extracting content from XML shouldn't have their job. XML is not a regular language, so correctly parsing it with regular expressions (the only method grep has available) is not even theoretically possible. You can write something that's a bad approximation, but it's only ever that -- a bad approximation. – Charles Duffy Mar 29 '14 at 02:56
  • I do agree with @CharlesDuffy that `grep` shouldn't be the tool to extract data. Grep is for matching/finding...as in...find `ridiculous homework problem` from `profanity.txt` file. And as my previous comment mentioned, HTML shouldn't be parsed with regex – zedfoxus Mar 29 '14 at 03:01
1

grep doesn't know XML, and thus is the wrong tool for the job; use a real XML parser. One of the better ones easily accessible from bash is XMLStarlet.

xmlstarlet sel -t -m "//span[@class='cur_wind']/text()" -v . -n <input.xml

This extracts all text directly contained within a span of the class cur_wind.

Charles Duffy
0

If that is all you want, then cat | grep ".with 3km/h SSW winds." should do it, but I suspect there is more than that that you need.
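Since the question says the string will change, here is a rough sketch that avoids hardcoding it by chaining two grep -o calls. It assumes a grep that supports -o and -E (both GNU and BSD grep do) and input formatted exactly like the sample line; the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sample line from the question (hypothetical variable name).
line='<span class="cur_wind">with 3km/h SSW winds</span><hr class="hr_sm" /></td>'

# Step 1: keep only '"cur_wind">' plus the text up to the next '<'.
# Step 2: keep only the trailing run of characters that follows the last '>'.
wind=$(printf '%s\n' "$line" | grep -o '"cur_wind">[^<]*' | grep -oE '[^>]+$')

printf '%s\n' "$wind"
```

For the sample line this prints with 3km/h SSW winds. It is just as dependent on the exact formatting of the input as any other regex approach.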

nPn