
I'm having trouble using grep to search through some HTML code.

I'm trying to find strings similar to this:

<td><a href='/go/12229' target="_blank" rel="nofollow">product description here</a></td><td> $<font color='red'>0.25</font>

I'm trying to generalize the pattern to count each line that is under $0.25. The parts that will vary are: the href='/go/12229', where the number after /go/ will change but will always be a number 5 digits long;

the product description, which can be alphanumeric with spaces and special characters;

and the price, which can be anything from 0.01 to 0.25.

I've tried making patterns like the one below, but it either does not work or returns nothing.

grep -c "href='/go/'[*] target="_blank" rel="nofollow">*</a></td><td> $<font color='red'>[0].[0-2][0-9]</font>"

I think it has to do with me not escaping special characters correctly, but I'm not sure.

Any help is appreciated.

almyz125
  • How much of that line is required to identify it? For example, is it enough to know that there's a 'go' href, like `grep 'go\/[0-9]\{5\}'`? If so, I'd do that grep then pass it to awk/gawk (or another scripting language) to test the value. – n0741337 Apr 22 '13 at 18:17
  • I need the link, the product description and the price – almyz125 Apr 22 '13 at 18:20

1 Answer


Okay - this requires that each line be formatted as in your example, but this should give you the link, description and price for each line where the price is between 0.01 and 0.25. Take the contents of the code below, put them in a file like "priceawk" and make it executable:

#!/bin/sh
# keep only lines that contain a /go/NNNNN link, then split the pieces out with awk
grep 'go\/[0-9]\{5\}' | awk -F"<" '
{
# field 7 ends with the price, e.g. font color=red>0.25, so take what follows the ">"
split( $7, price_arr, ">" )

# keep prices from 0.01 through 0.25
if( price_arr[ 2 ] > 0.00 && price_arr[ 2 ] < 0.26 )
    {
    # field 3 holds the href and the description; the link sits between the single quotes
    split( $3, link_arr, "'\''" )
    split( link_arr[ 3 ], desc_arr, ">" )
    printf( "%s %s %s\n", link_arr[ 2 ], desc_arr[ 2 ], price_arr[ 2 ] )
    }
} '

Then use it like:

cat input | priceawk
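
If the shell can't find priceawk, you may need to mark it executable first and call it with an explicit path, e.g.:

chmod +x priceawk
cat input | ./priceawk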

With a test input file I made from your line, I get the following kinds of output:

/go/12229 product description here 0.25
/go/13455 find this line2 0.01
/go/12334 find this line3 0.23
/go/34455 find this line4 0.16

The printf() can be improved to give your output in a different form, with a more useful delimiter than the current space.
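
For example, a tab-separated variant (just a sketch of the idea) would only change the printf line:

printf( "%s\t%s\t%s\n", link_arr[ 2 ], desc_arr[ 2 ], price_arr[ 2 ] )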

n0741337
  • I'm not getting any output. Do I want spaces where the returns are? And do I want to put the path to my file where you have 'greptest'? – almyz125 Apr 22 '13 at 18:53
  • Yes. I'll reformat the answer to make it a file you can turn into an executable instead. – n0741337 Apr 22 '13 at 18:57
  • Hmm, I did what you said but the shell still returns no output. Maybe it's easier to have it write the output to a file? Thank you for all your help, by the way! – almyz125 Apr 22 '13 at 19:12
  • I also noticed that when I ran cat /path/to/input | /path/to/priceawk it created a new file called cat, and within that file it has the path to my input file. – almyz125 Apr 22 '13 at 19:17
  • No problem. I just copied/pasted your example line into a file which I called "input", faked some extra lines of data, then piped it into the executable file I called "priceawk" to get some results. Did you paste the edit into a file and make it executable? Are the input lines different than what's in the question? – n0741337 Apr 22 '13 at 19:17
  • Yes, I made it executable and chowned it. The actual HTML file has a ton more code in it. – almyz125 Apr 22 '13 at 19:25
  • I should also say that when I get the entire source of the page through curl, it does not save any formatting, so it is not set into lines of neat code; it is just one page of code from top to bottom. – almyz125 Apr 22 '13 at 19:30
  • I think you have a couple of options then. For one you could break up the source material to have line breaks to help awk'ing it. Or you could check out [xpath](http://en.wikipedia.org/wiki/Xpath) parsing like this [xml parsing in bash](http://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash) or another variant depending on the scripting languages you have available and then manage everything from that scripting language. Things like [firebug](https://getfirebug.com/) etc could help you determine the best xpath to use to retrieve your data. – n0741337 Apr 22 '13 at 20:41
  • One more question: how would I grep for the product description? Any character, including spaces, and any length. – almyz125 Apr 22 '13 at 21:26
  • I can basically search grep -o "Product Description $0.24" but I need a pattern that allows anything in the product description area. – almyz125 Apr 22 '13 at 21:39
  • From the sample in the question you can directly feed it to awk with a longer separator like `cat input | awk -F"nofollow.>" ' { split( $2, arr, "<" ); print arr[ 1 ]; }'` to get only the descriptions. For xpath, I couldn't say ( because I can't see the whole document ), but check out the [xpath usage examples](http://en.wikipedia.org/wiki/Xpath#Usage_examples) especially the "text()" calls. These are both brittle approaches because they assume the inputs won't change over time. Just remember that delimiters can be as big as they need to be and you can narrow xpath by their attributes – n0741337 Apr 22 '13 at 21:49
  • Replace 'Product Description' with `.*`. I also had to escape the dollar sign like `\$` before I could get it to work. Your comment points out to me that I need an updated copy of Unix in a Nutshell - mine doesn't even have the -o flag. Now I've learned something new! Thanks! You can use the same escape sequence to get any price/description like `grep -o ".* \$.*"` – n0741337 Apr 22 '13 at 22:16
  • If I do grep -o "\(.*\) $0.24" input, it returns everything in the file before the (.*) and ends the output where it should. Any ideas how to end the \(.*\) at the part that is ">? – almyz125 Apr 22 '13 at 22:28
  • Ah, I thought you had split up the input into lines. grep, awk, head etc. all work on a line-by-line basis. To get them to work effectively, you need to change that one big line into several individual lines (see the sketch after this thread). `.*` is greedy here. If your descriptions are always a certain length, you could use that knowledge. Otherwise, you could edit the file with something like gvim and find/replace things that would make good line breaks, or pipe the initial output into awk/gawk to do the same thing. Your input line may be too big for sed or even awk or gvim parsing. Or try xpath parsing out. – n0741337 Apr 22 '13 at 22:39
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/28695/discussion-between-almyz125-and-n0741337) – almyz125 Apr 23 '13 at 00:49
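
A minimal sketch of the line-splitting idea from the thread above, assuming GNU sed (for the \n in the replacement) and that each record in the real page ends with </font> as in the sample line; the URL is only a placeholder:

curl -s 'http://example.com/page' | sed 's#</font>#</font>\n#g' | ./priceawk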