Extract substring from lines using grep, awk, sed, etc.

Question

I have a file with many lines like:

<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>

I want to extract www.youtube.com/user/airuike and lily weisy, and then also separate airuike from www.youtube.com/user/.

So I want to get 3 strings: www.youtube.com/user/airuike, airuike and lily weisy.

How can I achieve this? Thanks.

Do you have to use awk or grep? There are better ways to parse HTML. — Will C., Dec 21 '12 at 00:29
[Google](http://google.com) is a great resource for learning how to do things that you don't know how to do. — jahroy, Dec 21 '12 at 00:43
[Regex are not for HTML](http://stackoverflow.com/a/1732454/76722), use an actual HTML parser instead. — Sebastian Mach, Dec 21 '12 at 13:45

score 3 · Accepted Answer · edited Dec 21 '12 at 15:51

Do this:

sed -e 's/.*href="\([^"]*\)".*>\([^<]*\)<.*/link:\1 name:\2/' < data

That will give you the first part (the link and the name), but I'm not sure what you are doing with it after this.
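One way to get all three strings in a single pass is to add a nested capture group for the last path segment of the link. The following is a minimal sketch, not part of the original answer: it assumes every line holds exactly one anchor whose href starts with http://, and the `line` variable and `extract_fields` function are hypothetical names used only for illustration.

```shell
# Sample input copied from the question (one anchor per line is assumed).
line='<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>'

# Hypothetical helper: reads lines on stdin, prints "link, user, name".
#   \1 = link without the scheme, \2 = last path segment, \3 = link text.
# '#' is used as the s-command delimiter so the slashes need no escaping.
extract_fields() {
  sed -e 's#.*href="http://\([^"]*/\([^"/]*\)\)".*>\([^<]*\)<.*#\1, \2, \3#'
}

result=$(printf '%s\n' "$line" | extract_fields)
printf '%s\n' "$result"
```

Like the original command, this only holds up while the markup stays in exactly this shape; lines that do not match the pattern are passed through unchanged.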

edited Dec 21 '12 at 15:51 by Sebastian Mach

answered Dec 21 '12 at 00:47 by kdubs

what if a developer writes ` – Sebastian Mach Dec 21 '12 at 13:42
The first .* will skip everything up to the href, so it should still work. – kdubs Dec 21 '12 at 14:55
But _should_ it work? Of course it is unclear from the question, but I believe regexes are simply not the right tool for HTML, except in one-shot hacks (not as part of real projects). – Sebastian Mach Dec 21 '12 at 14:59
Oh, I don't disagree with that, but the question was asking for sed. I'd rather pull it into something that can parse the DOM, but that wasn't the question. So it will work on the given string; in another context, who knows. – kdubs Dec 21 '12 at 15:43

BeniBela · Answer 2 · 2012-12-21T01:13:19.117

Since this is HTML, and HTML should be parsed with an HTML parser rather than with grep/sed/awk, you could use the pattern-matching feature of my Xidel.

 xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{$link := @href, $user := substring-after($link, "www.youtube.com/user/"), $name:=text()}</a>*'

Or if you want a CSV like result:

 xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{string-join((@href, substring-after(@href, "www.youtube.com/user/"), text()), ", ")}</a>*' --hide-variable-names

It is kind of sad that you also want the airuike string; otherwise it could be as simple as:

 xidel yourfile.html -e '<a href="{$link}" class="yt-uix-sessionlink yt-user-name ">{$name}</a>*'

(You were supposed to be able to use xidel '<a href="{$link:=., $user := filter($link, www.youtube.com/user/(.*)\', 1)}" class="yt-uix-sessionlink yt-user-name " dir="ltr">{$name}</a>*', but it seems I haven't thought the syntax through. A single error check is breaking everything.)

score 1 · Answer 3 · answered Dec 21 '12 at 06:44

$ awk '{split($0,a,/(["<>]|:\/\/)/); u=a[4]; sub(/.*\//,"",a[4]); print u,a[4],a[12]}' file
www.youtube.com/user/airuike airuike lily weisy
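To see why the command above picks a[4] and a[12], it can help to dump every piece that split() produces. This is only an illustrative sketch using the sample line from the question; the `line` and `fields` variable names are assumptions, and the indices shift as soon as the attributes change in number or order.

```shell
# Sample input copied from the question.
line='<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>'

# Split on ", <, > or :// exactly as in the answer, then print index=[value]
# for each piece so the positions of the link and the name are visible.
fields=$(printf '%s\n' "$line" |
  awk '{ n = split($0, a, /(["<>]|:\/\/)/)
         for (i = 1; i <= n; i++) printf "%d=[%s]\n", i, a[i] }')

printf '%s\n' "$fields"
```

In the dump, index 4 holds the link without its scheme and index 12 holds the link text, which is what the one-liner relies on.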

answered Dec 21 '12 at 06:44 by Ed Morton

Pau Fracés · Answer 4 · 2012-12-21T01:01:36.580

score 0

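I think something like this should work:

while IFS= read -r line
do
    # Full link: everything from "http" up to the closing quote
    href=$(printf '%s\n' "$line" | grep -o 'http[^"]*')
    # User name: whatever follows the last slash of the link
    user=$(printf '%s\n' "$href" | grep -o '[^/]*$')
    # Link text: the text just before the closing </a>
    text=$(printf '%s\n' "$line" | grep -o '[^>]*</a>$' | grep -o '^[^<]*')

    echo "href: $href"
    echo "user: $user"
    echo "text: $text"
done < yourfile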
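Once the link is in a variable, the scheme and the user name can also be separated with POSIX parameter expansion instead of extra grep calls. A small sketch, assuming the link always has the form scheme://host/user/NAME; the `url` variable is a name introduced here for illustration.

```shell
# Link as extracted by the loop for the sample line in the question.
href='http://www.youtube.com/user/airuike'

url=${href#*://}   # drop the shortest prefix ending in "://" (the scheme)
user=${url##*/}    # drop the longest prefix ending in "/" (keep last segment)

printf '%s\n' "$url" "$user"
```

This avoids spawning a process per field, which matters when the input file has many lines.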

Regular expressions basics: http://en.wikipedia.org/wiki/Regular_expression#POSIX_Basic_Regular_Expressions

Upd: checked and fixed

edited Dec 21 '12 at 01:01

answered Dec 21 '12 at 00:52 by Pau Fracés

will this not also match `Lucy's new "http-parser"!`? – Sebastian Mach Dec 21 '12 at 13:44
You are right, but I assumed that the formatting was homogeneous. If you know that all lines are in the same format, the regex can be simpler. – Pau Fracés Dec 21 '12 at 15:56
