-1

I have a file with many lines like:

<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>

I want to extract www.youtube.com/user/airuike and lily weisy, and I also want to separate airuike from www.youtube.com/user/.

So I want to get three strings: www.youtube.com/user/airuike, airuike, and lily weisy.

How can I achieve this? Thanks.

wenzi

4 Answers

3

Do this:

sed -e 's/.*href="\([^"]*\)".*>\([^<]*\)<.*/link:\1 name:\2/' < data

This will give you the first part (the link and the name). But I'm not sure what you want to do with it after that.
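If the goal is all three strings from the question in one pass, the same idea can be extended with a nested capture group. This is only a sketch that assumes every link looks like the sample line (an http:// URL, and link text with no nested tags); the `extract` function name and the `|` separator are my own choices, not something from the question.

```shell
# Sample line from the question; real input would come from the file.
line='<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>'

# extract is a hypothetical helper name.
# \1 = link without the scheme, \2 = last path component (the user),
# \3 = link text. -n plus the p flag prints only lines that matched.
extract() {
    sed -n 's#.*href="http://\([^"]*/\([^"/]*\)\)"[^>]*>\([^<]*\)</a>.*#\1|\2|\3#p'
}

printf '%s\n' "$line" | extract
```

On the sample line this prints `www.youtube.com/user/airuike|airuike|lily weisy`. To run it over the whole file, use `extract < data` instead of the `printf` pipe.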

Sebastian Mach
kdubs
  • what if a developer writes ` – Sebastian Mach Dec 21 '12 at 13:42
  • the first .* will skip everything up to the href, so it should still work. – kdubs Dec 21 '12 at 14:55
  • but _should_ it work? Of course, the question leaves that unclear, but I believe regexes are simply not the right tool for HTML, except in one-shot hacks (not as part of real projects) – Sebastian Mach Dec 21 '12 at 14:59
  • oh, I don't disagree with that, but the question was asking for sed. I'd rather pull it into something that can parse the DOM, but that wasn't the question. So it will work on the given string; in another context, who knows – kdubs Dec 21 '12 at 15:43
1

Since it is HTML, and HTML should be parsed with an HTML parser and not with grep/sed/awk, you could use the pattern matching function of my Xidel.

 xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{$link := @href, $user := substring-after($link, "www.youtube.com/user/"), $name:=text()}</a>*'

Or if you want a CSV like result:

 xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{string-join((@href, substring-after(@href, "www.youtube.com/user/"), text()), ", ")}</a>*' --hide-variable-names

It is kind of sad that you also want the airuike string; otherwise it could be as simple as

 xidel yourfile.html -e '<a href="{$link}" class="yt-uix-sessionlink yt-user-name ">{$name}</a>*'

(and you were supposed to be able to use xidel '<a href="{$link:=., $user := filter($link, www.youtube.com/user/(.*)\', 1)}" class="yt-uix-sessionlink yt-user-name " dir="ltr">{$name}</a>*', but it seems I haven't thought the syntax through. Just one error check and it is breaking everything. )

BeniBela
1
$ awk '{split($0,a,/(["<>]|:\/\/)/); u=a[4]; sub(/.*\//,"",a[4]); print u,a[4],a[12]}' file
www.youtube.com/user/airuike airuike lily weisy
Ed Morton
0

I think something like this should work:

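while IFS= read -r line
do
    # Full URL: everything from "http" up to the closing quote
    href=$(printf '%s\n' "$line" | grep -o 'http[^"]*')
    # User name: last path component of the URL
    user=$(printf '%s\n' "$href" | grep -o '[^/]*$')
    # Link text: whatever sits right before the closing </a> at end of line
    text=$(printf '%s\n' "$line" | grep -o '[^>]*</a>$' | grep -o '^[^<]*')

    echo "href: $href"
    echo "user: $user"
    echo "text: $text"
done < yourfile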

Regular expressions basics: http://en.wikipedia.org/wiki/Regular_expression#POSIX_Basic_Regular_Expressions

Update: checked and fixed.
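A variation on the loop above, in case spawning grep several times per line turns out to be slow on a big file: the same three pieces can be cut out with the shell's own parameter expansion. This swaps the grep technique for a different one, and it assumes the same line shape as the sample (one href per line, link text with no nested tags). The `parse_anchor` name and the tab-separated output are made up for the example.

```shell
# Sample line from the question.
sample='<a href="http://www.youtube.com/user/airuike" class="yt-uix-sessionlink yt-user-name " data-sessionlink="ei=CKPW6LXqqbQCFSqVIQod_BwsaQ%3D%3D" dir="ltr">lily weisy</a>'

# parse_anchor is a hypothetical helper: prints link, user and text,
# separated by tabs, using only parameter expansion.
parse_anchor() {
    line=$1
    href=${line#*href=\"}      # drop everything through href="
    href=${href%%\"*}          # keep up to the closing quote
    href=${href#*://}          # strip the scheme (http://)
    user=${href##*/}           # last path component
    text=${line%'</a>'*}       # drop the closing tag and what follows
    text=${text##*'>'}         # keep what comes after the last >
    printf '%s\t%s\t%s\n' "$href" "$user" "$text"
}

# Read lines the same way as the loop above; here the sample is piped in.
printf '%s\n' "$sample" | while IFS= read -r l
do
    parse_anchor "$l"
done
```

For the real data, replace the `printf` pipe with `done < yourfile` as in the original loop.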

Pau Fracés