I need to extract the video titles from YouTube's index.html. I have been able to split the file into small chunks, each containing one video listing, but I cannot extract the video title from a chunk. My professor provided the following command, but I cannot get it to work in this situation.

number=`expr "$s" : ".*\/\([0-9,]*\)\/"`; echo $number # will print 250,4211

Although I'm not completely sure, I think I'm having trouble getting this method to work because there aren't spaces between the video title and surrounding text. Here is a sample of what I would need to extract the title from:

<li class="video-list-item "><a href="/watch?v=9BbgvlgDQMg&amp;feature=g-sptl&amp;cid=inp-hs-edt" class="video-list-item-link yt-uix-sessionlink" data-sessionlink="ei=CMzmroaB5bICFRiXIQoda3kX5g%3D%3D&amp;feature=g-sptl%26cid%3Dinp-hs-edt" ><span class="ux-thumb-wrap contains-addto "><span class="video-thumb ux-thumb yt-thumb-default-120 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif" alt="Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin" data-thumb="//i2.ytimg.com/vi/9BbgvlgDQMg/default.jpg" width="120" ><span class="vertical-align"></span></span></span></span><span class="video-time">3:51</span>

Out of this chunk of text, I would need to extract "Lil' Buck "Golden Gateway" Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin", without the quotes.
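
On the suspicion above that missing spaces are the problem: `expr STRING : REGEX` matches the regex anchored at the start of the string and prints whatever the first `\( \)` group captured, so no whitespace is required around the captured text. A minimal sketch, assuming made-up sample values (the original value of `$s` is not shown here, and `t` is a shortened stand-in for the real chunk):

```shell
# Hypothetical input; the real $s from the assignment is not shown.
s='/views/250,4211/'
# .* consumes the leading text, \( \) captures the digits and commas
# found between the last two slashes.
number=$(expr "$s" : '.*/\([0-9,]*\)/')
echo "$number"

# The captured text can sit directly against its delimiters.
t='<img alt="Some Title" width="120">'
title=$(expr "$t" : '.*alt="\([^"]*\)"')
echo "$title"
```

The first `echo` prints `250,4211` and the second prints `Some Title`. What matters is that the literal text on both sides of the group matches the input exactly.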

Kevin
Mike

2 Answers

You can use the bash regex \<img.*alt=\"([^\"]*)\" to extract the alt text from the img element.

Example:

$ cat file
<li class="video-list-item "><a href="/watch?v=9BbgvlgDQMg&amp;feature=g-sptl&amp;cid=inp-hs-edt" class="video-list-item-link yt-uix-sessionlink" data-sessionlink="ei=CMzmroaB5bICFRiXIQoda3kX5g%3D%3D&amp;feature=g-sptl%26cid%3Dinp-hs-edt" ><span class="ux-thumb-wrap contains-addto "><span class="video-thumb ux-thumb yt-thumb-default-120 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif" alt="Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin" data-thumb="//i2.ytimg.com/vi/9BbgvlgDQMg/default.jpg" width="120" ><span class="vertical-align"></span></span></span></span><span class="video-time">3:51</span>

$ line="$(cat file)"

$ if [[ "$line" =~ \<img.*alt=\"([^\"]*)\" ]]
then
  echo "${BASH_REMATCH[1]}"
fi
Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin

Update:

Using expr:

$ expr "$line" : '.*<img.*alt="\([^"]*\)".*'
Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin
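
Both commands print the title with its HTML entities still encoded (`&#39;`, `&quot;`), while the question asks for the plain text. One possible follow-up step is to decode them with `sed`. This is a sketch only: `decode_entities` is a hypothetical helper name, and it handles just the five entities listed, not the full HTML entity set.

```shell
# decode_entities: hypothetical helper, covers only a few common entities.
# &amp; is decoded last so that "&amp;quot;" becomes "&quot;" rather than
# a bare double quote.
decode_entities() {
  printf '%s\n' "$1" | sed \
    -e 's/&quot;/"/g' \
    -e "s/&#39;/'/g" \
    -e 's/&lt;/</g' \
    -e 's/&gt;/>/g' \
    -e 's/&amp;/\&/g'
}

raw='Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach'
decoded=$(decode_entities "$raw")
echo "$decoded"
```

With the sample above this prints `Lil' Buck "Golden Gateway" Venice Beach`. In the last `sed` expression the replacement is written `\&` because a bare `&` in a `sed` replacement stands for the matched text.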
dogbane

I suppose using a regex is mandatory for your assignment. If not, I would use an XML parser.

If it is, I suggest you try RegexBuddy:

RegexBuddy makes it easier than ever for you to create regular expressions that do what you intend, without any guesswork. Still, you need to test your regex patterns to be 100% sure that they match what you want, and don't match what you don't want.

Frank
  • Thank you for your reply. Do you know if there is a way to do it simply by using the 'expr' command, as mentioned by my professor? – Mike Oct 03 '12 at 14:36
  • Yes, you can, but it is easier to use the tool to find the right regex string. – Frank Oct 03 '12 at 17:37