Substring extract from html: BASH

Question

I need to extract the video names from YouTube's index.html. I have been able to break the file into small chunks, each containing one video listing, but I cannot seem to extract the video title. My professor provided the following command, but I cannot get it to work in this situation.

number=`expr "$s" : ".*\/\([0-9,]*\)\/"`; echo $number # will print 250,4211

Although I'm not completely sure, I think I'm having trouble getting this method to work because there aren't spaces between the video title and surrounding text. Here is a sample of what I would need to extract the title from:

<li class="video-list-item "><a href="/watch?v=9BbgvlgDQMg&amp;feature=g-sptl&amp;cid=inp-hs-edt" class="video-list-item-link yt-uix-sessionlink" data-sessionlink="ei=CMzmroaB5bICFRiXIQoda3kX5g%3D%3D&amp;feature=g-sptl%26cid%3Dinp-hs-edt" ><span class="ux-thumb-wrap contains-addto "><span class="video-thumb ux-thumb yt-thumb-default-120 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif" alt="Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin" data-thumb="//i2.ytimg.com/vi/9BbgvlgDQMg/default.jpg" width="120" ><span class="vertical-align"></span></span></span></span><span class="video-time">3:51</span>

Out of this chunk of text, I would need to extract "Lil' Buck "Golden Gateway" Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin", without the quotes.

Obligatory answer: http://stackoverflow.com/a/1732454/1032785 — jordanm, Oct 03 '12 at 14:26

dogbane · Accepted Answer · 2012-10-03T14:46:16.657

You can use the bash regex \<img.*alt=\"([^\"]*)\" to extract the alt text from the img element.

Example:

$ cat file
<li class="video-list-item "><a href="/watch?v=9BbgvlgDQMg&amp;feature=g-sptl&amp;cid=inp-hs-edt" class="video-list-item-link yt-uix-sessionlink" data-sessionlink="ei=CMzmroaB5bICFRiXIQoda3kX5g%3D%3D&amp;feature=g-sptl%26cid%3Dinp-hs-edt" ><span class="ux-thumb-wrap contains-addto "><span class="video-thumb ux-thumb yt-thumb-default-120 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif" alt="Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin" data-thumb="//i2.ytimg.com/vi/9BbgvlgDQMg/default.jpg" width="120" ><span class="vertical-align"></span></span></span></span><span class="video-time">3:51</span>

$ line="$(cat file)"

$ if [[ "$line" =~ \<img.*alt=\"([^\"]*)\" ]]
then
  echo "${BASH_REMATCH[1]}"
fi
Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin
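
The captured text still contains HTML entities (&#39;, &quot;), while the question asks for the plain title. A minimal sketch of one way to decode them follows. It assumes only the entities seen in the sample need handling; decode_entities and re are hypothetical names, the sample line is shortened, and the pattern is the same one as above, stored in a variable to sidestep quoting inside [[ ]].

```bash
#!/usr/bin/env bash

# Hypothetical helper: decodes only the three entities that appear in the sample.
# &amp; is decoded last so that "&amp;quot;" becomes the text "&quot;", not a quote.
decode_entities() {
  printf '%s\n' "$1" | sed -e "s/&#39;/'/g" -e 's/&quot;/"/g' -e 's/&amp;/\&/g'
}

# Shortened sample line (assumption: same shape as the chunk in the question).
line='<img src="x.gif" alt="Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach" width="120" >'

# Same regex as in the answer, kept in a variable.
re='<img.*alt="([^"]*)"'

if [[ $line =~ $re ]]; then
  title=$(decode_entities "${BASH_REMATCH[1]}")
  echo "$title"   # prints: Lil' Buck "Golden Gateway" Venice Beach
fi
```

A real page can contain many more entities than these three, so a dedicated HTML tool may be a better fit outside of an exercise like this one.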

Update:

Using expr:

$ expr "$line" : '.*<img.*alt="\([^"]*\)".*'
Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin
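
Why the leading .* matters: expr STRING : REGEX anchors the match at the first character of STRING and, when the pattern contains a \( \) group, prints what the group captured. The sketch below shows this with the professor's pattern. The value of s is a hypothetical stand-in, since the question does not show the real one.

```bash
# Hypothetical input; the question does not show the real value of $s.
s='views/250,4211/'

# The leading .* skips everything before the group; \( \) is what gets printed.
number=$(expr "$s" : '.*/\([0-9,]*\)/')
echo "$number"   # prints: 250,4211

# Without the leading .*, the match must start at the first character and fails:
# expr prints an empty line and exits with status 1.
expr "$s" : '/\([0-9,]*\)/' || echo "no match"
```

So the professor's command does not depend on spaces around the target text. It only matches digits and commas between two slashes, which is why it finds nothing useful in the video listing.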

Thank you, worked perfectly. I really appreciate the help. Now to implement it. — Mike, Oct 03 '12 at 14:44

score 0 · Answer 2 · answered Oct 03 '12 at 14:28

I suppose using a regex is mandatory for your assignment. If not, I would go for an XML parser.

But if it is, I suggest you have a go with RegexBuddy.

RegexBuddy makes it easier than ever for you to create regular expressions that do what you intend, without any guesswork. Still, you need to test your regex patterns to be 100% sure that they match what you want, and don't match what you don't want.

Frank


Thank you for your reply. Do you know if there is a way to do it simply by using the 'expr' command, as mentioned by my professor? – Mike Oct 03 '12 at 14:36
Yes, you can, but it is easier to use the tool to find the right regex string. – Frank Oct 03 '12 at 17:37
