
I'm having trouble with a regex while scraping YouTube playlist pages. It mostly works fine, but it's picking up a couple of strange results.

Expression:

(?<=v=)[a-zA-Z0-9-_]+(?=&)|(?<=[0-9]/)[^&\n]+|(?<=v=)[^&\n]+

Examples of what to pick out:

yXBckFyiMyU,
opWYnUpNtG8,
YFbLRZCExBk,
I_GZahAl-PQ,
G6F_iP-F7Fw

from links like this:

https://www.youtube.com/watch?v=_ClmClS_Mqs&list=PL6422619E56951B73&index=5&feature=plpp_video

For the most part this appears to be working okay; however, it is also picking up these instances:

data-thumb="//i1.ytimg.com/vi/84GVRtJ1CvY/<FROM RIGHT ONWARDS IS WHAT IT MATCHES>default.jpg" ><span class="vertical-align"></span></span></span></span>

data-thumb="//i4.ytimg.com/vi/WNIPqafd4As/<FROM RIGHT ONWARDS IS WHAT IT MATCHES>default.jpg" alt="" class="thumb"></span></span></span><span class="clip"><span class="centering-offset"><span class="centering"><span class="ie7-vertical-align-hack">

Regex is rather daunting. Does anyone know what's wrong with the expression?

CitizenSmif
  • Have you considered using some HTML parser to build a tree of elements, and then only apply regular expressions to the links found in that tree? [Here](http://stackoverflow.com/a/1732454/960195) is a humorous opinion on parsing HTML with regular expressions versus a dedicated parser. – Adam Mihalcin Mar 23 '12 at 01:38
  • @Adam: We're not trying to parse arbitrary HTML - just URLs. Cthulu/Tony the Pony aren't going to consume your soul for *trying* to do this with regex. (Proper HTML and URL parsing libraries still recommended, though.) – Li-aung Yip Mar 23 '12 at 01:42

1 Answer

As a suggestion, the strings you want to match are always 11 characters long. Rather than matching "as many characters as possible" with the + quantifier, match "exactly 11 characters" with the {11} quantifier.

This may cure the symptoms of the over-matching problem you are seeing, though I don't know why it's matching those strings in the first place. (They don't start with v=.)
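
A minimal sketch of the {11} idea in Python, assuming the IDs only contain letters, digits, "-" and "_" (the hyphen is moved to the end of the character class so it cannot be read as a range). The thumbnail string is shortened from the question; the variable names are just for illustration.

```python
import re

# Assumption (from the answer): a video ID is exactly 11 characters
# made of letters, digits, "-" and "_".
VIDEO_ID = re.compile(r"(?<=v=)[A-Za-z0-9_-]{11}")

url = ("https://www.youtube.com/watch?v=_ClmClS_Mqs"
       "&list=PL6422619E56951B73&index=5&feature=plpp_video")
thumb = 'data-thumb="//i1.ytimg.com/vi/84GVRtJ1CvY/default.jpg"'

ids = VIDEO_ID.findall(url)          # IDs that follow "v=" in the link
thumb_ids = VIDEO_ID.findall(thumb)  # no "v=" here, so nothing should match

print(ids, thumb_ids)
```

Because the pattern insists on "v=" and a fixed length, it has no way to run on into the rest of a thumbnail attribute.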

You should probably clarify your alternations | by parenthesising:

((?<=v=)[a-zA-Z0-9-_]+(?=&))|((?<=[0-9]/)[^&\n]+)|((?<=v=)[^&\n]+)

and if your regex flavour supports verbose regular expressions (comments inside regexes) use them!
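
For instance, Python's re module has a verbose mode (re.VERBOSE). The sketch below is the parenthesised expression from above laid out with comments; the comments are my reading of each alternative, and the hyphen in the first character class is moved to the end, which should not change what it matches.

```python
import re

PATTERN = re.compile(r"""
    ( (?<=v=)     [a-zA-Z0-9_-]+ (?=&) )   # ID after "v=", must be followed by "&"
  | ( (?<=[0-9]/) [^&\n]+ )                # anything after "<digit>/" up to "&" or newline
  | ( (?<=v=)     [^&\n]+ )                # anything after "v=" up to "&" or newline
""", re.VERBOSE)

url = "https://www.youtube.com/watch?v=_ClmClS_Mqs&list=PL6422619E56951B73&index=5"
match = PATTERN.search(url)
first_id = match.group(0)   # text of the first match in the URL
print(first_id)
```

Laid out like this, it is easier to see that the second alternative does not mention v= at all.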


As a suggestion - parsing URLs with regex is nasty. I would instead:

  • Get a list of all URLs on the page using an HTML parser (in Python I would use BeautifulSoup, which makes it very easy to get 'all links').
  • Parse each URL using urlparse() and parse_qsl() from the standard library (more Python), obtaining a dictionary/hash of the GET attributes.

The dictionary might look like

{
'v' : '_ClmClS_Mqs',
'list' : 'PL6422619E56951B73',
'index' : '5',
'feature' : 'plpp_video',
}

Then you can just ask for the GET attribute v. No regexes required.
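
A rough sketch of that approach, assuming Python 3. To keep it dependency-free it swaps in the standard library's html.parser for BeautifulSoup; LinkCollector, video_ids and the page fragment are hypothetical names and data made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qsl


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (stdlib stand-in for BeautifulSoup)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def video_ids(html):
    """Return the 'v' GET attribute of every link that has one."""
    collector = LinkCollector()
    collector.feed(html)
    ids = []
    for link in collector.links:
        params = dict(parse_qsl(urlparse(link).query))  # e.g. {'v': ..., 'list': ...}
        if "v" in params:
            ids.append(params["v"])
    return ids


# Hypothetical page fragment built from the URL and a thumbnail in the question.
page = ('<a href="https://www.youtube.com/watch?v=_ClmClS_Mqs'
        '&amp;list=PL6422619E56951B73&amp;index=5&amp;feature=plpp_video">clip</a>'
        '<img data-thumb="//i1.ytimg.com/vi/84GVRtJ1CvY/default.jpg">')
result = video_ids(page)
print(result)
```

The thumbnail is never looked at, because only href values of links are parsed.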

This is Python-specific, but Java will have equivalents. The point is that regex is not always the best tool (just the most general tool).

Li-aung Yip
  • +1 for a good answer, and I would add another +1 if I could for "regex is not always the *best* tool (just the *most general* tool)" – Adam Mihalcin Mar 23 '12 at 01:44
  • Thank you for your help. I plan on scraping a lot more in the future, but I've almost got this whole project working, so I'm going to stay with regex for the time being. Your suggestion pretty much solved the issue; the only thing is it's now picking up '. Do you know why? – CitizenSmif Mar 23 '12 at 01:53
  • The facetious answer is "your regex isn't specific enough". ;) More seriously, your regex is really three regexes in one - can you try splitting them out to see which of the three sub-regexes is producing the erroneous match? (Debugging by divide and conquer.) – Li-aung Yip Mar 23 '12 at 01:56
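
The divide-and-conquer check from the last comment could look something like this in Python. The three patterns are the alternatives from the question compiled separately; the labels, the helper name and the sample string (a thumbnail path built from one of the question's IDs that happens to end in a digit) are assumptions for illustration, so the real page may behave differently.

```python
import re

# The three alternatives from the question, one compiled pattern each.
SUB_PATTERNS = {
    "v= then &": re.compile(r"(?<=v=)[a-zA-Z0-9_-]+(?=&)"),
    "digit slash": re.compile(r"(?<=[0-9]/)[^&\n]+"),
    "v= to end": re.compile(r"(?<=v=)[^&\n]+"),
}


def matches_by_pattern(text):
    """Map each sub-pattern's label to the list of matches it finds in text."""
    return {name: pat.findall(text) for name, pat in SUB_PATTERNS.items()}


# Hypothetical sample: an ID ending in a digit, followed by "/default.jpg".
sample = 'data-thumb="//i1.ytimg.com/vi/opWYnUpNtG8/default.jpg"'
report = matches_by_pattern(sample)
for name, found in report.items():
    print(name, found)
```

Running each piece on a line that produces a bad result shows directly which alternative is responsible for it.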