I am trying to extract only the links that point to particular sites. I wrote the following after sifting through this site for hours, but it does not work well for me.
match = re.compile('<a href="(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)(youtu|www.youtube|youtube|vimeo|dailymotion|)\.(.+?)"',re.DOTALL).findall(html)
for title in match:
    print '<a href="'+title+'>'+title+'</a>'
The method above gives this error:
print '<a href="'+title+'>'+title+'</a>'
TypeError: cannot concatenate 'str' and 'tuple' objects
If I simply use "print title" instead, I get the following ugly result:
('https://www.', 'youtube', 'com/watch?v=gm2SGfjvgjM')
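As far as I can tell, the tuple shows up because `re.findall` returns one tuple per match whenever the pattern contains more than one capturing group, and this pattern has three. The sketch below reuses the same pattern (written as a raw string) on a sample string copied from the link further down; the names `prefix`, `site`, and `rest` are just illustrative. It uses `print(...)` with a single argument so it runs on both Python 2 and Python 3.

```python
import re

# Sample input, copied from the link shown in the question.
html = '<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM"'

# Same pattern as in the question, written as a raw string.
pattern = re.compile(
    r'<a href="(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'
    r'(youtu|www.youtube|youtube|vimeo|dailymotion|)\.(.+?)"',
    re.DOTALL,
)

# Three capturing groups, so findall returns a list of 3-tuples.
matches = pattern.findall(html)

for prefix, site, rest in matches:
    # The "." between site and rest is matched but not captured, so add it back.
    url = prefix + site + "." + rest
    print(url)  # https://www.youtube.com/watch?v=gm2SGfjvgjM
```

Unpacking the tuple in the `for` line is what avoids the `TypeError`, since each piece is then a plain string.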
All of the links I am scraping look like this:
<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM"
I'm hoping to have it print like the following:
<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">youtube</a>
<a href="http://www.dailymotion.com/video/x5zuvuu">dailymotion</a>
I've been playing with Python for a while, but I still struggle a lot. FYI, I've spent endless hours trying to figure out Beautiful Soup but just don't get it. I would appreciate any help with this that doesn't change the method completely, if possible. Thanks for any help.
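For reference, here is a minimal sketch of one way to stay with the regex approach and still get the output format shown above. It is not the only way to do it. The sample `html` string, the `LINK_RE` name, the `format_links` helper, and the exact site list are assumptions made for illustration. The idea is to wrap the whole URL in one capturing group, make the optional parts non-capturing with `(?:...)`, and give the site name its own named group.

```python
import re

# Hypothetical sample input built from the two links in the question,
# plus one link that should be ignored.
html = (
    '<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM" class="x">\n'
    '<a href="http://www.dailymotion.com/video/x5zuvuu">\n'
    '<a href="http://example.com/other">'
)

# Group 1 is the whole URL; the named group "site" is the site name.
# (?:...) groups do not capture. The site list is an assumption.
LINK_RE = re.compile(
    r'<a href="(https?://(?:www\.)?'
    r'(?P<site>youtube|youtu|vimeo|dailymotion)\.[^"]+)"'
)

def format_links(text):
    """Return one '<a href="URL">site</a>' string per matching link."""
    return [
        '<a href="{0}">{1}</a>'.format(m.group(1), m.group("site"))
        for m in LINK_RE.finditer(text)
    ]

links = format_links(html)
for link in links:
    print(link)
```

With the sample input above this prints the two youtube and dailymotion lines in the desired format and skips the example.com link. Using `finditer` gives match objects, so each group can be picked by number or name instead of dealing with tuples.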