I am trying to extract only the links that point to particular sites. I wrote the following after sifting through this site for hours, but it does not work well for me.

match = re.compile('<a href="(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)(youtu|www.youtube|youtube|vimeo|dailymotion|)\.(.+?)"',re.DOTALL).findall(html)
for title in match:
    print '<a href="'+title+'>'+title+'</a>'

The code above gives this error:

    print '<a href="'+title+'>'+title+'</a>'
TypeError: cannot concatenate 'str' and 'tuple' objects

and if I simply use "print title", I get the following ugly result:

('https://www.', 'youtube', 'com/watch?v=gm2SGfjvgjM')

All of the links being scraped look like this:

<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM"

I'm hoping to have it print like the following:

<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">youtube</a>
<a href="http://www.dailymotion.com/video/x5zuvuu">dailymotion</a>

I've been playing with Python for a while, but I struggle a lot. FYI, I've spent endless hours trying to figure out Beautiful Soup but just don't get it. I would appreciate any help on this without changing the method totally, if possible. Thanks for any help.

cs95
Bobby Peters
  • Try running your code here: http://pythontutor.com –  Sep 11 '17 at 00:09
  • I will try, Dani. Thanks, I have not seen that site before. What would be the benefit of testing in there as opposed to running in IDLE? – Bobby Peters Sep 11 '17 at 00:16
  • The reason you get the error is, you are trying to put together tuples and strings. If you are not sure at what point `title` becomes a string (though you can try figuring that out yourself), python tutor can help you, by showing you the steps the program takes, visually, 1 by 1. –  Sep 11 '17 at 00:18
  • Also, there probably is a solution without using regex, and you should definitely try that. https://stackoverflow.com/a/7553730/5306470 –  Sep 11 '17 at 00:19
  • Perfect, thanks Dani. I will continue to learn with the examples you have provided. – Bobby Peters Sep 11 '17 at 00:25
  • Regex is not ideal for parsing HTML. Use an HTML parser like BeautifulSoup. – Mark Tolonen Sep 11 '17 at 02:50
  • @MarkTolonen I am well aware of that, but I suck way worse with BeautifulSoup than I do with regex, as I said in my description above. – Bobby Peters Sep 12 '17 at 01:55

2 Answers

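Your pattern is mostly fine. The problem is the capturing groups inside it: when a pattern contains more than one capturing group, findall returns a tuple of the groups for each match, which is why concatenating title with a string fails. Make the inner groups non-capturing with ?: and wrap the whole URL in a single capturing group, so that each result is one string.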

import re

# One capturing group around the whole URL; the inner groups are non-capturing.
p = re.compile(r'<a href="((?:http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'
               r'(?:youtu|www.youtube|youtube|vimeo|dailymotion|)'
               r'\.(?:.+?))"', re.DOTALL)
match = p.findall(html)
for title in match:
    print('<a href="' + title + '">' + title + '</a>')
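As a rough illustration of why the groups matter, the sketch below shows how the shape of the findall result depends on the number of capturing groups. It is written for Python 3, and the sample `html` string and the shortened patterns are assumptions made for the example, not the pattern from the question.

```python
import re

# Hypothetical sample input for illustration only.
html = '<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">'

# No capturing groups: findall returns each whole match as a string.
no_groups = re.findall(r'https?://(?:www\.)?(?:youtube|vimeo)\.[^"]+', html)

# One capturing group: findall returns only that group's text, as strings.
one_group = re.findall(r'href="(https?://(?:www\.)?(?:youtube|vimeo)\.[^"]+)"', html)

# Two or more capturing groups: findall returns one tuple per match.
many_groups = re.findall(r'href="(https?://(?:www\.)?)(youtube|vimeo)\.([^"]+)"', html)

print(no_groups)
print(one_group)
print(many_groups)
```

The third call reproduces the tuple shape shown in the question, which is what triggers the TypeError when it is concatenated with a string.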

To retain the link as well as the domain name, another small change is needed: capture the whole expression and the website name as two separate groups (the former contains the latter):

p = re.compile(r'<a href="((?:http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'
               r'(youtu|www.youtube|youtube|vimeo|dailymotion|)'
               r'\.(?:.+?))"', re.DOTALL)

match = p.findall(html)
for title in match:
    # title[0] is the full URL, title[1] is the site name
    print('<a href="' + title[0] + '">' + title[1] + '</a>')

Each match is now a tuple: title[0] is the full URL and title[1] is the site name.
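For a self-contained version of the two-group idea, here is a minimal sketch for Python 3. The sample `html`, the helper name `build_links`, and the simplified pattern (`https?://`, an optional `www.`, and `[^"]+` in place of `.+?` with DOTALL) are assumptions for illustration, not the exact pattern above. Because every result has exactly two groups, it can be unpacked into two names; a "too many values to unpack" error appears when the pattern has more groups than there are names.

```python
import re

# Hypothetical sample input; the real `html` would come from the scraped page.
html = '''
<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">
<a href="http://www.dailymotion.com/video/x5zuvuu">
<a href="https://example.com/not-a-video">
'''

# Group 1 is the whole URL; group 2 (nested inside group 1) is the site name.
pattern = re.compile(
    r'<a href="((?:https?://)(?:www\.)?'
    r'(youtu|youtube|vimeo|dailymotion)'
    r'\.[^"]+)"'
)

def build_links(html):
    """Return one anchor tag per matching link, titled with the site name."""
    # Each findall result is a (url, site) tuple, so it unpacks into two names.
    return ['<a href="%s">%s</a>' % (url, site)
            for url, site in pattern.findall(html)]

for link in build_links(html):
    print(link)
```

With the sample input, only the YouTube and Dailymotion links are printed; the example.com link is skipped because its host is not in the alternation.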

cs95
  • That works pretty darn well, COLDSPEED. Thanks for your help. If I'm not pushing my luck, could you help with adding the host name as the link title? I have always been under the impression that every instance of "(.+? or whatever)" would be a match I could name and print, but in this case when I give it a name it tells me there are too many values to unpack. Any insight as to why it's not a match would be useful info too. Thanks so much – Bobby Peters Sep 11 '17 at 00:30
  • @BobbyPeters Made an edit. Take a look and see if it works. – cs95 Sep 11 '17 at 00:34
  • @BobbyPeters Note that, if you pass capturing groups to `findall`, only the capture groups are returned. Knowing how this works helps you work around it. – cs95 Sep 11 '17 at 00:35
  • That works flawlessly, I really appreciate your help. I now understand "group". I have been playing with Python for a couple of years but am not made to be a programmer lol. I have ADHD and have a hard time reading through documentation. I learn best by playing with other people's code. Thanks again :) Means a lot to me – Bobby Peters Sep 11 '17 at 00:40

You can simply use:

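print('<a href="' + title[0] + title[1] + '.' + title[2] + '">' + title[1] + '</a>')

With your original pattern, each match is a tuple in which each element is one capturing group. Concatenating the elements rebuilds the URL, but the dot between the second and third groups is matched outside any group, so it has to be added back by hand (a plain ''.join(title) would give youtubecom). The second element is the group you want to use as the link text.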
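To make the tuple approach concrete, here is a small Python 3 sketch that keeps the question's original three-group pattern (only rewritten as a raw string) and rebuilds each URL from the tuple. The sample `html` and the names `prefix`, `site`, `rest`, and `links` are assumptions for illustration.

```python
import re

# Hypothetical sample input for illustration only.
html = '<a href="https://www.youtube.com/watch?v=gm2SGfjvgjM">'

# The question's original pattern as a raw string: three capturing groups.
pattern = re.compile(
    r'<a href="(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)'
    r'(youtu|www.youtube|youtube|vimeo|dailymotion|)\.(.+?)"',
    re.DOTALL,
)

links = []
for prefix, site, rest in pattern.findall(html):
    # The "\." between groups 2 and 3 is matched but not captured,
    # so it has to be put back when rebuilding the URL.
    url = prefix + site + '.' + rest
    links.append('<a href="%s">%s</a>' % (url, site))

print(links[0])
```

Unpacking into three names works here because the pattern has exactly three capturing groups.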

y.luis.rojo