Python Grab all links from a html and only display the links

Question

I'm trying to grab the title out of a webpage using the following statement:

titl1 = re.findall(r'<title>(.*?)</title>',the_webpage)

Using that, I get ['random webpage example1']. How do I remove the quotes and brackets?

I'm also trying to grab a set of links that change hourly (which is why I need the wildcard) using this: links = re.findall(r'(file=(.*?).mp3)',the_webpage).

I get

[('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
  'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
 ('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
  'http://media.kickstatic.com/kickapps/images/3380/audios/944521'), 
 ('file=http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3', 
  'http://media.kickstatic.com/kickapps/images/3380/audios/944521')]

How do I get the mp3 links without the file=?

I also want to download the mp3 files and name them after the title of the website, so that the file shows up as

random webpage example1.mp3

How would I do this? I'm still learning Python and regex and this is kinda stumping me.

[regex is generally not a good candidate for parsing XML/HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). You might find [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) useful -- grabbing all links would be as simple as `soup.find_all('a')`. Take a look at [the docs](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). — Shawn Chin, Aug 01 '12 at 20:59
You should look at BeautifulSoup which is more suitable for url parsing. — xbb, Aug 01 '12 at 20:59
Oh.. and you might find this useful for formatting your question: http://stackoverflow.com/editing-help — Shawn Chin, Aug 01 '12 at 21:02
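To illustrate the parser-based approach the comments recommend, here is a minimal sketch that collects the href of every <a> tag. It uses the standard library's html.parser as a stand-in for BeautifulSoup, so nothing needs to be installed. The names LinkCollector and collect_links are made up for this example, and it is written for Python 3.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: remembers the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def collect_links(html):
    """Return all href values found in <a> tags, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<a href="http://example.com/a.mp3">one</a> <a name="top">two</a>'
print(collect_links(sample))
```

Note that this only finds real anchor tags. Links embedded in something like a file=... player parameter, as in the question, would still need a different lookup.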

score 0 · Answer 1 · edited May 23 '17 at 11:52

At least for part 1, you could do

>>> mytitle = titl1[0]
>>> print mytitle
random webpage example1

The regex is returning a list of strings that match, so you just need to grab the first item on the list.
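If the page might have no title at all, indexing the list with [0] raises an IndexError. A small sketch of a more defensive variant, using re.search and returning None when nothing matches, could look like this. The function name extract_title and the sample page are assumptions for illustration, and the code is Python 3.

```python
import re


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


page = '<html><head><title>random webpage example1</title></head></html>'
print(extract_title(page))  # random webpage example1
```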

Similarly, for part two, the regex is returning a list with tuples inside. You could do:

>>> download_links = [href for (discard, href) in links]
>>> print download_links
['http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521', 'http://media.kickstatic.com/kickapps/images/3380/audios/944521']

To download the files, use urllib2 (on Python 2.x; in Python 3 the same functionality lives in urllib.request). See this question for details.

ffledgling · Answer 2 · 2012-08-01T21:30:00.890

For the first part, titl1 = re.findall(r'<title>(.*?)</title>',the_webpage) returns a list, and a list is printed with its brackets and quotes. So try print titl1[0] if you are sure there will always be exactly one match. (You can also try re.search instead.)

For the second part, if you change your pattern from "(file=(.*?)\.mp3)" to "file=(.*?)\.mp3", you will get only the 'http://linkInThisPart/path/etc/etc' part. You will need to add the .mp3 extension back, though.

i.e.

audio_links = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',web_page)]
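The difference comes from how re.findall treats groups: with two groups it returns a tuple per match, with one group it returns just that group's text. The sketch below shows all three variants side by side on a made-up string (the example.com URLs are placeholders, not the real page), written for Python 3.

```python
import re

# Hypothetical page fragment with two embedded mp3 references.
sample = ('x file=http://example.com/audios/1.mp3&y '
          'file=http://example.com/audios/2.mp3')

# Two groups: each match becomes a (outer, inner) tuple.
nested = re.findall(r'(file=(.*?)\.mp3)', sample)
print(nested)
# [('file=http://example.com/audios/1.mp3', 'http://example.com/audios/1'),
#  ('file=http://example.com/audios/2.mp3', 'http://example.com/audios/2')]

# One group: each match is just the captured text, without the extension.
single = re.findall(r'file=(.*?)\.mp3', sample)
print(single)  # ['http://example.com/audios/1', 'http://example.com/audios/2']

# Moving \.mp3 inside the group keeps the extension, so nothing is re-added.
whole = re.findall(r'file=(.*?\.mp3)', sample)
print(whole)   # ['http://example.com/audios/1.mp3', 'http://example.com/audios/2.mp3']
```

The third form avoids the extra list comprehension, at the cost of a slightly less obvious pattern.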

To download the files, you might want to look into urllib and urllib2:

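import urllib2

url = 'http://media.kickstatic.com/kickapps/images/3380/audios/944521.mp3'
req = urllib2.Request(url)
# Open the target in binary write mode; the name comes from the page title.
temp_file = open('random webpage example1.mp3', 'wb')
# Read the whole response body, then write the same bytes to disk.
data = urllib2.urlopen(req).read()
temp_file.write(data)
temp_file.close()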
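For Python 3, where urllib2 no longer exists, a rough equivalent can be built on urllib.request. This is only a sketch: safe_filename and download are names invented here, and the set of characters replaced in the title is an assumption about what is unsafe in a file name on common systems.

```python
import re
import shutil
from urllib.request import urlopen


def safe_filename(title, extension='.mp3'):
    """Build a file name from a page title, replacing path-unsafe characters."""
    cleaned = re.sub(r'[\\/:*?"<>|]+', '_', title).strip()
    return cleaned + extension


def download(url, destination):
    """Copy the response body to disk in chunks rather than reading it all at once."""
    with urlopen(url) as response, open(destination, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    return destination
```

A call would then look like download(url, safe_filename('random webpage example1')), with url being one of the extracted mp3 links.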

so when i use the links audio_links = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',web_page)] all i get for a return are ['', '', ''] — jokajinx, Aug 03 '12 at 13:15

score 0 · Answer 3 · edited Aug 28 '12 at 18:23

Code:

#!/usr/bin/env python

import re,urllib,urllib2

Url = "http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000"
print Url
print 'test .............'
req = urllib2.Request(Url)
print "1"
response = urllib2.urlopen(req)
print "2"
the_webpage = response.read()
print "3"
titl1 = re.findall(r'<title>(.*?)</title>',the_webpage)
print "4"
a2 = [x +'.mp3' for x in re.findall(r'file=(.*?)\.mp3',the_webpage)]
print "5"
a2 = [x[0][5:] for x in a2]
print "6"
ti = titl1[0]
print ti
print "7"
print a2
print "8"

print "9"
#print the_page
print "10"

req=urllib2.Request(a2)
print "11"
temp_file=open(ti)
print "12"
buffer=urllib2.urlopen(req).read()
print "13"
temp_file.write(buff)
print "14"
temp_file.close()
print "15"
print "16"

Results

http://www.ihiphopmusic.com/music/rick-ross-sixteen-feat-andre-3000
test .............
1
2
3
4
5
6
Rick Ross - Sixteen (feat. Andre 3000)
7
['', '', '']
8
9
10
Traceback (most recent call last):
  File "grub.py", line 29, in <module>
    req=urllib2.Request(a2)
  File "/usr/lib/python2.7/urllib2.py", line 198, in __init__
    self.__original = unwrap(url)
  File "/usr/lib/python2.7/urllib.py", line 1056, in unwrap
    url = url.strip()
AttributeError: 'list' object has no attribute 'strip'
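Reading the output against the code suggests where things go wrong. The second assignment to a2 takes x[0], which is the first character of each URL string, and slices it from index 5, which always gives an empty string; that would explain ['', '', '']. The traceback then comes from passing the whole list to urllib2.Request, which expects one URL string. Further down, open(ti) opens the file for reading rather than writing, and buff is never defined (the data is stored in buffer). The sketch below reproduces the empty strings on a made-up page and shows one way the loop could be restructured. The page text and the numbered file names are assumptions, and it is written for Python 3 without doing any actual download.

```python
import re

# Hypothetical stand-in for the downloaded page source.
the_webpage = ('<title>Rick Ross - Sixteen (feat. Andre 3000)</title>'
               ' file=http://example.com/audios/944521.mp3'
               ' file=http://example.com/audios/944521.mp3')

a2 = [x + '.mp3' for x in re.findall(r'file=(.*?)\.mp3', the_webpage)]

# The extra slicing step: x is a string, so x[0] is a single character ('h'),
# and slicing a single character from index 5 onward yields ''.
broken = [x[0][5:] for x in a2]
print(broken)  # ['', '']

# Dropping that step keeps the URLs; dict.fromkeys removes repeats, keeping order.
unique_links = list(dict.fromkeys(a2))
ti = re.findall(r'<title>(.*?)</title>', the_webpage)[0]

# Handle one URL at a time instead of passing the whole list to Request.
for number, link in enumerate(unique_links, start=1):
    filename = '%s %d.mp3' % (ti, number)
    print(link, '->', filename)
```

Inside that loop, each link would be fetched on its own and written to a file opened with mode 'wb'.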

score 0 · Answer 4 · answered Jun 01 '17 at 05:14

Python 3:

import requests
import re
from urllib.request import urlretrieve

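- First, get the HTML text

# 'url' is a placeholder for the page address; .text gives the body as a string
html_text = requests.get('url').text

- Then find the URLs with a regex

The general form is re.match(pattern, text, flags); re.findall takes the same arguments.

In the pattern, '()' groups the part you want to keep. In this case we group 'http://*****.mp3'. With re.match or re.search you can refer to it with .group(1) or .groups(); re.findall with a single group returns the group's text directly.

url_matches = re.findall(r'file=(http://.*?\.mp3)', html_text)
index = 0
for url_match in url_matches:
    index += 1
    print(url_match)
    # the ./graber/mp3/ directory must already exist
    urlretrieve(url_match, './graber/mp3/user' + str(index) + '.mp3')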

That's how I did it; I hope it helps. (There are several ways to download files; in this case I use urlretrieve.)
