I am trying to scrape some pages on a website. Here is an example of the HTML:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/static/favicon-f8a3a024b0.ico" rel="shortcut icon"/>
<link href="/opensearch_ggs.xml" rel="search" title="WEBSITE anime GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_ggs2.xml" rel="search" title="WEBSITE music GG" type="application/opensearchdescription+xml"/>
<link href="/opensearch_artists.xml" rel="search" title="WEBSITE artists" type="application/opensearchdescription+xml"/>
<link href="/opensearch_requests.xml" rel="search" title="WEBSITE requests" type="application/opensearchdescription+xml"/>
<link href="/opensearch_forums.xml" rel="search" title="WEBSITE forums" type="application/opensearchdescription+xml"/>
<link href="/opensearch_users.xml" rel="search" title="WEBSITE users" type="application/opensearchdescription+xml"/>
<link href="/feed/rss_ggs_all/GOODSTUFF" rel="alternate" title="WEBSITE - All GG" type="application/rss+xml"/>
<link href="/feed/rss_ggs_anime/GOODSTUFF" rel="alternate" title="WEBSITE - Anime GG" type="application/rss+xml"/>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
<span class="download_link">[<a href="https://WEBSITE.tv/GG/223197/download/GOODSTUFF" title="Download">DL</a>]</span>
Here is the code I'm working with:
import re
import urllib
import urllib2
from time import sleep
from bs4 import BeautifulSoup

# br is a mechanize Browser; pages, pagen, url2, pageout, file and
# res_init() are all set up earlier in the script.
for x in range(pages):
    pagen += 1
    # splice the page number into the URL
    url3 = url2[:40] + str(pagen) + url2[41:]
    print "url3 = ", url3
    ggs = br.open(url3)
    #print "ggs = ", ggs.read()
    soup = BeautifulSoup(ggs, "lxml")
    print "soup = ", soup
    trueurl = 'https://WEBSITE.tv'
    # Find the gg links
    download = soup.find_all(href=re.compile("GOODSTUFF"))
    #print "download = ", download
    # For-loop to download the ggs
    for link in download:
        sleep(10)
        print 'loop'
        gglink = link.get('href')
        gglink = trueurl + gglink
        print gglink
        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
        req = urllib2.Request(gglink, headers=hdr)
        print req
        res_init()
        res = urllib2.urlopen(req)
        #print res
        directory = "/home/cyber/yen/"  # gg directory, change as you please.
        file += 1
        print "Page", pagen, "of", pageout, ".....", file, 'ggs downloaded'
        urllib.urlretrieve(gglink, directory + 'page' + str(pagen) + '_gg' + str(file) + ".gg")
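(Side note: urllib.urlretrieve cannot send custom headers, so the final download goes out with urllib's default User-Agent and the hdr dict is only used for the urllib2 request. If the headers matter, the response that urllib2.urlopen already opened could be written out directly; a minimal sketch, reusing the variables above:

# Sketch: save the body of the already-opened urllib2 response instead of
# re-fetching the URL with urllib.urlretrieve (which ignores hdr).
outpath = directory + 'page' + str(pagen) + '_gg' + str(file) + ".gg"
with open(outpath, 'wb') as f:
    f.write(res.read())

)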
I only want to download
https://WEBSITE.tv/GG/223197/download/GOODSTUFF
but it also grabs
/feed/rss_ggs_anime/GOODSTUFF
I do not want that.
The problem is that find_all matches everything containing GOODSTUFF. I tried to narrow it down with this:

for download in soup.find_all(href=re.compile("GOODSTUFF")):
    if download.find("feed"):
        continue

but it doesn't filter anything out. I tried "rss" instead of "feed" too, with the same results.
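If I understand the BeautifulSoup docs right, download here is a bs4 Tag, so download.find("feed") searches for a child element named <feed> (which never exists) rather than testing the href string, which would explain why nothing gets skipped. A sketch of what I think is needed instead, either testing the href itself or anchoring the regex to the /GG/<id>/download/ shape visible in the sample HTML above:

# Option 1: test the href string, not the tag
for link in soup.find_all(href=re.compile("GOODSTUFF")):
    href = link.get('href', '')
    if 'feed' in href:  # skip the /feed/rss_... links
        continue
    print href

# Option 2: anchor the regex so only the download URLs match at all
# (the /GG/<number>/download/ pattern is taken from the sample HTML above)
download = soup.find_all(href=re.compile(r"/GG/\d+/download/GOODSTUFF"))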