download file using python beautifulsoup and selenium

Question

I want to download the first PDB file from the search results (the download link is given below the name). I am using Python, Selenium and BeautifulSoup. This is the code I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup
from selenium import webdriver


uni_id = "P22216"

# set parameters
download_dir = "/home/home/Desktop/"
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id

print "url - ", url


# opening the url
text = urllib2.urlopen(url).read();

#print "text : ", text
soup = BeautifulSoup(text);
#print soup
print


table = soup.find( "table", {"class":"queryBlue"} )
#print "table : ", table

status = 0
rows = table.findAll('tr')
for tr in rows:
    try:
        cols = tr.findAll('td')
        if cols:
            link = cols[1].find('a').get('href')
            print "link : ", link
            if link:
                if status==1:
                    main_url = "http://www.rcsb.org" + link
                print "main_url-----", main_url
                status = False
                browser.click(main_url)
        status+=1

    except:
        pass

I am getting form as None.
How can I download the first file in the search list (i.e. 2YGV in this case)?

The download link is: /pdb/protein/P32447

Works for me. I get `/pdb/explore/explore.do?structureId=2YGV`. What's the problem? You can't download it? — 4d4c, Jan 07 '14 at 11:48
I also got that, but how do I download that file? That's my problem. — sam, Jan 07 '14 at 16:02

score 2 · Accepted Answer · edited May 23 '17 at 11:49

I'm not sure exactly what you are trying to download, but here is an example of how to download the 2YGV file:

import urllib
import urllib2
from bs4 import BeautifulSoup

uni_id = "P22216"
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id

# Fetch and parse the search results page.
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)

# The parent of the download icon's <span> carries the link.
link = soup.find("span", {"class": "iconSet-main icon-download"}).parent.get("href")

# Save the file, named after the ID at the end of the link.
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")

This script downloads the file from the link on the page. It doesn't need Selenium; I used urllib to retrieve the file. You can read this post for more information on how to download files with urllib.
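
As a rough illustration of the last line of the script, the sketch below (Python 3, standard library only) builds the absolute URL and the local file name from a relative link. It assumes the link looks like `/pdb/explore/explore.do?structureId=2YGV`, the form reported in the comment above; `build_download_target` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urljoin, urlparse, parse_qs

BASE_URL = "http://www.rcsb.org/"


def build_download_target(link, base_url=BASE_URL):
    """Return (absolute_url, local_filename) for a relative link.

    Hypothetical helper: assumes the structure ID is either the
    `structureId` query parameter or the text after the last "=".
    """
    # urljoin avoids the double slash produced by plain concatenation
    # when the link already starts with "/".
    absolute_url = urljoin(base_url, link)
    query = parse_qs(urlparse(absolute_url).query)
    # Fall back to the answer's approach: text after the last "=".
    structure_id = query.get("structureId", [link.split("=")[-1]])[0]
    return absolute_url, structure_id + ".pdb"


url, filename = build_download_target("/pdb/explore/explore.do?structureId=2YGV")
print(url)
print(filename)
```

With these two values, `urllib.request.urlretrieve(url, filename)` would be the Python 3 counterpart of the `urllib.urlretrieve` call used in the script.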

Edit:

Or use this code to find the download link (it all depends on which files you want to download from which URL):

import urllib
import urllib2
from bs4 import BeautifulSoup


uni_id = "P22216"
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id
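# Fetch and parse the search results page.
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)
# Take the first tooltip link inside the results table.
table = soup.find("table", {"class": "queryBlue"})
link = table.find("a", {"class": "tooltip"}).get("href")
# Save the file, named after the ID at the end of the link.
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")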
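
Note that `soup.find` returns None when the page has no `queryBlue` table, so the `table.find` call would then raise an AttributeError. Below is a minimal sketch of the same "first tooltip link inside the results table" lookup with an explicit no-result case. It is written in Python 3 and uses the standard library's `html.parser` in place of BeautifulSoup so that the example is self-contained. The class and function names are hypothetical, and the sample HTML is made up for illustration rather than copied from the real page.

```python
from html.parser import HTMLParser


class FirstResultLinkParser(HTMLParser):
    """Record the href of the first <a class="tooltip"> found inside
    a <table class="queryBlue"> (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # > 0 while inside the results table
        self.link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "table":
            # Start counting at the results table; keep counting nested tables.
            if self.table_depth or "queryBlue" in classes:
                self.table_depth += 1
        elif tag == "a" and self.table_depth and self.link is None:
            if "tooltip" in classes:
                self.link = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1


def first_result_link(html):
    """Return the first result link, or None if there is no results table."""
    parser = FirstResultLinkParser()
    parser.feed(html)
    return parser.link


# Made-up markup, only for illustration.
sample = """
<a class="tooltip" href="/outside">not a search result</a>
<table class="queryBlue">
  <tr><td><a class="tooltip" href="/pdb/explore/explore.do?structureId=2YGV">2YGV</a></td></tr>
  <tr><td><a class="tooltip" href="/pdb/explore/explore.do?structureId=3KYH">3KYH</a></td></tr>
</table>
"""

print(first_result_link(sample))
print(first_result_link("<p>no results</p>"))
```

Checking the returned value for None before downloading avoids the AttributeError when a search has no hits.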

Here is an example of how you could do what you asked in the comment:

import mechanize
from bs4 import BeautifulSoup


SEARCH_URL = "http://www.rcsb.org/pdb/home/home.do"

l = ["YGL130W", "YDL159W", "YOR181W"]
browser = mechanize.Browser()

for item in l:
    # Submit each ID through the first form on the page (assumed to be the search form).
    browser.open(SEARCH_URL)
    browser.select_form(nr=0)
    browser["q"] = item
    html = browser.submit()

    soup = BeautifulSoup(html)
    table = soup.find("table", {"class":"queryBlue"})
    if table:
        # Download the first hit in the results table.
        link = table.find("a", {"class":"tooltip"}).get("href")
        filename = str(link.split("=")[-1]) + ".pdb"
        browser.retrieve("http://www.rcsb.org/" + str(link), filename)
        print "Downloaded " + item + " as " + filename
    else:
        print item + " was not found"

Output:

Downloaded YGL130W as 3KYH.pdb
Downloaded YDL159W as 3FWB.pdb
YOR181W was not found
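
The loop above mixes searching, downloading and reporting. One possible way to make the not-found case explicit, and to try the loop without network access, is to pass the lookup and the download in as functions. This is only a Python 3 sketch: `download_first_hits`, `find_link` and `save` are hypothetical names, and the fake links below merely mirror the output shown above (their exact format is an assumption).

```python
def download_first_hits(ids, find_link, save):
    """Download the first hit for each ID.

    find_link(item) returns a relative link, or None if nothing was found.
    save(link, filename) performs the actual download.
    Returns a dict mapping each ID to its file name, or None if not found.
    """
    results = {}
    for item in ids:
        link = find_link(item)
        if link is None:
            results[item] = None
            continue
        # Same naming rule as the answer: ID after the last "=" plus ".pdb".
        filename = link.split("=")[-1] + ".pdb"
        save(link, filename)
        results[item] = filename
    return results


# Stand-ins for the network calls (assumed link format).
fake_links = {
    "YGL130W": "/pdb/explore/explore.do?structureId=3KYH",
    "YDL159W": "/pdb/explore/explore.do?structureId=3FWB",
}
saved = []

results = download_first_hits(
    ["YGL130W", "YDL159W", "YOR181W"],
    find_link=fake_links.get,
    save=lambda link, filename: saved.append(filename),
)

for item, filename in results.items():
    if filename:
        print("Downloaded " + item + " as " + filename)
    else:
        print(item + " was not found")
```

In real use, `find_link` would wrap the form submission and table lookup, and `save` would wrap the retrieve call.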

I read and understood your code, thanks. I have a list l = [YGL130W, YDL159W, YOR181W]. For each ID in it, I have to go to http://www.rcsb.org/pdb/home/home.do and search for it on that site. The result page has a "search pdb" link. I have to click on that, and then I get either the PDB download page or multiple PDBs. If there are multiple PDBs, I have to download the first one in the search results. — sam, Jan 08 '14 at 05:54
