
I am quite new to Python and programming; all I know how to do is write simple scripts for my routine office work. However, I have run into a scenario where I need to use Python to access a particular webpage: the search output of a bioinformatics web server.

On that page there is a table in which the second column is a hyperlink that opens a small pop-up box containing a FASTA file of the protein sequence.

I would like to write a script that follows these links systematically, one after the other, copies the FASTA sequence from each, and writes them all into a text file.

Is this kind of automation possible with Python? If so, where do I start, in terms of modules for accessing Internet Explorer or web pages in general? If you could kindly point me in the right direction or give me an example script, I could try to do it myself!

Thank you so much!

I would post what I have tried, but I have literally no idea where to start!

user1998510
  • Look into http://docs.python-requests.org/en/latest/ and http://www.crummy.com/software/BeautifulSoup/ (see the sketch after these comments). – konart Jun 04 '15 at 08:43
  • Try reading the urllib doc here: https://docs.python.org/2/library/urllib.html – Pierre.Sassoulas Jun 04 '15 at 08:43
  • Could you give the name/URL of the specific bioinformatics web server you are trying to access? – BioGeek Jun 04 '15 at 08:46
  • Your problem falls in the category we call web scraping. See this question for a bunch of Python tools that can help you: http://stackoverflow.com/questions/2081586/web-scraping-with-python – Chris Wesseling Jun 04 '15 at 09:00
  • Thank you all so much for your responses. @BioGeek, the web server is a proprietary server from Thomson Reuters. You can make a trial account for 30 days if you want to give it a whirl. It's called SequenceBase. https://usgene.sequencebase.com – user1998510 Jun 04 '15 at 09:07
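
A minimal sketch of the requests + BeautifulSoup approach suggested in the first comment; the URL, the table lookup, and the column position are placeholders to adapt to the real results page:

import requests
from bs4 import BeautifulSoup

# Placeholder URL: substitute the real search-results page.
url = 'https://example.com/search-results'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# Walk the rows of the results table and print the link found in
# each row's second column (adjust the selectors to the real markup).
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if len(cells) > 1 and cells[1].a is not None:
        print(cells[1].a.get('href'))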

1 Answer


This takes about a minute and a half to run for me, after which it opens a text file with the sequences. You will, of course, need to fill in your own credentials, report page, and output path at the bottom of the script.

import os
import mechanize
import cookielib
from bs4 import BeautifulSoup
from urlparse import urljoin

class SequenceDownloader(object):

    def __init__(self, base_url, analyzes_page, email, password, result_path):
        self.base_url = base_url
        self.login_page = urljoin(self.base_url, 'login')
        self.analyzes_page = urljoin(self.base_url, analyzes_page)
        self.email = email
        self.password = password
        self.result_path = result_path
        self.browser = mechanize.Browser()
        self.browser.set_handle_robots(False)

        # attach a cookie jar so the login session persists across requests
        cj = cookielib.CookieJar()
        self.browser.set_cookiejar(cj)

    def login(self):
        self.browser.open(self.login_page)
        # select the first (and only) form and log in
        self.browser.select_form(nr=0)
        self.browser.form['email'] = self.email 
        self.browser.form['password'] = self.password 
        self.browser.submit()

    def get_html(self, url):
        # Fetch a page with the logged-in browser and return its HTML.
        self.browser.open(url)
        return self.browser.response().read()

    def scrape_overview_page(self, html):
        # Parse the results table and collect the FASTA sequence behind
        # the link in the second column of each result row.
        sequences = []
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', {'class': 'styled data-table'})
        table_body = table.find('tbody')

        rows = table_body.find_all('tr', {'class': 'search_result'})
        for row in rows:
            cols = row.find_all('td')
            # Resolve the link against the base URL in case it is relative
            # (urljoin leaves absolute URLs unchanged).
            sequence_url = urljoin(self.base_url, cols[1].a.get('href'))
            sequence_html = self.get_html(sequence_url)
            sequence_soup = BeautifulSoup(sequence_html, 'html.parser')
            sequence = sequence_soup.find('pre').text
            sequences.append(sequence)
        return sequences

    def save(self, sequences):
        # Write the collected sequences to the result file, one per line.
        with open(self.result_path, 'w') as f:
            for sequence in sequences:
                f.write(sequence + '\n')

    def get_sequences(self):
        self.login()
        overview_html = self.get_html(self.analyzes_page)
        sequences = self.scrape_overview_page(overview_html)
        self.save(sequences)


if __name__ == '__main__':
    base_url = r'https://usgene.sequencebase.com'
    analyzes_page = 'user/reports/123/analyzes/9876'
    email = 'user1998510@gmail.com'
    password = 'YourPassword'
    result_path = r'C:\path\to\result.fasta'

    sd = SequenceDownloader(base_url, analyzes_page, email, password, result_path)
    sd.get_sequences()
    os.startfile(result_path)
BioGeek
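
Note that this script targets Python 2: cookielib and urlparse became http.cookiejar and urllib.parse in Python 3. For anyone on Python 3, here is a minimal sketch of the same login-and-scrape flow using requests instead of mechanize; the login form field names and the table markup are assumptions carried over from the script above:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'https://usgene.sequencebase.com'

# A Session keeps the login cookie across requests, playing the role
# of mechanize's cookie jar in the script above.
session = requests.Session()
# Assumed form field names; inspect the real login form to confirm.
session.post(urljoin(base_url, 'login'),
             data={'email': 'you@example.com', 'password': 'YourPassword'})

html = session.get(urljoin(base_url, 'user/reports/123/analyzes/9876')).text
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'styled data-table'})

sequences = []
for row in table.find('tbody').find_all('tr', {'class': 'search_result'}):
    href = row.find_all('td')[1].a.get('href')
    pop_up = session.get(urljoin(base_url, href)).text
    sequences.append(BeautifulSoup(pop_up, 'html.parser').find('pre').text)

with open('result.fasta', 'w') as f:
    f.write('\n'.join(sequences) + '\n')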
  • Wow! I am so grateful for people like you, who take the time to go through a stranger's problem, dedicate their time to answering the question, and educate beginners like me! Thank you so much for this script. I am studying it closely to see which modules you have used (I have downloaded mechanize and BeautifulSoup). I am assuming cookielib and urlparse/urljoin are part of Python 2.7 (the version I have), as it didn't flag any errors. I changed the username/id/analyses page etc. options at the bottom. I am still getting an error when I run the script, though. I'll post it in the next message. – user1998510 Jun 04 '15 at 11:17
  • Traceback (most recent call last):
        File "Path/Sequence_retrieval_from_SequenceBase_Stackoverflow.py", line 71, in <module>
          sd.get_sequences()
        File "Path/Sequence_retrieval_from_SequenceBase_Stackoverflow.py", line 59, in get_sequences
          sequences = self.scrape_overview_page(overview_html)
        File "Path/Sequence_retrieval_from_SequenceBase_Stackoverflow.py", line 37, in scrape_overview_page
          soup = BeautifulSoup(html)
      TypeError: 'module' object is not callable
    – user1998510 Jun 04 '15 at 11:19
  • Ah, it's to do with Beautiful Soup 3 vs. Beautiful Soup 4. I've got bs4 now (see the import comparison after these comments). – user1998510 Jun 04 '15 at 12:00
  • Hey BioGeek, sorry for the delay in responding, but yes, in the end it worked perfectly with bs4! Great script. Of course, the website then added the function themselves after I complained that users shouldn't have to write their own scripts for the money we pay to use that service! – user1998510 Oct 09 '15 at 09:16
  • @Meet Please create a new question with the website URL and what you tried. – BioGeek Jan 12 '22 at 13:34
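
For reference, the TypeError in the traceback above is the classic symptom of the Beautiful Soup 3 vs. 4 import mix-up; a small illustration, assuming both packages are installed under Python 2:

import BeautifulSoup  # Beautiful Soup 3: this binds the module, not the class
try:
    BeautifulSoup('<pre>MKT</pre>')
except TypeError as e:
    print(e)  # "'module' object is not callable"

from bs4 import BeautifulSoup  # Beautiful Soup 4: the class lives in the bs4 package
print(BeautifulSoup('<pre>MKT</pre>', 'html.parser').pre.text)  # MKT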