I've been trying to scrape the search results of the AlphaFold Protein Structure Database, but I can't find the information I want in the scraped output. For example, if I put the keyword "Alpha-elapitoxin-Oh2b" in the search bar and click the search button, it generates a new page with the URL https://alphafold.ebi.ac.uk/search/text/Alpha-elapitoxin-Oh2b. In Google Chrome, I used "Inspect" to check the code of this page and found the search result I want, i.e. the ID of this protein: P82662. However, when I used requests and bs4 to scrape this page, I couldn't find "P82662" in the returned HTML, nor even the search term "Alpha-elapitoxin-Oh2b".

import requests
from bs4 import BeautifulSoup
response = requests.get('https://alphafold.ebi.ac.uk/search/text/Alpha-elapitoxin-Oh2b')
html = response.text
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())

I searched Stack Overflow for why BS4 and requests can't find the result, and someone said it is because the search results page is rendered with JavaScript. Is that true? How can I solve this problem?

Thanks!

Jiang Xu

1 Answer

The search data you want is loaded dynamically from an external source: the page fetches it as JSON from an API with a GET request. It is not in the initial HTML, so bs4 gets an empty ResultSet.

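import requests

# Call the JSON API that the search page itself queries in the background.
res = requests.get('https://alphafold.ebi.ac.uk/api/search?q=%28text%3A%2aAlpha%5C-elapitoxin%5C-Oh2b%20OR%20text%3AAlpha%5C-elapitoxin%5C-Oh2b%2a%29&type=main&start=0&rows=20')

# Each entry in 'docs' is one search hit; print its UniProt accession.
for item in res.json()['docs']:
    id_num = item['uniprotAccession']
    print(id_num)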

Output:

P82662
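
To see what the long query string in that URL actually contains, it can be decoded with the standard library. This is only an illustrative sketch: the parameter names (q, type, start, rows) are read off the URL above, not taken from any official API documentation.

```python
from urllib.parse import urlsplit, parse_qs

# The API URL from the answer above, copied unchanged (split only for readability).
API_URL = (
    "https://alphafold.ebi.ac.uk/api/search"
    "?q=%28text%3A%2aAlpha%5C-elapitoxin%5C-Oh2b%20OR%20"
    "text%3AAlpha%5C-elapitoxin%5C-Oh2b%2a%29&type=main&start=0&rows=20"
)

# parse_qs percent-decodes every value and returns a dict of lists;
# each parameter occurs once here, so keep only the first value.
params = {key: values[0] for key, values in parse_qs(urlsplit(API_URL).query).items()}

for key, value in params.items():
    print(f"{key} = {value}")
```

The q value decodes to (text:*Alpha\-elapitoxin\-Oh2b OR text:Alpha\-elapitoxin\-Oh2b*), i.e. the search term with its hyphens backslash-escaped and a wildcard on either side.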

Md. Fazlul Hoque
  • 1
    Hi Fazlul, thank you for your answer. I am just curious how you found such a page. I googled AlphaFold but didn't know about the existence of such a site. Also, how do I convert the string Alpha-elapitoxin-Oh2b to something like %28text%3A%2aAlpha%5C-elapitoxin%5C-Oh2b%20OR%20text%3AAlpha%5C-elapitoxin%5C-Oh2b%2a%29&type=main&start=0&rows=20? Thanks. – Jiang Xu Nov 12 '22 at 05:32
  • 1
    @Jiang Xu, thanks. While inspecting the page, go to the Network tab and select XHR, then refresh the page with the circular reload icon at the far top left; the API URL will appear along with the other requests. You don't need to URL-encode anything yourself, since that is how the API URL already appears, but you can also see the decoded value of that portion in the Payload tab under Query String Parameters. – Md. Fazlul Hoque Nov 12 '22 at 06:01
  • 1
    You can also find a couple of discussions about how to find an API URL here: https://stackoverflow.com/questions/1820927/request-monitoring-in-chrome/3019085#3019085 – Md. Fazlul Hoque Nov 12 '22 at 06:09
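
On the question of converting a search term into that encoded form: one possible way is to build the query string in Python rather than copying it from the browser. The sketch below is not a documented API. build_search_url is a hypothetical helper, and the query pattern (hyphens escaped with a backslash, the term wrapped in a leading-wildcard OR trailing-wildcard clause) is inferred only from the single URL in the answer, so other terms or special characters may need different handling.

```python
from urllib.parse import urlencode, quote

API_BASE = "https://alphafold.ebi.ac.uk/api/search"  # endpoint taken from the answer


def build_search_url(term, start=0, rows=20):
    """Hypothetical helper: build a search URL shaped like the one in the answer."""
    # Assumption: only hyphens need a backslash escape, as in the observed URL.
    escaped = term.replace("-", "\\-")
    # Assumption: the site searches with a leading OR a trailing wildcard.
    query = f"(text:*{escaped} OR text:{escaped}*)"
    params = {"q": query, "type": "main", "start": start, "rows": rows}
    # quote_via=quote encodes spaces as %20, matching the observed URL.
    return f"{API_BASE}?{urlencode(params, quote_via=quote)}"


url = build_search_url("Alpha-elapitoxin-Oh2b")
print(url)
```

The resulting URL can be passed to requests.get(url) exactly as in the answer. It differs from the URL there only in the letter case of the percent-escapes (%2A instead of %2a), which is equivalent.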