
For each row in the table on this page, I would like to click on the ID (e.g. the ID of row 1 is 270516746) and extract/download the information behind it (which does NOT have the same headers for each row) into some form of Python object, ideally either a JSON object or a dataframe (JSON is probably easier).
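
Ideally I'd end up with something shaped roughly like this (the keys are just my guess at the pop-up's headings, so this is a hypothetical target, not real output):

{
    "270516746": {
        "Reference information": {"Id": "...", "Bioactivity": "..."},
        "Source proteins": ["...", "..."]
    }
}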

I've gotten to the point where I can get to the table I want to pull down:

import os
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import sys

driver = webdriver.Chrome()
driver.get('http://mahmi.org/explore.php?filterType=&filter=&page=1')

# find the table with ID, Sequence, Bioactivity and Similarity
rows = driver.find_elements_by_css_selector('table.table-striped tr')
for row in rows[1:2]:  # restricted to the first data row, only for testing
    # each row holds an ID, sequence, bioactivity and similarity
    row_id, seq, bioact, sim = row.text.split()
    print(row_id)

    # click on each ID to get the full data behind it
    button = driver.find_element_by_xpath('//button[text()="270516746"]')  # one example, hard-coded
    button.click()

    # then pull down all the info to a JSON file?
    full_table = driver.find_element_by_xpath('.//*[@id="source-proteins"]')
    print(full_table)
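
(As an aside, I realize print(full_table) only shows the WebElement's repr; to see the actual content I'd presumably need one of these:)

print(full_table.text)                        # the element's visible text
print(full_table.get_attribute('innerHTML'))  # or its raw HTML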

And then I'm stuck on what's probably the very last step: I can't find how to say '.to_json()' or '.to_dataframe()' on whatever comes back once the button is clicked in the line above.

If someone could advise, I would appreciate it.

Update 1: Deleted and incorporated into the above.

Update 2: Further to the suggestion below to use BeautifulSoup, my issue is how to navigate to the 'modal-body' class of the pop-up window and then use BeautifulSoup:

# then pull down all the info to a JSON file?
full_table = driver.find_element_by_class_name("modal-body")
soup = BeautifulSoup(full_table, 'html.parser')
print(soup)

returns the error:

    soup = BeautifulSoup(full_table,'html.parser')
  File "/Users/kela/anaconda/envs/selenium_scripts/lib/python3.6/site-packages/bs4/__init__.py", line 287, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'WebElement' has no len()
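
(From the traceback I gather that BeautifulSoup wants a markup string, not a WebElement, so presumably I need to hand it the element's HTML first; a minimal sketch I haven't verified:)

full_table = driver.find_element_by_class_name("modal-body")
# pass the element's HTML string to BeautifulSoup, not the WebElement itself
soup = BeautifulSoup(full_table.get_attribute('innerHTML'), 'html.parser')
print(soup)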

Update 3: Then I tried to scrape the page using ONLY BeautifulSoup:

from bs4 import BeautifulSoup 
import requests

url = 'http://mahmi.org/explore.php?filterType=&filter=&page=1'
html_doc = requests.get(url).content
soup = BeautifulSoup(html_doc, 'html.parser')
container = soup.find("div", {"class": "modal-body"})
print(container)

and it prints:

<div class="modal-body">
<h4><b>Reference information</b></h4>
<p>Id: <span id="info-ref-id">XXX</span></p>
<p>Bioactivity: <span id="info-ref-bio">XXX</span></p>
<p><a id="info-ref-seq">Download sequence</a></p><br/>
<h4><b>Source proteins</b></h4>
<div id="source-proteins"></div>
</div>

But this is not the output that I want: it's not printing the nested data (e.g. there is more info beneath the source-proteins div, which presumably only gets filled in dynamically after the page loads).

Update 4: When I add to the original code above (before the updates):

import json

full_table = driver.find_element_by_class_name("modal-body")
with open('test_outputfile.json', 'w') as output:
    json.dump(full_table, output)

The output is 'TypeError: Object of type 'WebElement' is not JSON serializable', which I'm trying to figure out now.
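
(Presumably this is the same issue as in Update 2: json.dump needs plain Python data such as a string, not a WebElement. A sketch of what might work, though it only dumps raw HTML rather than structured data:)

import json

full_table = driver.find_element_by_class_name("modal-body")
with open('test_outputfile.json', 'w') as output:
    # the element's HTML is a plain string, so it is JSON serializable
    json.dump(full_table.get_attribute('innerHTML'), output)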

Update 5: Trying to copy this approach, I added:

full_div = driver.find_element_by_css_selector('div.modal-body')
for element in full_div:
    new_element = element.find_element_by_css_selector('<li>Investigation type: metagenome</li>')
    print(new_element.text)

(where I added that li element just to see if it would work), but I get the error:

Traceback (most recent call last):
  File "scrape_mahmi.py", line 28, in <module>
    for element in full_div:
TypeError: 'WebElement' object is not iterable
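
(I suspect the fix is the plural find_elements_by_css_selector, which returns an iterable list, combined with a real CSS selector rather than HTML markup; an untested sketch:)

# plural find_elements_... returns a (possibly empty) list of elements
for element in driver.find_elements_by_css_selector('div.modal-body li'):
    print(element.text)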

Update 6: I tried looping through the ul/li elements, because I saw that the text I wanted was in li elements embedded in a ul, in a li, in a ul, in a div; so I tried:

html_list = driver.find_elements_by_tag_name('ul')
for outer_ul in html_list:
    items = outer_ul.find_elements_by_tag_name('li')
    for item in items:
        inner_uls = item.find_elements_by_tag_name('ul')
        for inner_ul in inner_uls:
            inner_items = inner_ul.find_elements_by_tag_name('li')
            for inner_li in inner_items:
                print(inner_li.text)

There's no error for this; I just get no output.
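
(My current guess is that those lists stay empty until one of the ID buttons has been clicked and the modal has finished loading, so presumably I need something like the following first; the wait condition is my assumption, untested:)

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# open the modal for one peptide, then wait for its list items to render
driver.find_element_by_xpath('//button[text()="270516746"]').click()
WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.modal-body li')))
for item in driver.find_elements_by_css_selector('div.modal-body li'):
    print(item.text)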

  • Does this answer your question? [Convert a HTML Table to JSON](https://stackoverflow.com/questions/18544634/convert-a-html-table-to-json) – Naveen Mar 24 '20 at 14:32
  • This would be great, except that I don't know how to link it to my example (i.e. I can understand this for a static page, but not how to combine it with clicking through to the right table and then identifying the class 'modal-body', which probably has to be done with Selenium and not BeautifulSoup). Thank you. – Slowat_Kela Mar 24 '20 at 14:45
  • I'm updating my original question to show specifically what I don't understand about this method. – Slowat_Kela Mar 24 '20 at 14:51

2 Answers


You don't have to click using the visible text. You can build a generic XPath such as:

"(//table//td[1])//button[@data-target]"

This matches every button in the first column of the table, so you can loop over them:

count = len(driver.find_elements_by_xpath("(//table//td[1])//button[@data-target]"))
for i in range(count):
    # re-locate the i-th button each time round the loop (XPath indices are 1-based)
    driver.find_element_by_xpath("((//table//td[1])//button[@data-target])[" + str(i + 1) + "]").click()
    # get the text content from the pop-up window
    text = driver.find_element_by_xpath("//div[@class='modal-content']").text
    # then click close
    driver.find_element_by_xpath("//button[text()='Close']").click()
– Naveen
  • Thank you, when I run exactly your answer I get 'selenium.common.exceptions.ElementNotInteractableException: Message: element not interactable' referring to the last line (the 'Close' line). So then I commented out the close line, and then there's an error with the line that clicks the target (just below 'for i in range(count)'): 'selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element – Slowat_Kela Mar 24 '20 at 14:37
  • Can you try increasing the implicit wait time, or adding some explicit waits for elements to be visible? – Naveen Mar 24 '20 at 18:08
  • The XHR gives me the URL http://mahmi.org/api/peptides/sourceProteins/241282699, which is constant up to http://mahmi.org/api/peptides/sourceProteins + peptide ID, so your request would be constant + Peptide_ID. Let me know if I am making sense – Prakhar Jhudele Mar 24 '20 at 18:54

I do not know if you found the answer, but this is the approach where Selenium is not required: you can hit the XHR endpoint for each peptide to get the details shown in the modal box. Be careful, though: this is just a rough outline; you still need to put the items into a JSON dump or whichever format you like. Here is my approach.

import pandas as pd
import requests
from xml.etree import ElementTree as et


url = "http://mahmi.org/explore.php?filterType=&filter=&page=1"
html = requests.get(url).content

# read the peptide table straight out of the static HTML
df_list = pd.read_html(html)
df = df_list[-1]

headers = {
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}

pep_ids = df['ID'].tolist()
#pep_ids = ['270516746','268297434'] ## You can use this first to check the output

# hit the XHR endpoint for each peptide and walk the returned XML
base_url = 'http://mahmi.org/api/peptides/sourceProteins/'
for pep_id in pep_ids:
    final_url = base_url + str(pep_id)
    page = requests.get(final_url, headers=headers)
    tree = et.fromstring(page.content)
    for child in tree.iter('*'):
        print(child.tag, child.text)
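
Continuing from the snippet above, a rough sketch of the "put the items in a JSON dump" step: assuming the endpoint really returns XML as shown, xmltodict can flatten each response into nested dicts keyed by peptide ID (xmltodict and the output filename are my choices here, not part of the site's API):

import json
import xmltodict

all_peptides = {}
for pep_id in pep_ids:
    page = requests.get(base_url + str(pep_id), headers=headers)
    # xmltodict turns the XML payload into nested dicts/lists
    all_peptides[str(pep_id)] = xmltodict.parse(page.content)

with open('peptides.json', 'w') as f:
    json.dump(all_peptides, f, indent=2)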
– Prakhar Jhudele