
I have a table on a website, much like this:

<table class="table-class">
  <thead>
    <tr>
      <th>Col 1</th>
      <th>Col 2</th>
      <th>Col 3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
     <td>Hello</td>
     <td>A number</td>
     <td>Another number<td>
   </tr>
   <tr>
     <td>there</td>
     <td>A number</td>
     <td>Another number<td>
   </tr>
  </tbody>
</table>

Ultimately, what I would like to do is read the content of each td in each row and produce a string containing all three cells for that row. Furthermore, I would like this to scale to larger tables on numerous websites that use the same design, so speed is somewhat of a priority, but not a necessity.

I assume I have to use something like find_elements_by_xpath(...), but I'm really hitting a wall with this. I've attempted several approaches suggested on other sites and seem to do more things wrong than right. Any suggestion or idea would be hugely appreciated!

What I currently have, although non-functioning and based on another question from here, is:

list_of_lists = [[td.text
                  for td in tr.find_elements_by_xpath('td')]
                 for tr in driver.find_elements_by_xpath("//table[@class='table-class']//tr")]
list_of_dicts = [dict(zip(list_of_lists[0], row)) for row in list_of_lists[1:]]

Thanks in advance!

vham


2 Answers


If you are familiar with the DOM (Document Object Model), then you can use the answers in this post and use the BeautifulSoup library to load the HTML into a DOM-like structure. After that you can simply find every instance of <tr> and, for each of those instances, find all the respective <td> tags inside. Think of the DOM as a tree structure where branching happens at nested tags.
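
A minimal sketch of that idea, assuming the page's HTML is already available in a string (called html here purely for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")   # build the DOM-like tree
for tr in soup.find_all('tr'):              # each row node in the tree
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    print(cells)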

Amir Zadeh

Depending on the website you are trying to access, you might not need Selenium at all. You could just fetch the HTML using requests.

For the HTML you have given, you could use BeautifulSoup to extract the table information as follows:

from bs4 import BeautifulSoup

html = """
<table class="table-class">
  <thead>
    <tr>
      <th>Col 1</th>
      <th>Col 2</th>
      <th>Col 3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
     <td>Hello</td>
     <td>A number</td>
     <td>Another number<td>
   </tr>
   <tr>
     <td>there</td>
     <td>A number</td>
     <td>Another number<td>
   </tr>
  </tbody>
</table>"""

soup = BeautifulSoup(html, "html.parser")
rows = []

for tr in soup.find_all('tr'):
    cols = []
    for td in tr.find_all(['td', 'th']):
        td_text = td.get_text(strip=True)
        if len(td_text):
            cols.append(td_text)
    rows.append(cols)

print(rows)

Giving you rows holding:

[[u'Col 1', u'Col 2', u'Col 3'], [u'Hello', u'A number', u'Another number'], [u'there', u'A number', u'Another number']]
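
If you then want one string per row, as asked in the question, you could simply join the cells (the " | " separator is just an example):

row_strings = [" | ".join(cols) for cols in rows]
print(row_strings)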

To use requests, it would start something like:

import requests            

response = requests.get(url)
html = response.text
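
From there, the downloaded html can be handed to BeautifulSoup exactly as above (url is just a placeholder for whichever page uses this table design):

soup = BeautifulSoup(response.text, "html.parser")   # parse the downloaded page
# ...then loop over soup.find_all('tr') exactly as shown earlier
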
Martin Evans
  • The `u'xxxx'` means it is a unicode string. BeautifulSoup always first converts everything to unicode. As you say, you can convert it if required. – Martin Evans Mar 22 '17 at 17:44