I have a table with several columns. The last column may contain links to documents; the number of links per cell is not fixed (zero or more).

<tbody>
     <tr>
        <td>
          <h2>Table Section</h2>
        </td>
      </tr>
    <tr>
      <td>
        <a href="#">Object 1</a>
      </td>
      <td>Param 1</td>
      <td>
        <span class="text-nowrap">Param 2</span>        
      </td>
      <td class="text-nowrap"></td>
    </tr>

    <tr>
      <td>
        <a href="#">Object 2</a>
      </td>
      <td>Param 1</td>
      <td>
        <span class="text-nowrap">Param 2</span>
      <td>
          <ul>
            <li>
              <small>
                <a href="link_to.doc">Title</a>Notes
              </small>
            </li>

            <li>
              <small>
                <a href="another_link_to.doc">Title2</a>Notes2
              </small>
            </li>
          </ul>
      </td>
    </tr>
</tbody>

So basic parsing is not a problem. I'm stuck on extracting those links together with their titles and notes and appending them to a Python list (or NumPy array).

from bs4 import BeautifulSoup

with open("new 1.html", encoding="utf8") as dump:
    soup = BeautifulSoup(dump, features="lxml")

data = []

table_body = soup.find('tbody')
rows = table_body.find_all('tr')

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)
    a = row.find_all('a')
    for ele1 in a:
        if ele1.get('href') != "#":
            data.append([ele1.get('href')])
print(*data, sep='\n')

Output:

['Table Section']
['Object 1', 'Param 1', 'Param 2', '']
['Object 2', 'Param 1', 'Param 2', 'TitleNotes\n\t\t\t  \n\n\n\nTitle2Notes2']
['link_to.doc']
['another_link_to.doc']

Is there any way to append the links to the list of the row they belong to? I would like the list for the second data row to look like this:

['Object 2', 'Param 1', 'Param 2', 'Title', 'Notes', 'link_to.doc', ' Title2', 'Notes2', 'another_link_to.doc']
deff
  • Post your python code pls – Wonka Oct 17 '19 at 16:07
  • Here are some tutorials to help https://www.digitalocean.com/community/tutorials/how-to-scrape-web-pages-with-beautiful-soup-and-python-3 and https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/ – Eric Leung Oct 17 '19 at 16:10
  • Possible duplicate of [Parsing HTML using Python](https://stackoverflow.com/questions/11709079/parsing-html-using-python) – Edeki Okoh Oct 17 '19 at 16:45

1 Answer

Something like this collects the href of the link inside each <small> element:

from bs4 import BeautifulSoup


html = '''<tbody>
     <tr>
        <td>
          <h2>Table Section</h2>
        </td>
      </tr>
    <tr>
      <td>
        <a href="#">Object 1</a>
      </td>
      <td>Param 1</td>
      <td>
        <span class="text-nowrap">Param 2</span>        
      </td>
      <td class="text-nowrap"></td>
    </tr>

    <tr>
      <td>
        <a href="#">Object 2</a>
      </td>
      <td>Param 1</td>
      <td>
        <span class="text-nowrap">Param 2</span>
      <td>
          <ul>
            <li>
              <small>
                <a href="link_to.doc">Title</a>Notes
              </small>
            </li>

            <li>
              <small>
                <a href="another_link_to.doc">Title2</a>Notes2
              </small>
            </li>
          </ul>
      </td>
    </tr>
</tbody>'''


soup = BeautifulSoup(html, features="lxml")
smalls = soup.find_all('small')
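# Look the <a> up by tag name rather than by position in .contents,
# which shifts if the whitespace around the link changes.
links = [s.find('a')['href'] for s in smalls]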
print(links)

Output:

['link_to.doc', 'another_link_to.doc']
balderman
  • Is there any way to append links to the first list? I wish a list for a second row looked like this: ['Object 2', 'Param 1', 'Param 2', 'Title', 'Notes', 'link_to.doc', ' Title2', 'Notes2', 'another_link_to.doc'] – deff Oct 17 '19 at 19:13
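
A minimal sketch of one way to get the flat per-row list asked for above; this is not part of the original answer. It assumes each document sits in its own `<li>` with the title inside the `<a>` and the notes as the remaining text of that `<li>`. The helper name `flatten_row` is made up for illustration. The sample is a trimmed copy of the question's table with the missing `</td>` in the second data row restored, because the stdlib `"html.parser"` backend used here does not repair that the way `lxml` appears to in the question's output.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup. The </td> missing before the last
# cell of the second data row is restored so "html.parser" does not nest cells.
html = """
<table><tbody>
  <tr><td><h2>Table Section</h2></td></tr>
  <tr>
    <td><a href="#">Object 1</a></td>
    <td>Param 1</td>
    <td><span class="text-nowrap">Param 2</span></td>
    <td class="text-nowrap"></td>
  </tr>
  <tr>
    <td><a href="#">Object 2</a></td>
    <td>Param 1</td>
    <td><span class="text-nowrap">Param 2</span></td>
    <td>
      <ul>
        <li><small><a href="link_to.doc">Title</a>Notes</small></li>
        <li><small><a href="another_link_to.doc">Title2</a>Notes2</small></li>
      </ul>
    </td>
  </tr>
</tbody></table>
"""


def flatten_row(row):
    """Hypothetical helper: build one flat list for a single <tr>."""
    flat = []
    for cell in row.find_all("td"):
        items = cell.find_all("li")
        if not items:
            # Ordinary cell: keep its stripped text (may be an empty string).
            flat.append(cell.get_text(strip=True))
            continue
        for li in items:
            link = li.find("a")
            if link is None:
                # List item without a link: keep its text as a single entry.
                flat.append(li.get_text(strip=True))
                continue
            title = link.get_text(strip=True)
            # Notes: every text node of the <li> that is not the link's own text.
            notes = "".join(s for s in li.strings if s.parent is not link).strip()
            flat.extend([title, notes, link.get("href")])
    return flat


soup = BeautifulSoup(html, features="html.parser")
data = [flatten_row(row) for row in soup.find("tbody").find_all("tr")]
print(*data, sep="\n")
```

Handling the link cell inside the per-row loop is what keeps each title, note, and href in the same list as the rest of that row, instead of appending the hrefs as separate lists afterwards. Rows without links keep their last cell as an empty string, as in the question's output.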