I used BeautifulSoup to extract the cell text from the following HTML table:

<table>
<tr>
<td><b>Code</b></td>
<td><b>Display</b></td>
</tr>
<tr>
<td>min</td>
<td>Minute</td><td/>
</tr>
<tr>
<td>happy </td>
<td>Hour</td><td/>
</tr>
<tr>
<td>daily </td>
<td>Day</td><td/>
</tr>
</table>

This is my code:

comments = [td.get_text() for td in table.findAll("td")]
Comments = [data.encode('utf-8') for data in comments]

As you can see, the table has two header cells, "Code" and "Display", followed by rows of values. I expected the output to be [Code, Display, min, Minute, happy, Hour, daily, Day],

but this is the actual output:

['Code', 'Display', 'min', 'Minute', '', 'happy ', 
'Hour', '', 'daily ', 'Day', '']

The output has '' at the 5th, 8th, and 11th positions of comments, and these do not correspond to any value in the table. I think this may be because of the </td><td/> sequences. How can I change the code so that it does not capture u'' in the output?

Mary
  • @Noah, my problem is not the 'u' prefix. It is the u'' entries in the output list. After I convert the values to strings with Comments=[data.encode('utf-8') for data in comments], this is the output: ['Code', 'Display', 'min', 'Minute', '', 'happy ', 'Hour', '', 'daily ', 'Day', '']. Can you see the extra entries at the 5th, 8th, and 11th positions? – Mary Jun 01 '16 at 00:19

1 Answer

Sorry, I hadn't read your question carefully enough. You're right, the problem is the empty <td/> tags. Just adjust your list comprehension to include only the cells that contain text:

comments = [td.get_text() for td in table.findAll('td') if td.text]
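
To see the filter end to end, here is a minimal, self-contained sketch that runs against the table from the question. It assumes the bs4 package with the built-in "html.parser" backend, and it goes one step beyond the line above by calling get_text(strip=True), so trailing spaces such as the one in 'happy ' are removed as well. The variable names html and texts are illustrative only.

```python
from bs4 import BeautifulSoup

# The table from the question, with the closing </table> tag added.
html = """
<table>
<tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>min</td><td>Minute</td><td/></tr>
<tr><td>happy </td><td>Hour</td><td/></tr>
<tr><td>daily </td><td>Day</td><td/></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Strip whitespace from every cell, then drop the cells that end up empty.
texts = (td.get_text(strip=True) for td in table.find_all("td"))
comments = [text for text in texts if text]

print(comments)
# ['Code', 'Display', 'min', 'Minute', 'happy', 'Hour', 'daily', 'Day']
```

Note that this drops every empty cell, including ones that belong to a real column.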


EDIT: I doubt this is the most efficient way to do it, but this will only include tds that have either text or a corresponding td in the first row.
ths = table.tr.find_all('td')
# find_next_sibling('tr') skips the whitespace text nodes that next_sibling can return
tds_in_row = len(table.tr.find_next_sibling('tr').find_all('td'))

tds = [
    td.get_text()
    for i, td in enumerate(table.find_all('td'))
    if len(ths) > (i + 1) % tds_in_row or td.text
]
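
A possibly simpler way to express the same idea is to walk the table row by row and keep only as many cells per row as the first row has. This is only a sketch under a few assumptions: the first row is the header, no cell uses colspan, and the table has no nested tables. The sample HTML below is made up for illustration (the empty Display cell is not in the original table), and the names n_cols and cells are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical table: the last row has a genuinely empty "Display" cell,
# and every data row carries a stray trailing <td/>.
html = """
<table>
<tr><td><b>Code</b></td><td><b>Display</b></td></tr>
<tr><td>min</td><td>Minute</td><td/></tr>
<tr><td>happy</td><td></td><td/></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")

# The number of header cells defines how many columns are real.
n_cols = len(rows[0].find_all("td"))

# Keep the first n_cols cells of each row, empty or not; drop the extras.
cells = [
    td.get_text(strip=True)
    for tr in rows
    for td in tr.find_all("td")[:n_cols]
]

print(cells)
# ['Code', 'Display', 'min', 'Minute', 'happy', '']
```

Here the empty Display value is kept, while the trailing <td/> cells are discarded because they fall outside the header's columns.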
Noah
  • Thank you so much! – Mary Jun 01 '16 at 00:28
  • Sorry Noah, can you give me another solution? If I use the code you offered, it will not capture some of the null values (for example, in the Display column of other tables) that I really do want to capture. Do you think I can remove the <td/> from the table tags? – Mary Jun 01 '16 at 00:49
  • Can you give examples of when you would and would not want to keep the values? – Noah Jun 01 '16 at 01:30
  • For example, at the following URL, the Definition column of the table in the Content Logical Definition section has null values: https://www.hl7.org/fhir/valueset-contract-term-type.html But at this URL: https://www.hl7.org/fhir/valueset-age-units.html the table in the Content Logical Definition section returns null values even though no column defined in the table contains them. Thank you so much for your help! – Mary Jun 01 '16 at 15:53
  • Thank you so much! This part of your code, "tds_in_row = len(table.tr.next_sibling.find_all('td'))", raises this error: 'NavigableString' object has no attribute 'find_all' – Mary Jun 03 '16 at 01:27
  • @Mary It's probably picking up newline characters ('\n') in your table string. Either make sure that the table doesn't contain any line breaks, or use find_next_sibling('tr') instead of next_sibling. – Noah Jun 03 '16 at 01:46
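
The difference described in the last comment can be checked with a short sketch. It assumes bs4 with the "html.parser" backend and a table string that contains line breaks between rows. The variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Line breaks between the rows become text nodes in the parse tree.
html = (
    "<table>\n"
    "<tr><td>Code</td><td>Display</td></tr>\n"
    "<tr><td>min</td><td>Minute</td><td/></tr>\n"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
first_row = soup.find("table").tr

# next_sibling returns whatever node comes next, here the "\n" text node,
# which has no find_all method.
after = first_row.next_sibling
print(isinstance(after, NavigableString))  # True

# find_next_sibling('tr') skips text nodes and returns the next <tr> tag.
second_row = first_row.find_next_sibling("tr")
tds_in_row = len(second_row.find_all("td"))
print(tds_in_row)  # 3
```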