I'm trying to get all the text for a specific class, but it is returning an empty list:

>>> soup.find_all(' dataRow odd')
[]

html:

<tr class=" dataRow odd" onblur="if (window.hiOff){hiOff(this);}" 
onfocus="if (window.hiOn){hiOn(this);}" onmouseout="if (window.hiOff){hiOff(this);}" 
onmouseover="if (window.hiOn){hiOn(this);}"><td class='actionColumn'>&nbsp;</td><th scope="row" class=" dataCell  ">
<a href="/a0I9000000hHJIN?btdid=0019000001piFE9">textexttext</a></th><td class=" dataCell  ">Active</td><td class=" dataCell  ">
<a href="/a089000001nOvG8?btdid=0019000001piFE9">BIG TEXT</a></td>
<td class=" dataCell  ">TEXTTEXTTEXT</td><td class=" dataCell  ">TEXTTEXTTEXT</td>
<td class=" dataCell  "> </td><td class=" dataCell  ">&nbsp;</td><td class=" dataCell  DateElement">8/02/2019</td></tr>

I'm trying to grab ALL text within that code. But when I run my code it returns [] as if it didn't find anything.

import requests, bs4, re
html = open('2.html')
soup = bs4.BeautifulSoup(html, "lxml")
duh = soup.find_all(' dataRow odd')
print(duh)

Where am I going wrong? Ideally, the code would also print each separate piece of text on its own line.

Peter Wood
Alex
  • I believe your `findAll()` is being given the wrong argument. You would need to `findAll('tr', {"class": ' dataRow odd'})`. As in [this](https://stackoverflow.com/questions/5041008/how-to-find-elements-by-class) question. – ktb Jun 11 '17 at 05:30
  • Thanks. The problem is that it now prints the entire HTML. I'm trying to isolate just the text and print only that. – Alex Jun 11 '17 at 06:00
  • Python doesn't have `nil`. See [Ruby use case for nil, equivalent to Python None or JavaScript undefined](https://stackoverflow.com/questions/3884004/ruby-use-case-for-nil-equivalent-to-python-none-or-javascript-undefined) and [What is closer to python None: nil or NULL?](https://stackoverflow.com/questions/25498810/what-is-closer-to-python-none-nil-or-null) – Peter Wood Jun 11 '17 at 06:14
  • RTFM before asking. bs4 manual is straightforward – internety Jun 11 '17 at 08:23
  • Hi, yes, I have read the manual and still need help, thanks anyway. Yes, I meant `None`. It prints as []. – Alex Jun 11 '17 at 10:51
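
To illustrate ktb's point, here is a minimal sketch of why the original call comes back empty. The first positional argument to `find_all()` is treated as a tag name, so the class has to be passed as an attribute filter instead. The HTML string is a trimmed-down stand-in for the row in the question, and the names `by_name` and `by_class` are only illustrative.

```python
import bs4

# Trimmed-down stand-in for the row in the question (illustrative sample data).
html = '<table><tr class=" dataRow odd"><td class=" dataCell  ">Active</td></tr></table>'
soup = bs4.BeautifulSoup(html, "html.parser")

# The first positional argument is a tag name, so this searches for
# elements *named* " dataRow odd", and no such element exists.
by_name = soup.find_all(' dataRow odd')

# Filtering <tr> elements on one of the row's class names finds the row.
by_class = soup.find_all('tr', class_='dataRow')

print(len(by_name), len(by_class))
```

Matching on a single class name such as `dataRow` does not depend on the exact whitespace inside the `class` attribute, which is why the sketch filters that way.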

1 Answer

Querying for dataRow odd returns the surrounding <tr>, including every element nested inside it (<td>, <a>, and so on). To get only the text, read the .text property, which returns one big blob of text instead of HTML:

for d in duh:
    print(d.text)
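
As a rough illustration of what that blob looks like, the sketch below compares `.text` with `get_text()` called with a separator. The HTML is a shortened, made-up version of the row from the question, and `row`, `blob`, and `lines` are names chosen only for this example.

```python
import bs4

# Shortened, made-up version of the row from the question.
html = ('<table><tr class=" dataRow odd"><th><a href="#">textexttext</a></th>'
        '<td>Active</td><td> </td><td>8/02/2019</td></tr></table>')
soup = bs4.BeautifulSoup(html, "html.parser")
row = soup.find('tr', class_='dataRow')

# .text concatenates every text node with nothing in between.
blob = row.text

# get_text() can insert a separator between text nodes and strip each one;
# whitespace-only nodes are dropped when strip=True.
lines = row.get_text(separator='\n', strip=True)

print(blob)
print(lines)
```

With `separator='\n'`, each piece of text ends up on its own line, which is close to what the question asks for.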

Alternatively, fetch the cell elements within that <tr> separately and take the .text of each one, which puts each piece of text on its own line.

import bs4

with open('test.html') as html:
    soup = bs4.BeautifulSoup(html, "html.parser")  # built-in parser; "lxml" also parses HTML if installed
# select rows carrying both classes (adapted from ktb's suggestion in the comments)
duh = soup.select('tr.dataRow.odd')
for d in duh:
    cells = d.find_all(['th', 'td'])  # only the cells, so nested <a> text is not printed twice
    for cell in cells:
        cleaned = cell.text.strip()  # remove surrounding whitespace, including newlines
        if cleaned != '':
            print(cleaned)
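
A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement: Beautiful Soup's `stripped_strings` generator yields every text node once, already stripped, and skips whitespace-only nodes. The HTML below is again a reduced, made-up copy of the row, and `texts` is an illustrative name.

```python
import bs4

# Reduced, made-up copy of the row from the question.
html = ('<table><tr class=" dataRow odd"><td class="actionColumn">&nbsp;</td>'
        '<th scope="row"><a href="#">textexttext</a></th>'
        '<td>Active</td><td> </td><td>8/02/2019</td></tr></table>')
soup = bs4.BeautifulSoup(html, "html.parser")

texts = []
for row in soup.find_all('tr', class_='dataRow'):
    # Each text node is yielded once, so text inside a nested <a> is not
    # repeated; nodes that are empty after stripping are skipped.
    texts.extend(row.stripped_strings)

for text in texts:
    print(text)
```

Because it walks text nodes rather than tags, this avoids the duplicate output that iterating over every descendant tag would produce for cells that contain links.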
chrki