
I'm still a Python noob trying to learn BeautifulSoup. I looked at solutions on Stack Overflow but was unsuccessful. Please help me understand this better. I have extracted the HTML, shown below:

<table cellspacing="0" id="ContentPlaceHolder1_dlDetails" 
     style="width:100%;border-collapse:collapse;">
     <tbody><tr>
     <td>
     <table border="0" cellpadding="5" cellspacing="0" width="70%">
     <tbody><tr>
     <td> </td>
     <td> </td>
     </tr>
     <tr>
     <td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
     <td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
     </tr>
     <tr>
     <td class="listmaintext">ATM ID: </td>
     <td class="listmaintext">DAGR00401111111</td>
     </tr>
     <tr>
     <td class="listmaintext">ATM Centre:</td>
     <td class="listmaintext"></td>
     </tr>
     <tr>
     <td class="listmaintext">Site Location: </td>
     <td class="listmaintext">ADA Building - Agra</td>
     </tr>

I tried to parse with find_all('tbody') but was unsuccessful:

        #table = bs.find("table", {"id": "ContentPlaceHolder1_dlDetails"})
        html = browser.page_source
        soup = bs(html, "lxml")
        table = soup.find_all('table', {'id':'ContentPlaceHolder1_dlDetails'})
        table_body = table.find('tbody')
        rows = table.select('tr')
        for row in rows:
            cols = row.find_all('td')
            cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])

I'm trying to save the values in the "listmaintext" class.

Error message: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
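The error above comes from calling find() on the ResultSet (a list of tags) that find_all() returns; only a single Tag supports find()/select(). A minimal sketch of the difference, using a toy table and the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<table id="t"><tr><td>x</td></tr></table>', 'html.parser')

# find_all() returns a ResultSet (list-like); it has no .find() method
tables = soup.find_all('table', {'id': 't'})
print(type(tables).__name__)  # ResultSet

# find() returns a single Tag, which does support .find() and .select()
table = soup.find('table', {'id': 't'})
rows = table.select('tr')
print(len(rows))  # 1
```

So either switch the original code to soup.find(...), or index the ResultSet, e.g. table = soup.find_all(...)[0].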

Peaches
  • It would help if you indicated desired output format. Above you seem to want a single list. – QHarr Jun 19 '19 at 09:46
  • The desired output format is ATM ID: DAGR00401111111 ATM Centre: Site Location: ADA Building - Agra Link Branch: Sol ID: 54000 State: Uttar Pradesh District: Agra Off Site Address: On Site Address: Union Bank of India, Adra Development Authority- Agra Branch, Agra Development Authority, Agra, Jaipur House, Loha Mandi, Uttar Pradesh - 282010 Pin Code: in CSV or JSON, as I have multiple HTMLs with the same headers – Peaches Jun 19 '19 at 09:47

2 Answers

from bs4 import BeautifulSoup

data = '''<table cellspacing="0" id="ContentPlaceHolder1_dlDetails"
     style="width:100%;border-collapse:collapse;">
     <tbody><tr>
     <td>
     <table border="0" cellpadding="5" cellspacing="0" width="70%">
     <tbody><tr>
     <td> </td>
     <td> </td>
     </tr>
     <tr>
     <td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
     <td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
     </tr>
     <tr>
     <td class="listmaintext">ATM ID: </td>
     <td class="listmaintext">DAGR00401111111</td>
     </tr>
     <tr>
     <td class="listmaintext">ATM Centre:</td>
     <td class="listmaintext"></td>
     </tr>
     <tr>
     <td class="listmaintext">Site Location: </td>
     <td class="listmaintext">ADA Building - Agra</td>
     </tr>'''

soup = BeautifulSoup(data, 'lxml')

s = soup.select('.listmaintext')
for td1, td2 in zip(s[::2], s[1::2]):
    print('{} [{}]'.format(td1.text.strip(), td2.text.strip()))

Prints:

ATM ID: [DAGR00401111111]
ATM Centre: []
Site Location: [ADA Building - Agra]
Andrej Kesely
  • Thank you so much. Can you please point me towards a resource which explains why we used ``` (s[::2], s[1::2]) ```? I really want to learn. Thank you in advance – Peaches Jun 19 '19 at 08:28
  • @Amir It's Python's slice notation (https://stackoverflow.com/questions/509211/understanding-slice-notation) [::2] -> means take every second item from the list (starting from 0) [1::2] -> take every second item starting from index 1 – Andrej Kesely Jun 19 '19 at 08:31
  • 1
    Thanks a billion @Andrej. That article really helped. – Peaches Jun 19 '19 at 08:34
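The slice pairing discussed in these comments can be seen on a plain list of the scraped strings (sample values taken from the HTML above):

```python
s = ['ATM ID:', 'DAGR00401111111', 'ATM Centre:', '', 'Site Location:', 'ADA Building - Agra']

labels = s[::2]   # every second item, starting from index 0 (the labels)
values = s[1::2]  # every second item, starting from index 1 (the values)

# zip() pairs each label with the value that followed it in the flat list
for label, value in zip(labels, values):
    print('{} [{}]'.format(label, value))
```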

Another way to do this is using next_sibling:

from bs4 import BeautifulSoup as bs

html ='''
<html>
<table cellspacing="0" id="ContentPlaceHolder1_dlDetails" 
     style="width:100%;border-collapse:collapse;">
     <tbody><tr>
     <td>
     <table border="0" cellpadding="5" cellspacing="0" width="70%">
     <tbody><tr>
     <td> </td>
     <td> </td>
     </tr>
     <tr>
     <td bgcolor="#4F95FF" class="listhead" width="49%">Location:</td>
     <td bgcolor="#4F95FF" class="listhead" width="51%">On Site </td>
     </tr>
     <tr>
     <td class="listmaintext">ATM ID: </td>
     <td class="listmaintext">DAGR00401111111</td>
     </tr>
     <tr>
     <td class="listmaintext">ATM Centre:</td>
     <td class="listmaintext"></td>
     </tr>
     <tr>
     <td class="listmaintext">Site Location: </td>
     <td class="listmaintext">ADA Building - Agra</td>
     </tr>
</html>'''

soup = bs(html, 'lxml')
data = [' '.join((item.text, item.next_sibling.next_sibling.text)) for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child') if item.text !='']
print(data)
QHarr
  • I tried the above, but I seem to be getting an empty array. Am I doing something wrong? html = browser.page_source soup = bs(html, "lxml") data = [' '.join((item.text, item.next_sibling.next_sibling.text)) for item in soup.select('#ContentPlaceHolder1_dlDetails tr .listmaintext:first-child') if item.text !=''] print(data) – Peaches Jun 19 '19 at 10:12
  • Can you provide the source url? I'm also using bs4 4.7.1 though you don't appear to be getting a version related error are you? – QHarr Jun 19 '19 at 10:15
  • No, I have the same version. url https://eremit.unionbankofindia.co.in/livebranch/ATMList.aspx – Peaches Jun 19 '19 at 10:23
  • Search for India and click on the first link. I am trying to extract all the information for all 188 pages – Peaches Jun 19 '19 at 10:24
  • It works for me. Do you need a wait condition to ensure data present? If I simply transfer page html I get: ['ATM ID: DAGR00401111111', 'ATM Centre: ', 'Site Location: ADA Building - Agra', 'Link Branch: ', 'Sol ID: 54000', 'State: Uttar Pradesh', 'District: Agra', 'Off Site Address: ', 'On Site Address: Union Bank of India, Adra Development Authority- Agra Branch, Agra Development Authority, Agra, Jaipur House, Loha Mandi, Uttar Pradesh - 282010', 'Pin Code: '] – QHarr Jun 19 '19 at 10:33
  • Let me restart my kernel and reload. Will I be able to loop all 188 pages with this? Won't it be slow? – Peaches Jun 19 '19 at 10:38
  • Is this happening because I'm using html = browser.page_source as input for my lxml? My output is still an empty array. I even updated my bs4 to 4.7.1 – Peaches Jun 19 '19 at 10:49
  • No. That shouldn't be the problem. I will have a go at writing something later today from scratch. – QHarr Jun 19 '19 at 10:50
  • Adding a comma between the CSS selectors gave me this output: ['\n\n\n\n\n\xa0\n\xa0\n\n\nLocation:\nOn Site \n\n\nATM ID: \nDAGR00401111111\n\n\nATM Centre:\n\n\n\nSite Location: \nADA Building - Agra\n\n\nLink Branch:\n\n\n\nSol ID: \n54000\n\n\nState:\nUttar Pradesh\n\n\nDistrict:\nAgra\n\n\nOff Site Address:\n\n\n\nOn Site Address:\nUnion Bank of India, Adra Development Authority- Agra Branch, Agra Development Authority, Agra, Jaipur House, Loha Mandi, Uttar Pradesh - 282010\n\n\nPin Code:\n\n\n\n\n\n \n\n\xa0\n\n\n\nBack\n\n'] – Peaches Jun 19 '19 at 10:59
  • That shouldn't be a solution as it changes the meaning of the selector. But interesting that that does happen. In which case you could sanitize the output. – QHarr Jun 19 '19 at 11:04
  • I don't understand how the two of us got different outputs. I will try to sanitize my output. Thank you so much for your time. – Peaches Jun 19 '19 at 11:06
  • Thank you so much, QHarr. You helped me out last time as well, to append the files. Thank you for helping me accelerate my learning. – Peaches Jun 19 '19 at 11:20
  • No, started cleaning the output. The suggestion from @Andrej worked. Writing out to a txt file and then converting it to a dictionary. Is there a better solution to this? I couldn't figure out why it had different output on our systems. – Peaches Jun 19 '19 at 14:26
  • I reinstalled BeautifulSoup (there was no version change) and restarted the kernel. It worked like a charm. Thank you so much. – Peaches Jun 19 '19 at 14:43
  • Wonderful. Glad to hear it. – QHarr Jun 19 '19 at 15:21
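For the CSV/JSON output discussed in the comments, the 'Label: value' strings this answer produces can be split into a dict and written with the stdlib csv module. A sketch under the assumption that each page yields one such list (the sample pairs below are from the HTML above; io.StringIO stands in for a real output file):

```python
import csv
import io

# one page's scraped 'Label: value' strings, as produced by the answer above
pairs = ['ATM ID: DAGR00401111111', 'ATM Centre: ', 'Site Location: ADA Building - Agra']

record = {}
for item in pairs:
    # split on the first colon only, so values containing commas stay intact
    label, _, value = item.partition(':')
    record[label.strip()] = value.strip()

# write the header once, then one row per page; swap StringIO for open('out.csv', 'w')
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```

For JSON, the same record dict can be passed straight to json.dump().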