I am scraping a website to get company and product details. The page has a div tag that contains li tags, and I want to get all the li tags within that div. I am using Python 3.5.1 and BeautifulSoup.

My code:

from bs4 import BeautifulSoup
import urllib.request
import re
r = urllib.request.urlopen('http://i.cantonfair.org.cn/en/ExpExhibitorList.aspx?k=glassware')
soup = BeautifulSoup(r, "html.parser")

links = soup.find_all("a", href=re.compile(r"expexhibitorlist\.aspx\?categoryno=[0-9]+"))
linksfromcategories = ([link["href"] for link in links])

string = "http://i.cantonfair.org.cn/en/"
linksfromcategories = [string + x for x in linksfromcategories]

for link in linksfromcategories:
    response = urllib.request.urlopen(link)
    soup2 = BeautifulSoup(response, "html.parser")
    links2 = soup2.find_all("a", href=re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+"))
    linksfromsubcategories = ([link["href"] for link in links2])
    linksfromsubcategories = [string + x for x in linksfromsubcategories]
    for link in linksfromsubcategories:
        response = urllib.request.urlopen(link)
        soup3 = BeautifulSoup(response, "html.parser")
        links3 = soup3.find_all("a", href=re.compile(r"ExpExhibitorList\.aspx\?categoryno=[0-9]+"))
        linksfromsubcategories2 = ([link["href"] for link in links3])
        linksfromsubcategories2 = [string + x for x in linksfromsubcategories2]
        for link in linksfromsubcategories2:
            response2 = urllib.request.urlopen(link)
            soup4 = BeautifulSoup(response2, "html.parser")
            companylink = soup4.find_all("a", href=re.compile(r"expCompany\.aspx\?corpid=[0-9]+"))
            companylink = ([link["href"] for link in companylink])
            companylink = [string + x for x in companylink]
            for link in companylink:
                response3 = urllib.request.urlopen(link)
                soup5 = BeautifulSoup(response3, "html.parser")
                companydetail = soup5.find_all("div", id="contact")
                for element in companydetail:
                    companyname = element.a[0].get_text()
                    print (companyname)
                    companyaddress = element.a[1].get_text()
                    print (companyaddress)

And I am getting this error:

Traceback (most recent call last):
  File "D:\python\phase3.py", line 54, in <module>
    lis = companydetail.find_all('li')
AttributeError: 'ResultSet' object has no attribute 'find_all'
Mauro Baraldi
Aman Kumar
  • It says there's an error on line 54, but you've only included 37 lines, none of which contain the code that is throwing the error. – wpercy Feb 26 '16 at 19:47

1 Answer

companydetail is a ResultSet. That is to say, it's an iterable object that contains many elements (like a list). The error occurs because you are calling .find_all() on the ResultSet itself rather than on the elements inside it. Iterate through the ResultSet and call find_all() on each element:

for d in companydetail:
    lis = d.find_all('li')

Or, to get a flat list of every li in companydetail using a list comprehension:

lis = [li for d in companydetail for li in d.find_all('li')]
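
To try this without hitting the site, here is a minimal self-contained sketch. The HTML string is made up for illustration (the real company page may be structured differently); only the find_all() and get_text() calls mirror the code in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a company page; the real layout may differ.
html = """
<div id="contact">
  <ul>
    <li>Example Glass Co.</li>
    <li>123 Example Road</li>
  </ul>
</div>
<div id="contact">
  <ul>
    <li>sales@example.com</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet: a list-like collection of Tag objects.
companydetail = soup.find_all("div", id="contact")

# Call find_all() on each Tag (not on the ResultSet) and flatten the results.
li_texts = [li.get_text(strip=True)
            for d in companydetail
            for li in d.find_all("li")]

print(li_texts)
```

Note the order of the two for clauses: the outer loop (over companydetail) comes first, just as it would in the nested for-loop version.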
wpercy
  • What do you mean by "getting the list twice"? – wpercy Feb 26 '16 at 20:10
  • From the li tags I get company details such as name and email ID, but each name and email ID appears twice. Maybe I am scraping the same URL twice or something? – Aman Kumar Feb 26 '16 at 20:18
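
If the duplicates do come from the same URL appearing more than once in a link list (an assumption; the thread does not confirm the cause), one option is to drop repeated links before fetching them. A rough sketch, where unique_in_order is a hypothetical helper and the URLs are placeholders:

```python
def unique_in_order(urls):
    """Return urls with repeats removed, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Placeholder links; in the question these would come from find_all().
companylink = [
    "http://example.com/expCompany.aspx?corpid=1",
    "http://example.com/expCompany.aspx?corpid=2",
    "http://example.com/expCompany.aspx?corpid=1",
]

companylink = unique_in_order(companylink)
print(companylink)
```

The same helper could be applied to each of the category link lists before looping over them, so that no page is requested twice.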