
I am trying to extract several fields from each page, but the last field raises an error. I want to save all the fields to an Excel sheet.

I have tried using BeautifulSoup to extract the data, but it fails on that last field with the error below:

Traceback (most recent call last):
  File "C:/Users/acer/AppData/Local/Programs/Python/Python37/agri.py", line 30, in <module>
    specimens = soup2.find('h3',class_='trigger expanded').find_next_sibling('div',class_='collapsefaq-content').text
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
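
The error means soup2.find('h3', class_='trigger expanded') returned None for a page that has no such heading, so there is nothing to call find_next_sibling on. A minimal sketch of a guard, using the same names as the question's code:

heading = soup2.find('h3', class_='trigger expanded')
if heading is not None:
    sibling = heading.find_next_sibling('div', class_='collapsefaq-content')
    specimens = sibling.text if sibling is not None else 'None'
else:
    specimens = 'None'  # this page has no specimens section

The full script that produces the error: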

from bs4 import BeautifulSoup
import requests

page1 = requests.get('http://www.agriculture.gov.au/pests-diseases-weeds/plant#identify-pests-diseases')

soup1 = BeautifulSoup(page1.text,'lxml')

for lis in soup1.find_all('li',class_='flex-item'):
    diseases = lis.find('img').next_sibling
    print("Diseases: " + diseases)
    image_link = lis.find('img')['src']
    print("Image_Link:http://www.agriculture.gov.au" + image_link)
    links = lis.find('a')['href']
    if links.startswith("http://"):
        link = links
    else:
        link = "http://www.agriculture.gov.au" + links
    page2 = requests.get(link)
    soup2 = BeautifulSoup(page2.text,'lxml')

    try:
        origin = soup2.find('strong',string='Origin: ').next_sibling
        print("Origin: " + origin)
    except:
        pass
    try:
        imported = soup2.find('strong',string='Pathways: ').next_sibling
        print("Imported: " + imported)
    except:
        pass 
    # line 30 in the traceback: fails when a page has no <h3 class="trigger expanded">
    specimens = soup2.find('h3',class_='trigger expanded').find_next_sibling('div',class_='collapsefaq-content').text
    print("Specimens: " + specimens)

I want to extract that last field and save all the fields into an Excel sheet using Python. Please help.
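
A minimal sketch of the saving step, assuming the fields are collected into a list of dicts inside the loop (pandas and openpyxl are assumptions, not part of the original code):

import pandas as pd

rows = []
# inside the for loop, after extracting each field:
# rows.append({'disease': diseases, 'image_link': image_link, 'link': link,
#              'origin': origin, 'imported': imported, 'specimens': specimens})

df = pd.DataFrame(rows)
df.to_excel('diseases.xlsx', index=False)  # to_excel needs an engine such as openpyxl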

2 Answers


Minor typo:

   data2,append("Image_Link:http://www.agriculture.gov.au" + image_link)

Should be:

   data2.append("Image_Link:http://www.agriculture.gov.au" + image_link) #period instead of a comma
– Nick Vitha

The site seems to require headers to avoid blocked requests, and there is not a specimens section on every page. The following shows one possible way of handling the specimen info for each page:

from bs4 import BeautifulSoup
import requests
import pandas as pd

base = 'http://www.agriculture.gov.au'
headers = {'User-Agent' : 'Mozilla/5.0'}
specimens = []
with requests.Session() as s:
    r = s.get('http://www.agriculture.gov.au/pests-diseases-weeds/plant#identify-pests-diseases', headers = headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # one pass over the landing page: (name, absolute image url, absolute detail link) per item
    names, images, links = zip(*[ ( item.text.strip(), base + item.select_one('img')['src'] , item['href'] if 'http' in item['href'] else base + item['href']) for item in soup.select('.flex-item > a') ])
    for link in links:
        r = s.get(link)
        soup = BeautifulSoup(r.content, 'lxml')
        if soup.select_one('.trigger'): # could also use if soup.select_one('.trigger:nth-of-type(3) + div'):
            info = soup.select_one('.trigger:nth-of-type(3) + div').text
        else:
            info = 'None'
        specimens.append(info)

df = pd.DataFrame([names, images, links, specimens])
df = df.transpose()
df.columns  = ['names', 'image_link', 'link', 'specimen']
df.to_csv(r"C:\Users\User\Desktop\Data.csv", sep=',', encoding='utf-8-sig',index = False ) 

I have run the above lots of times without a problem; however, you can always switch my existence test to a try/except block:

from bs4 import BeautifulSoup
import requests
import pandas as pd

base = 'http://www.agriculture.gov.au'
headers = {'User-Agent' : 'Mozilla/5.0'}
specimens = []
with requests.Session() as s:
    r = s.get('http://www.agriculture.gov.au/pests-diseases-weeds/plant#identify-pests-diseases', headers = headers)
    soup = BeautifulSoup(r.content, 'lxml')
    names, images, links = zip(*[ ( item.text.strip(), base + item.select_one('img')['src'] , item['href'] if 'http' in item['href'] else base + item['href']) for item in soup.select('.flex-item > a') ])
    for link in links:
        r = s.get(link)
        soup = BeautifulSoup(r.content, 'lxml')
        try:
            info = soup.select_one('.trigger:nth-of-type(3) + div').text
        except:
            info = 'None'
            print(link)
        specimens.append(info)

df = pd.DataFrame([names, images, links, specimens])
df = df.transpose()
df.columns  = ['names', 'image_link', 'link', 'specimen']
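
Since the question asks for Excel rather than csv, the final write could instead be the following (a hedged variant; to_excel is a standard pandas call but needs an engine such as openpyxl or xlsxwriter installed):

df.to_excel(r"C:\Users\User\Desktop\Data.xlsx", index=False)  # .xlsx instead of .csv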

Example of csv output: [screenshot of the resulting DataFrame omitted]

– QHarr
  • Getting the error: "File "C:\Users\acer\AppData\Local\Programs\Python\Python37\agri.py", line 16, in info = soup.select_one('.trigger:nth-of-type(3) + div').text AttributeError: 'NoneType' object has no attribute 'text'" –  Apr 20 '19 at 02:21
  • Did you run it exactly as is above (only changing the output file path)? I have run it 5 times in a row with no problem. I have added a second version using try/except instead. – QHarr Apr 20 '19 at 05:51
  • I ran your above code; it prints the links on the output screen and then fails on df.to_csv(r"‪C:\Users\acer\Desktop\Data.csv", sep=',', encoding='utf-8-sig', index=False) with "OSError: [Errno 22] Invalid argument: '\u202aC:\\Users\\acer\\Desktop\\Data.csv'" –  Apr 20 '19 at 06:18
  • What OS are you on (that line is written for Windows paths)? Also, try simply print(df) and comment out the csv write line at the end. Does df print out as expected? – QHarr Apr 20 '19 at 06:28
  • I am using Windows 10. After print(df) I get all the names in the names column, the second column contains 3 dots, and the specimen column contains None. –  Apr 20 '19 at 08:52
  • Not sure what else to suggest, as it works perfectly every time for me. – QHarr Apr 20 '19 at 09:02
  • https://stackoverflow.com/questions/25584124/oserror-errno-22-invalid-argument-when-use-open-in-python – QHarr Apr 22 '19 at 12:39
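
The linked answer traces Errno 22 to an invisible left-to-right-embedding character (U+202A) pasted into the path, which matches the '\u202a' visible in the error above. A minimal sketch of stripping it, using the path from that comment:

path = "\u202aC:\\Users\\acer\\Desktop\\Data.csv"
path = path.replace("\u202a", "")  # remove the invisible U+202A before using the path
df.to_csv(path, sep=',', encoding='utf-8-sig', index=False)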