
I am trying to extract several fields from each page, but the last field raises an error. I want to save all the fields to an Excel sheet.

I have tried using BeautifulSoup to extract the data, but it fails on that last field with the error below:

Traceback (most recent call last):
  File "C:/Users/acer/AppData/Local/Programs/Python/Python37/agri.py", line 30, in <module>
    specimens = soup2.find('h3',class_='trigger expanded').find_next_sibling('div',class_='collapsefaq-content').text
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
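
The error means soup2.find('h3', class_='trigger expanded') returned None for a page that has no such heading, so there is nothing to call find_next_sibling on. A minimal sketch of a guard, using the same names as the question's code:

heading = soup2.find('h3', class_='trigger expanded')
if heading is not None:
    sibling = heading.find_next_sibling('div', class_='collapsefaq-content')
    specimens = sibling.text if sibling is not None else 'None'
else:
    specimens = 'None'  # this page has no specimens section

The full script that produces the error: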

from bs4 import BeautifulSoup
import requests

page1 = requests.get('http://www.agriculture.gov.au/pests-diseases-weeds/plant#identify-pests-diseases')

soup1 = BeautifulSoup(page1.text,'lxml')

for lis in soup1.find_all('li',class_='flex-item'):
    diseases = lis.find('img').next_sibling
    print("Diseases: " + diseases)
    image_link = lis.find('img')['src']
    print("Image_Link:http://www.agriculture.gov.au" + image_link)
    links = lis.find('a')['href']
    if links.startswith("http://"):
        link = links
    else:
        link = "http://www.agriculture.gov.au" + links
    page2 = requests.get(link)
    soup2 = BeautifulSoup(page2.text,'lxml')

    try:
        origin = soup2.find('strong',string='Origin: ').next_sibling
        print("Origin: " + origin)
    except:
        pass
    try:
        imported = soup2.find('strong',string='Pathways: ').next_sibling
        print("Imported: " + imported)
    except:
        pass 
    # line 30 in the traceback: fails when a page has no <h3 class="trigger expanded">
    specimens = soup2.find('h3',class_='trigger expanded').find_next_sibling('div',class_='collapsefaq-content').text
    print("Specimens: " + specimens)

I want to extract that last field and save all the fields into an Excel sheet using Python. Please help.
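
A minimal sketch of the saving step, assuming the fields are collected into a list of dicts inside the loop (pandas and openpyxl are assumptions, not part of the original code):

import pandas as pd

rows = []
# inside the for loop, after extracting each field:
# rows.append({'disease': diseases, 'image_link': image_link, 'link': link,
#              'origin': origin, 'imported': imported, 'specimens': specimens})

df = pd.DataFrame(rows)
df.to_excel('diseases.xlsx', index=False)  # to_excel needs an engine such as openpyxl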

2 Answers


Minor typo:

   data2,append("Image_Link:http://www.agriculture.gov.au" + image_link)

Should be:

   data2.append("Image_Link:http://www.agriculture.gov.au" + image_link) #period instead of a comma
– Nick Vitha

The site seems to require headers to avoid blocked requests, and there is not a specimens section on every page. The following shows one possible way of handling the specimen info for each page:

from bs4 import BeautifulSoup
import requests
import pandas as pd

base = 'http://www.agriculture.gov.au'
headers = {'User-Agent' : 'Mozilla/5.0'}
specimens = []
with requests.Session() as s:
    r = s.get('http://www.agriculture.gov.au/pests-diseases-weeds/plant#identify-pests-diseases', headers = headers)
    soup = BeautifulSoup(r.content, 'lxml')
    # one pass over the landing page: (name, absolute image url, absolute detail link) per item
    names, images, links = zip(*[ ( item.text.strip(), base + item.select_one('img')['src'] , item['href'] if 'http' in item['href'] else base + item['href']) for item in soup.select('.flex-item > a') ])
    for link in links:
        r = s.get(link)
        soup = BeautifulSoup(r.content, 'lxml')
        if soup.select_one('.trigger'): # could also use if soup.select_one('.trigger:nth-of-type(3) + div'):
            info = soup.select_one('.trigger:nth-of-type(3) + div').text
        else:
            info = 'None'
        specimens.append(info)

df = pd.DataFrame([names, images, links, specimens])
df = df.transpose()
df.columns  = ['names', 'image_link', 'link', 'specimen']
df.to_csv(r"C:\Users\User\Desktop\Data.csv", sep=',', encoding='utf-8-sig',index = False ) 

I have run the above lots of times without a problem; however, you can always switch my existence test to a try/except block:

from bs4 import BeautifulSoup
import requests
import pandas as pd

base = 'http://www.agriculture.gov.au'
headers = {'User-Agent' : 'Mozilla/5.0'}
specimens = []
with requests.Session() as s:
    r = s.get('http://www.agriculture.gov.au/pests-diseases-weeds/plant#identify-pests-diseases', headers = headers)
    soup = BeautifulSoup(r.content, 'lxml')
    names, images, links = zip(*[ ( item.text.strip(), base + item.select_one('img')['src'] , item['href'] if 'http' in item['href'] else base + item['href']) for item in soup.select('.flex-item > a') ])
    for link in links:
        r = s.get(link)
        soup = BeautifulSoup(r.content, 'lxml')
        try:
            info = soup.select_one('.trigger:nth-of-type(3) + div').text
        except:
            info = 'None'
            print(link)
        specimens.append(info)

df = pd.DataFrame([names, images, links, specimens])
df = df.transpose()
df.columns  = ['names', 'image_link', 'link', 'specimen']
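
Since the question asks for Excel rather than csv, the final write could instead be the following (a hedged variant; to_excel is a standard pandas call but needs an engine such as openpyxl or xlsxwriter installed):

df.to_excel(r"C:\Users\User\Desktop\Data.xlsx", index=False)  # .xlsx instead of .csv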

Example of csv output: [screenshot of the resulting DataFrame omitted]

– QHarr
  • Getting the error: "File "C:\Users\acer\AppData\Local\Programs\Python\Python37\agri.py", line 16, in info = soup.select_one('.trigger:nth-of-type(3) + div').text AttributeError: 'NoneType' object has no attribute 'text'" –  Apr 20 '19 at 02:21
  • Did you run it exactly as is above (only changing the output file path)? I have run it 5 times in a row with no problem. I have added a second version using try/except instead. – QHarr Apr 20 '19 at 05:51
  • I ran your above code; it prints the links on the output screen and then fails on df.to_csv(r"‪C:\Users\acer\Desktop\Data.csv", sep=',', encoding='utf-8-sig', index=False) with "OSError: [Errno 22] Invalid argument: '\u202aC:\\Users\\acer\\Desktop\\Data.csv'" –  Apr 20 '19 at 06:18
  • What OS are you on (that line is written for Windows paths)? Also, try simply print(df) and comment out the csv write line at the end. Does df print out as expected? – QHarr Apr 20 '19 at 06:28
  • I am using Windows 10. After print(df) I get all the names in the names column, the second column contains 3 dots, and the specimen column contains None. –  Apr 20 '19 at 08:52
  • Not sure what else to suggest, as it works perfectly every time for me. – QHarr Apr 20 '19 at 09:02
  • https://stackoverflow.com/questions/25584124/oserror-errno-22-invalid-argument-when-use-open-in-python – QHarr Apr 22 '19 at 12:39
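
The linked answer traces Errno 22 to an invisible left-to-right-embedding character (U+202A) pasted into the path, which matches the '\u202a' visible in the error above. A minimal sketch of stripping it, using the path from that comment:

path = "\u202aC:\\Users\\acer\\Desktop\\Data.csv"
path = path.replace("\u202a", "")  # remove the invisible U+202A before using the path
df.to_csv(path, sep=',', encoding='utf-8-sig', index=False)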