I'm using Python 3 to scrape webpages listed in a Pandas data frame I've created from a CSV file containing the source URLs of 63,067 webpages. The for-loop is supposed to scrape news articles from those pages for a project and place them into giant text files for cleaning later on.

I'm a bit rusty with Python, and this project is the reason I've started programming in it again. I haven't used BeautifulSoup before, so I'm having some difficulty; I only just got the for-loop working over the Pandas data frame with BeautifulSoup.

This is for one of the three data sets I'm using; the other two are loaded in the code below because they go through the same process, which is why I'm mentioning this.

from bs4 import BeautifulSoup as BS
import requests
import pandas as pd

# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapping is needed
negativedf = pd.read_csv('negativedata.csv')
positivedf = pd.read_csv('positivedata.csv')
neutraldf = pd.read_csv('neutraldata.csv')


negativeURLS = negativedf[['sourceURL']]

for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']      # iterrows() yields (index, row) pairs
    negative = requests.get(url)    # fetch the page
    negative_content = negative.text

    negativesoup = BS(negative_content, "lxml")
    for text in negativesoup.find_all('a', href=True):
        text.append(text.get('href'))   # append each anchor's href to the tag itself

I think I finally got the for-loop to run through all of the source URLs. However, I then get this error:

Traceback (most recent call last):
  File "./datacollection.py", line 18, in <module>
    negative = requests.get(url)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 140, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

I know the issue occurs when I'm requesting the URLs, but with this many webpages in the data frame I'm not sure which URL (or whether any single URL) is the problem. Is the problem a specific URL, or do I simply have too many and should use a different package like Scrapy?
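One idea I had for narrowing it down is to wrap each request in a try/except so the failing URL gets printed instead of stopping the whole run; something roughly like this (the timeout value is an arbitrary guess):

import requests

for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']
    try:
        negative = requests.get(url, timeout=10)    # timeout is just a guess
    except requests.exceptions.TooManyRedirects:
        print('redirect loop at:', url)             # prints the culprit URL
        continue
    except requests.exceptions.RequestException as err:
        print('request failed for', url, '-', err)  # any other network error
        continue
    # ...parse negative.text with BeautifulSoup as before...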

1 Answer

I would suggest using a module like mechanize for scraping. mechanize has built-in handling for robots.txt and is better suited when your application scrapes data from URLs across many different websites. In your case, though, the redirects are probably caused by the requests not having a User-Agent header, as mentioned here (https://github.com/requests/requests/issues/3596). And here's how you set headers with requests (Sending "User-agent" using Requests library in Python).
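For example, a minimal sketch (the User-Agent string below is only an illustrative browser value, nothing about it is special):

import requests

# a browser-like User-Agent; the exact string is just an example
headers = {'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/63.0.3239.84 Safari/537.36')}

negative = requests.get(url, headers=headers)   # url comes from your existing loop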

P.S.: mechanize is only available for Python 2.x. If you wish to use Python 3.x, there are other options (Installing mechanize for python 3.4).
