I am working on a web scraping project that involves scraping URLs from a website based on a search term, storing them in a CSV file (under a single column), and finally scraping the information from these links and storing it in a text file.
I am currently stuck on two issues:
- Only the first few links are scraped. I'm unable to extract links from the other pages (the website has a "Load More" button), and I don't know how to reproduce the XHR request it triggers in my code.
- The second half of the code reads only the last link (stored in the CSV file), scrapes the respective information, and stores it in a text file. It does not go through all the links from the beginning. I am unable to figure out where I have gone wrong in terms of file handling and f.seek(0).
    from pprint import pprint
    import requests
    import csv
    import urllib2
    from bs4 import BeautifulSoup

    def get_url_for_search_key(search_key):
        # fetch the first page of search results for the given term
        base_url = 'http://www.marketing-interactive.com/'
        response = requests.get(base_url + '?s=' + search_key)
        soup = BeautifulSoup(response.content, "lxml")
        # article links on this site carry rel="bookmark"
        return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]

    pprint(get_url_for_search_key('digital advertising'))

    # write one URL per line, i.e. a single CSV column
    with open('ctp_output.csv', 'w+') as f:
        f.write('\n'.join(get_url_for_search_key('digital advertising')))
        f.seek(0)
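For the first issue, one pattern worth checking: the ?s= search URL looks like WordPress, and on WordPress-style sites a "Load More" button often just fetches /page/2/?s=..., /page/3/?s=..., and so on. That URL scheme is a guess on my part; the actual XHR request should be confirmed in the browser's DevTools Network tab (XHR filter) when the button is clicked. A minimal sketch under that assumption:

    import requests
    from bs4 import BeautifulSoup

    def get_urls_all_pages(search_key, max_pages=10):
        # NOTE: the /page/<n>/ scheme is an assumption (typical of
        # WordPress search results); confirm the real request in the
        # DevTools Network tab when clicking "Load More".
        base_url = 'http://www.marketing-interactive.com/'
        urls = []
        for page in range(1, max_pages + 1):
            response = requests.get(base_url + 'page/%d/' % page,
                                    params={'s': search_key})
            if response.status_code != 200:
                break  # no such page, stop paginating
            soup = BeautifulSoup(response.content, 'lxml')
            links = [a['href'] for a in soup.findAll('a', {'rel': 'bookmark'})]
            if not links:
                break  # page exists but carries no results
            urls.extend(links)
        return urls

If the button instead fires a POST to an AJAX endpoint, the same loop applies with requests.post and whatever form data DevTools shows.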
Reading the CSV file, scraping the respective content, and storing it in a .txt file:
    with open('ctp_output.csv', 'rb') as f1:
        f1.seek(0)
        reader = csv.reader(f1)
        for line in reader:
            url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))
        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
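For the second issue, the symptom points at indentation rather than file handling: everything after url = line[0] sits outside the for loop, so the loop finishes first and only the last url is ever scraped; the f.seek(0) calls are also unnecessary on a freshly opened file. A minimal sketch of the corrected loop (Python 2, matching the urllib2 usage above):

    import csv
    import urllib2
    from bs4 import BeautifulSoup

    with open('ctp_output.csv', 'rb') as f1:
        reader = csv.reader(f1)
        # open the output once, then append each article's paragraphs
        with open('ctp_output.txt', 'a+') as f2:
            for line in reader:
                url = line[0]
                # scrape inside the loop, once per CSV row
                soup = BeautifulSoup(urllib2.urlopen(url), 'lxml')
                for tag in soup.find_all('p'):
                    f2.write(tag.text.encode('utf-8') + '\n')

Opening ctp_output.txt once outside the loop avoids reopening the file for every row while keeping the append behavior.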