I am working on a web scraping project that involves scraping URLs from a website based on a search term, storing them in a CSV file (under a single column), and finally scraping the information from those links and storing it in a text file.

I am currently stuck on two issues.

  1. Only the first few links are scraped. I'm unable to extract links from the other pages (the website has a "load more" button), and I don't know how to use the XHR request in my code.
  2. The second half of the code reads only the last link (stored in the CSV file), scrapes the respective information, and stores it in a text file. It does not go through all the links from the beginning. I am unable to figure out where I have gone wrong in terms of file handling and `f.seek(0)`.

    from pprint import pprint
    import requests
    import lxml
    import csv
    import urllib2
    from bs4 import BeautifulSoup
    
    def get_url_for_search_key(search_key):
        base_url = 'http://www.marketing-interactive.com/'
        response = requests.get(base_url + '?s=' + search_key)
        soup = BeautifulSoup(response.content, "lxml")
        return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
        results = soup.findAll('a', {'rel': 'bookmark'})
    
    for r in results:
        if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
            newlinks.append(r["href"])
    
    pprint(get_url_for_search_key('digital advertising'))
    with open('ctp_output.csv', 'w+') as f:
        f.write('\n'.join(get_url_for_search_key('digital advertising')))
        f.seek(0)  
    

Reading the CSV file, scraping the respective content, and storing it in a .txt file:

    with open('ctp_output.csv', 'rb') as f1:
        f1.seek(0)
        reader = csv.reader(f1)

        for line in reader:
            url = line[0]
            soup = BeautifulSoup(urllib2.urlopen(url))

            with open('ctp_output.txt', 'a+') as f2:
                for tag in soup.find_all('p'):
                    f2.write(tag.text.encode('utf-8') + '\n')
    
Rrj17
  • Why use BeautifulSoup and not just lxml? https://stackoverflow.com/questions/5493514/webscraping-with-beautifulsoup-or-lxml-html --edit-- because it has been upgraded: https://stackoverflow.com/questions/4967103/beautifulsoup-and-lxml-html-what-to-prefer#answer-4967121 – RvdBerg Jul 19 '17 at 09:19
  • during `get_url_for_search_key` you have a `return` statement in the middle, which means the rest of that function (under `return`) is always ignored... – Ofer Sadan Jul 19 '17 at 09:40
  • @OferSadan I tried changing the placement of the return statement, but I'm unable to append new links. Only the first ten links are scraped. – Rrj17 Jul 20 '17 at 06:12
  • @RvdBerg Thanks for that. I got a fair idea of it, but I don't know what kind of changes I have to make in my code. – Rrj17 Jul 20 '17 at 06:13

2 Answers


Regarding your second problem, your file mode is off: you'll need to change `w+` to `a+`. In addition, your indentation is off.

    # Read the list of URLs back from the CSV (one URL per row).
    with open('ctp_output.csv', 'rb') as f1:
        f1.seek(0)
        reader = csv.reader(f1)

        for line in reader:
            url = line[0]
            soup = BeautifulSoup(urllib2.urlopen(url))

            # Open the text file in append mode so each article's paragraphs
            # are added after the previous ones instead of overwriting them.
            with open('ctp_output.txt', 'a+') as f2:
                for tag in soup.find_all('p'):
                    f2.write(tag.text.encode('utf-8') + '\n')

The `+` suffix will create the file if it doesn't exist. However, `w+` will erase all contents before writing at each iteration. `a+`, on the other hand, will append to the file if it exists, or create it if it does not.
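To make the difference concrete, here is a tiny standalone sketch (the file name `demo.txt` is just an example and has nothing to do with your scraper):

    # Standalone demo of 'w+' vs 'a+' (the file name is arbitrary).
    with open('demo.txt', 'w+') as f:
        f.write('first\n')
    with open('demo.txt', 'w+') as f:    # 'w+' truncates: 'first' is lost
        f.write('second\n')
    with open('demo.txt', 'a+') as f:    # 'a+' keeps existing content and appends
        f.write('third\n')
    with open('demo.txt') as f:
        print(f.read())                  # prints "second" then "third"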

For your first problem, there's no option but to switch to something that can automate clicking browser buttons and the like, so you'd have to look at Selenium. The alternative is to manually search for that button, extract the URL from its href or text, and then make a second request. I leave that to you.
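For illustration, a minimal sketch of the Selenium route might look like this. The 'Load More' link text, the number of clicks, and the older `find_element_by_link_text` call are all assumptions; adjust them to the site's actual markup and your Selenium version:

    # Minimal Selenium sketch (assumes Chrome + chromedriver are installed,
    # and that the button is an anchor whose visible text is 'Load More').
    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('http://www.marketing-interactive.com/?s=digital+advertising')

    for _ in range(5):                                   # click "load more" a few times
        try:
            driver.find_element_by_link_text('Load More').click()
            time.sleep(2)                                # let the new results load
        except Exception:
            break                                        # button gone -> stop clicking

    soup = BeautifulSoup(driver.page_source, 'lxml')
    links = [a['href'] for a in soup.findAll('a', {'rel': 'bookmark'})]
    driver.quit()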

cs95

If there are more pages of results, observe what changes in the URL when you manually click through to the next page. I can guarantee 100% that a small piece of the URL will have either a subpage number or some other variable encoded in it that relates strictly to the subpage. Once you have figured out the pattern, just fit it into a for loop where you `.format()` the page number into the URL you want to scrape, and keep navigating this way through all the subpages of the results.

As for what the last subpage number is: you have to inspect the HTML of the site you are scraping, find the element responsible for it, and extract its value. See if there is a "class": "Page" or equivalent in their markup; it may contain the number you will need for your for loop.
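A rough sketch of that approach follows. The URL template, the `page` query parameter, and the pagination class are assumptions; substitute whatever you actually see when you click through the results or look at the page source:

    # Sketch: paginate through the search results by rewriting the URL.
    # 'page_url' and the 'page' class are hypothetical placeholders.
    import requests
    from bs4 import BeautifulSoup

    search_key = 'digital+advertising'
    page_url = 'http://www.marketing-interactive.com/?s={}&page={}'   # assumed pattern

    # 1. Work out the last page number from the first results page (assumed markup).
    first = BeautifulSoup(requests.get(page_url.format(search_key, 1)).content, 'lxml')
    numbers = [int(a.text) for a in first.findAll('a', {'class': 'page'})
               if a.text.strip().isdigit()]
    last_page = max(numbers) if numbers else 1

    # 2. Loop over every subpage and collect the article links.
    all_links = []
    for page in range(1, last_page + 1):
        soup = BeautifulSoup(requests.get(page_url.format(search_key, page)).content, 'lxml')
        all_links.extend(a['href'] for a in soup.findAll('a', {'rel': 'bookmark'}))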

Unfortunately there is no magic "navigate through the sub-results" option... but this gets pretty close :).

Good luck.

  • This is the link I got under 'XHR and fetch' after pressing the load more button: http://www.marketing-interactive.com/wp-content/themes/MI/library/inc/loop_handler.php?pageNumber=2&postType=search&searchValue=digital+advertising Any idea as to how I can use this in my code? – Rrj17 Jul 21 '17 at 07:26