I am trying to scrape the news pages of a German party and store all the information in a dataframe (Python beginner). There is only one small problem: when I want to store the whole text or even the date in the dataframe, it seems that only the last element of the text (`<p> ... </p>`) gets stored in the row. I think the problem occurs because of the way I iterate over the loops.

import pandas as pd
import requests
from bs4 import BeautifulSoup
from time import sleep
from random import randint
from time import time
import numpy as np
from urllib.request import urlopen

data = pd.DataFrame()
teaser = []
title = []
content = []
childrenUrls = []
mainPage = "https://www.fdp.de"
start_time = time()
counter = 0

#for i in list(map(lambda x: x+1, range(3))):
for i in range(3):

    counter = counter + 1
    sleep(randint(1,3))
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(counter, counter/elapsed_time))
    url = "https://www.fdp.de/seite/aktuelles?page="+str(i)
    #print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')

    uls = soup.find_all('div', {'class': 'field-title'})

    for ul in uls:
        for li in ul.find_all('h2'):
            for link in li.find_all('a'):
                url = link.get('href')
                contents = link.text
                print(contents)
                childrenUrls = mainPage+url
                print(childrenUrls)

                childrenPages = urlopen(childrenUrls)
                soupCP = BeautifulSoup(childrenPages, 'html.parser')

                #content1 = soupCP.findAll('p').get_text()
                #print(content1)

                for content in soupCP.findAll('p'):
                    #for message in content.get('p'):
                    content = content.text.strip()
                    print(content)

                for teaser in soupCP.find_all('div', class_ = 'field-teaser'):
                    teaser = teaser.text.strip()
                    print(teaser)

                for title in soupCP.find_all('title'):
                    title = title.text.strip()
                    print(title)

                df = pd.DataFrame(
                    {'teaser': teaser,
                     'title': title,
                     'content': content}, index=[counter])

                data = pd.concat([data, df])
    #join(str(v) for v in value_list)
Daniel

1 Answer

You have to save the text from each loop iteration in a list, not in a simple string variable. On each iteration, your code reassigns the variable, which loses the previous data.
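To see the difference in isolation, here is a minimal sketch with made-up values (no scraping involved) comparing overwriting a variable with appending to a list:

```python
paragraphs = ['first', 'second', 'third']

# Overwriting: after the loop, only the last value survives
content = ''
for p in paragraphs:
    content = p
print(content)  # third

# Appending: every value is kept
content = []
for p in paragraphs:
    content.append(p)
print(content)  # ['first', 'second', 'third']
```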

A good approach is to use list comprehensions here. Replace the last three for loops of your code with this:

content = [x.text.strip() for x in soupCP.find_all('p')]
teaser = [x.text.strip() for x in soupCP.find_all('div', class_='field-teaser')]
title = [x.text.strip() for x in soupCP.find_all('title')]

df = pd.DataFrame(
    {'teaser': teaser,
     'title': title,
     'content': content}, index=[counter])

data = pd.concat([data, df])

A simple explanation of list comprehension:

The line content = [x.text.strip() for x in soupCP.find_all('p')] is equivalent to:

content = []
for x in soupCP.find_all('p'):
    content.append(x.text.strip())
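Note that `pd.DataFrame({...})` expects all columns to have the same length, and a page usually yields many `<p>` tags but only one `<title>`. One way around that is to join each list into a single string before building the row. A minimal sketch with made-up values (the scraped lists here are hypothetical):

```python
import pandas as pd

# Hypothetical results scraped from one article page
content = ['First paragraph.', 'Second paragraph.']
teaser = ['Short teaser']
title = ['Article title']

# Joining each list produces one scalar per column, so all lengths match
df = pd.DataFrame(
    {'teaser': ' '.join(teaser),
     'title': ' '.join(title),
     'content': ' '.join(content)}, index=[0])
print(df.shape)  # (1, 3)
```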
Keyur Potdar
  • Okay, thank you very much for your help! I replaced my code with your suggestions, but another problem occurs, which may be related to the index or array of the dataframe. This is the displayed error: `ValueError: Shape of passed values is (3, 5), indices imply (3, 1)` – Daniel Mar 27 '18 at 15:54
  • Have a look at this [question](https://stackoverflow.com/questions/27719407/pandas-concat-valueerror-shape-of-passed-values-is-blah-indices-imply-blah2). I think you have to replace `.concat` with `.join` – Keyur Potdar Mar 27 '18 at 16:00
  • 1
    If none of the solutions on SO work, I think you should ask a new question. This question was about the problem with `BeautifulSoup` (which is solved), and the new one is about `pandas`. If you want, you can link this question to the other. I don't have much experience with `pandas`, and if you ask a new question specific to the issue, you'll get better answers from others. – Keyur Potdar Mar 27 '18 at 16:57