
I have successfully crawled the Headline and the Links.

I would like to replace the Summary column with the main article text from the link (since the Title and Summary are the same anyway).

link = "https://www.vanglaini.org" + article.a['href']

(e.g. https://www.vanglaini.org/tualchhung/103834)

Please help me modify my code.

Below is my code.

import pandas as pd
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')

list_with_headlines = []
list_with_summaries = []
list_with_links = []

for article in soup.find_all('article'):
    if article.a is None:
        continue
    headline = article.a.text.strip()
    summary = article.p.text.strip()
    link = "https://www.vanglaini.org" + article.a['href']
    list_with_headlines.append(headline)
    list_with_summaries.append(summary)
    list_with_links.append(link)

news_csv = pd.DataFrame({
    'Headline': list_with_headlines,
    'Summary': list_with_summaries,
    'Link' : list_with_links,
})

print(news_csv)
news_csv.to_csv('test.csv')

1 Answer


Just make the request again inside the for loop and get the tag's text.

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Fetch and parse the front page
source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')

list_with_headlines = []
list_with_summaries = []
list_with_links = []

for article in soup.find_all('article'):
    if article.a is None:
        continue
    headline = article.a.text.strip()
    link = "https://www.vanglaini.org" + article.a['href']
    list_with_headlines.append(headline)
    list_with_links.append(link)
    # Request the article page itself and store its body text instead of the summary
    article_soup = BeautifulSoup(requests.get(link).text, 'lxml')
    list_with_summaries.append(article_soup.select_one(".pagesContent").text.strip())

news_csv = pd.DataFrame({
    'Headline': list_with_headlines,
    'Summary': list_with_summaries,
    'Link': list_with_links,
})

print(news_csv)
news_csv.to_csv('test.csv')

The CSV will then contain the full article text in the Summary column for each headline and link.
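One thing to watch for: if an article page has no .pagesContent element, select_one returns None and .text raises an AttributeError, which stops the whole crawl. A small helper with a guard (just a sketch, assuming the same selector as above) avoids that:

import requests
from bs4 import BeautifulSoup

def fetch_article_text(link):
    """Return the article body text, or an empty string if .pagesContent is missing."""
    page = BeautifulSoup(requests.get(link).text, 'lxml')
    content = page.select_one(".pagesContent")  # selector taken from the code above
    return content.text.strip() if content is not None else ""

# inside the loop, instead of the two summary lines:
# list_with_summaries.append(fetch_article_text(link))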

  • @anddrewww Is this what you are after? – KunduK Nov 19 '19 at 16:06
  • Yes, thanks a lot. I will ask if I have any more questions, and I hope you will answer. (I am new to Python.) –  Nov 20 '19 at 02:22
  • How can I add new articles to the CSV in one single file? I mean, keep crawling every day and add the new info to the same CSV file. –  Nov 25 '19 at 15:14
  • I have this error: https://stackoverflow.com/questions/59279064/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-500 –  Dec 11 '19 at 05:01
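Regarding the follow-up questions in the comments: a sketch of how daily runs could keep adding to one CSV file (assuming the Link column uniquely identifies an article, and reusing the test.csv name from the answer) is to merge the new DataFrame with the existing file before writing it back. Passing encoding='utf-8' explicitly is also the usual fix for the UnicodeEncodeError linked in the last comment.

import os
import pandas as pd

csv_path = 'test.csv'  # same file name as in the answer above

# news_csv is the DataFrame built by the scraping loop above
if os.path.exists(csv_path):
    old = pd.read_csv(csv_path)
    combined = pd.concat([old, news_csv], ignore_index=True)
    # keep one row per article, assuming Link uniquely identifies it
    combined = combined.drop_duplicates(subset='Link', keep='first')
else:
    combined = news_csv

# index=False avoids an extra index column; encoding='utf-8' avoids the ASCII encode error
combined.to_csv(csv_path, index=False, encoding='utf-8')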