I'm scraping the Indonesian news website detik.com. When I scrape the article text from each news link, the output still contains some HTML elements (a screenshot of the output was attached here).

I want to remove these elements so that the output is just the article text. I already tried .strip(), but it has no effect on the output. This is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch the list of popular articles
detik = requests.get('https://www.detik.com/terpopuler')
beautify = BeautifulSoup(detik.content, 'html5lib')

# Each entry in the list is an <article> element with this CSS class
news = beautify.find_all('article', class_='list-content__item')
arti = []
for each in news:
  try:
    title = each.find('h3', class_='media__title').text
    lnk = each.a.get('href')
    # Fetch and parse the article page itself
    r = requests.get(lnk)
    soup = BeautifulSoup(r.text, 'html5lib')
    content = soup.find('div', class_='detail__body-text itp_bodycontent').text.strip()

    print(title)
    print(lnk)

    arti.append({
      'Headline': title,
      'Content': content,
      'Link': lnk
    })
  except (AttributeError, requests.RequestException):
    # Skip entries that lack the expected elements or fail to download
    continue

# Save the collected articles to a CSV file
df = pd.DataFrame(arti)
df.to_csv('detik.csv', index=False)

Any help would be appreciated.

Yoel Regen

1 Answer

You might be dealing with invalid tags. This thread might be useful: https://stackoverflow.com/a/8439761/6100602
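
One possible approach, as a minimal sketch: .strip() only trims whitespace at the start and end of a string, so it cannot remove anything embedded in the middle of the text. If the leftover content comes from tags such as script or style inside the article container (an assumption, since the screenshot is not available here), those tags can be removed with decompose() before the text is extracted. The SAMPLE_HTML string, the extract_article_text helper, and the list of unwanted tag names are hypothetical and only for illustration; the sketch uses the built-in html.parser so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded article page
SAMPLE_HTML = """
<div class="detail__body-text itp_bodycontent">
  <p>First paragraph.</p>
  <script>var ad = 1;</script>
  <style>.x { color: red; }</style>
  <p>Second   paragraph.</p>
</div>
"""

def extract_article_text(html, unwanted=("script", "style", "noscript", "iframe")):
    """Return the article text with unwanted tags removed (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", class_="detail__body-text")
    if body is None:
        return ""
    # Remove each unwanted tag together with everything inside it
    for tag in body.find_all(list(unwanted)):
        tag.decompose()
    # Join the remaining text and collapse runs of whitespace into single spaces
    return " ".join(body.get_text(" ").split())

print(extract_article_text(SAMPLE_HTML))
```

In the question's loop, the same idea would apply to the div found on each article page: decompose the unwanted tags first, then read the text. Which tags need removing depends on what actually appears in the output.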