Remove HTML elements from scraping result in Python

Question

I'm scraping an Indonesian news website from here. When I scrape the article from each news link, some HTML elements remain in the text. The output looks like this:

I want to remove those elements so the output is just the article text. I already use .strip(), but it doesn't affect the output. This is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch the list of popular articles
detik = requests.get('https://www.detik.com/terpopuler')
beautify = BeautifulSoup(detik.content, 'html5lib')

news = beautify.find_all('article', {'class': 'list-content__item'})
arti = []
for each in news:
  try:
    title = each.find('h3', {'class': 'media__title'}).text
    lnk = each.a.get('href')
    # Fetch the article page and take the text of the body container
    r = requests.get(lnk)
    soup = BeautifulSoup(r.text, 'html5lib')
    content = soup.find('div', {'class': 'detail__body-text itp_bodycontent'}).text.strip()

    print(title)
    print(lnk)

    arti.append({
      'Headline': title,
      'Content': content,
      'Link': lnk
    })
  except Exception:
    # Skip entries that lack the expected elements or fail to download
    continue
df = pd.DataFrame(arti)
df.to_csv('detik.csv', index=False)

Any help would be appreciated.

score 0 · Answer 1 · answered Nov 09 '20 at 16:45

You might be dealing with invalid tags. This thread might be useful: https://stackoverflow.com/a/8439761/6100602
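
As a rough illustration of dropping unwanted tags before reading the text, here is a minimal sketch. It assumes the stray markup comes from elements such as script or style nested inside the article container; the sample HTML, the helper name extract_article_text, and the list of unwanted tags are hypothetical and not taken from the question or the linked thread. It uses the built-in html.parser so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one downloaded article page.
SAMPLE_HTML = """
<div class="detail__body-text itp_bodycontent">
  <p>First paragraph of the article.</p>
  <script>var ad = "banner";</script>
  <style>.ad { display: none; }</style>
  <p>Second paragraph.</p>
</div>
"""


def extract_article_text(html, unwanted=("script", "style")):
    """Return the article text with the listed tags removed (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ with a single class name matches any element carrying that class.
    body = soup.find("div", class_="detail__body-text")
    if body is None:
        return ""
    # decompose() removes each unwanted tag together with its contents.
    for tag in body.find_all(list(unwanted)):
        tag.decompose()
    # strip=True trims every text fragment and skips the empty ones.
    return body.get_text(separator=" ", strip=True)


text = extract_article_text(SAMPLE_HTML)
print(text)
```

In the question's loop, the same idea would apply to the div found in each article page before calling .text; which tags need to go depends on what actually shows up in the output.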

Vincent Pellerito
