Good day,
I have an HTML document as a string, and I need to find every element whose class contains the word 'content'.
For example:
class='?content?'
where ? stands for any number of characters.
I wanted to pass a variable with the right string instead of 'entry-content'. However, I cannot pass 'div[class*="content"]' there; it doesn't work for me.
If there is a way to match all classes containing 'content' without preprocessing the HTML, that would be perfect. Preprocessing was just my initial idea.
import pandas as pd
import requests
from bs4 import BeautifulSoup
import sys
import urllib.request
USER_AGENT = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
import re
resultText = ''
url = 'http://kakzarabativat.ru/soveti/s-chego-nachat-biznes-ili-poshagovyj-plan-starta-biznesa/'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')  # parse the downloaded page
content = soup.find('div', {'class': 'entry-content'})
raw = content.find_all('p')
for item in raw:
    text = BeautifulSoup(str(item), 'html.parser').get_text()
    resultText += text + ' '
resultText = resultText.replace("\n", "")
resultText = resultText.replace("\xa0", "")
resultText = resultText.replace("\n\n ", "")
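The substring match described above can probably be done without any preprocessing. The following is only a minimal sketch under stated assumptions: the `html` string is a made-up snippet standing in for the real page, and the class names in it are hypothetical. It shows two options that Beautiful Soup supports: a CSS attribute selector passed to `select()` (CSS selectors are not understood by `find()`, which may be why the attempt above failed), and a compiled regular expression passed as `class_` to `find_all()`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page.
html = """
<div class="entry-content"><p>First</p></div>
<div class="post-content-wrap main"><p>Second</p></div>
<div class="sidebar"><p>Ignored</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors go through select(), not find().
# [class*="content"] matches when the class attribute contains "content".
by_css = soup.select('div[class*="content"]')

# find_all() accepts a compiled regex for class_; it is checked against
# each class name with search(), so a substring is enough to match.
by_regex = soup.find_all("div", class_=re.compile("content"))

css_text = [div.get_text(strip=True) for div in by_css]
regex_text = [div.get_text(strip=True) for div in by_regex]
print(css_text, regex_text)
```

Both calls return a list of tags, so the paragraph extraction loop above would need to run over each matched `div` rather than over a single `content` element.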
Sorry if that's a stupid question, or if I'm doing it totally wrong.