Good day,

I have an HTML document as a string, and I need to find every class whose name contains the word 'content'.

For example:

class='?content?'

where ? stands for any number of characters.

I wanted to pass a variable with the right pattern instead of 'entry-content'. However, I cannot pass 'div[class*="content"]' to soup.find; it doesn't work for me.

If there is a way to match all classes containing 'content' without preprocessing the HTML, that would be perfect. Preprocessing was just my initial idea.

import re
import sys
import urllib.request  # a bare "import urllib" does not reliably expose urllib.request

import pandas as pd
import requests
from bs4 import BeautifulSoup

USER_AGENT = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}


resultText = ''
url = 'http://kakzarabativat.ru/soveti/s-chego-nachat-biznes-ili-poshagovyj-plan-starta-biznesa/'
html = urllib.request.urlopen(url).read()
# Parse the downloaded page; without this line "soup" is undefined below
soup = BeautifulSoup(html, 'html.parser')

# Matches only a div whose class is exactly 'entry-content'
content = soup.find('div', {'class': 'entry-content'})
raw = content.find_all('p')
for item in raw:
    text = BeautifulSoup(str(item), 'html.parser').get_text()
    resultText += text + ' '
    # Strip newlines and non-breaking spaces from the accumulated text
    resultText = resultText.replace("\n", "")
    resultText = resultText.replace("\xa0", "")
    resultText = resultText.replace("\n\n ", "")
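
A note on why the selector string fails: 'div[class*="content"]' is a CSS selector, and soup.find treats its first argument as a tag name rather than a selector. BeautifulSoup's select method does accept CSS selectors. The following is a minimal sketch, assuming the page contains divs like the ones in SAMPLE_HTML (hypothetical markup invented for illustration, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page
SAMPLE_HTML = """
<div class="header">menu</div>
<div class="entry-content"><p>First paragraph.</p><p>Second paragraph.</p></div>
<div class="post content-area"><p>Third paragraph.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# select() understands CSS selectors; [class*="content"] is a substring
# match against the whole value of the class attribute
matching_divs = soup.select('div[class*="content"]')

# Collect the text of every <p> inside the matching divs, in document order
paragraphs = [p.get_text() for div in matching_divs for p in div.find_all('p')]
print(paragraphs)
```

With this sample, the "header" div is skipped and the other two divs match, so no preprocessing of the HTML is needed.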

Sorry if that's a stupid question, or if I'm going about this entirely the wrong way.

  • So you're looking for elements that have a class that contains `content`? – ctwheels Apr 17 '18 at 15:39
  • [Why do you want to use regular expressions for this instead of an HTML parser?](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) When the answer is "I already know regex and don't want to learn something new", that's bad enough, but someone who doesn't know how to write the regex for "any number of symbols or characters" isn't even in that situation. – abarnert Apr 17 '18 at 15:42
  • @ctwheels Yeah, that's right. – Nikita Polovinkin Apr 17 '18 at 15:42
  • So what's the expected output? A list of class names? Or a list of all HTML elements with a matching class? If the latter - use a HTML parser. If the former - still consider using a HTML parser instead of regex. – Aran-Fey Apr 17 '18 at 15:43
  • @abarnert I'm a lawyer and a beginner in Python (2 months of learning). I use an HTML parser to get the content of one web page, where I specify film_list = soup.find('div', {'class': 'entry-content'}). But it's not 'entry-content' everywhere; it can be anything, so I want to pass a variable to this 'soup.find' that matches other kinds of content classes. – Nikita Polovinkin Apr 17 '18 at 15:44
  • @NikitaPolovinkin use `BeautifulSoup` – ctwheels Apr 17 '18 at 15:45
  • Possible duplicate of [Beautiful Soup if Class "Contains" or Regex?](https://stackoverflow.com/questions/34660417/beautiful-soup-if-class-contains-or-regex) – ctwheels Apr 17 '18 at 15:45
  • @Aran-Fey I have: `content = soup.find('div', {'class': 'entry-content'}) raw = content.find_all('p') for item in raw: text = BeautifulSoup(str(item), 'html.parser').get_text() resultText += text + ' ' resultText = resultText.replace("\n", "") resultText = resultText.replace("\xa0", "") resultText = resultText.replace("\n\n ", "")` Which matches only one page with specific class, I want to be able to match any case. – Nikita Polovinkin Apr 17 '18 at 15:47
  • @NikitaPolovinkin Put all of the information needed to answer your question _in the question_. Don't wait for people to drag it out of you, and then leave it only in comments that aren't visible unless someone clicks on your question and scrolls down. Edit your question to include a [mcve], and to include the fact that you're trying to do this with BeautifulSoup, not just parsing the raw HTML with regular expressions. A question like that would get an immediate answer or dup-close, instead of comments that don't help you, downvotes, and too-broad/unclear/etc. close votes. – abarnert Apr 17 '18 at 15:54
  • @abarnert Thanks, I edited the post and will take your comments into account in the future. – Nikita Polovinkin Apr 17 '18 at 16:02
  • @ctwheels Thanks! Your link helped. I can't find how to mark your answer as the best one. – Nikita Polovinkin Apr 17 '18 at 16:20
  • @NikitaPolovinkin you can just mark your question as a duplicate of that one, it'll possibly help others find the correct answer in the future :) – ctwheels Apr 17 '18 at 16:35
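
For readers who land here first, the approach the comments point toward (a class that "contains" a word, matched with a regex inside BeautifulSoup) can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the linked question: paragraphs_in_matching_divs and SAMPLE are hypothetical names, and the sample markup is invented. It relies on find_all accepting a compiled regular expression for class_, which BeautifulSoup tests against each class of a tag.

```python
import re

from bs4 import BeautifulSoup


def paragraphs_in_matching_divs(html, class_fragment):
    """Hypothetical helper: join the text of all <p> tags found inside
    <div> elements that have a class containing class_fragment."""
    soup = BeautifulSoup(html, 'html.parser')
    # re.escape keeps the fragment literal; the regex is searched in each class
    pattern = re.compile(re.escape(class_fragment))
    texts = []
    for div in soup.find_all('div', class_=pattern):
        for p in div.find_all('p'):
            # Replace non-breaking spaces and trim surrounding whitespace
            texts.append(p.get_text().replace('\xa0', ' ').strip())
    return ' '.join(texts)


# Invented sample markup, used only to exercise the helper
SAMPLE = (
    '<div class="entry-content"><p>One.</p></div>'
    '<div class="sidebar"><p>Skip.</p></div>'
    '<div class="main content"><p>Two.</p></div>'
)

result = paragraphs_in_matching_divs(SAMPLE, 'content')
print(result)
```

Passing the fragment as a variable covers the "put a variable inside soup.find" case from the comments. One caveat of this sketch: if a matching div is nested inside another matching div, its paragraphs are collected twice, so real pages may need deduplication.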
