
Update so far: BeautifulSoup partly works. How do I remove the text between <style> and </style>?


I am trying to write a function so that from such a text

<style>.card {
 font-family: arial;
 font-size: 20px;
 text-align: center;
 color: black;
 background-color: white;
}
</style>qüestion

<hr id=answer>

änswer

to get only these out

word[0] = qüestion
word[1] = änswer

The words may contain umlauts.

I thought re or regex could probably do the job, but I couldn't get it to work. Thanks for any help :)

Amin
  • Possible duplicate of [matching unicode characters in python regular expressions](http://stackoverflow.com/questions/5028717/matching-unicode-characters-in-python-regular-expressions) – Marcy May 10 '17 at 19:15
  • Regex is generally not regarded as the way to parse HTML; check beautifulsoup or lxml if you can. – Josep Valls May 10 '17 at 19:19
  • I have checked the link given as the possible duplicate. It is still unclear, and I would appreciate a hint! – Amin May 10 '17 at 19:27
  • @JosepValls Thanks, beautifulsoup works partly and removes the tags, but the text between `<style>` and `</style>` remains. – Amin May 10 '17 at 19:35
  • @Amin: I believe you can do that with `soup.find("style").clear()` – zondo May 10 '17 at 21:05

1 Answer


How to remove whatever text between <style> and </style>?

You need to extract() the style tags or clear() them:

>>> from bs4 import BeautifulSoup
>>> s = '''<style>.card {
 font-family: arial;
 font-size: 20px;
 text-align: center;
 color: black;
 background-color: white;
}
</style>question

<hr id=answer>

answer'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> styles = [style.extract() for style in soup('style')] # Or, you may use...
>>> # soup.find("style").clear()
>>> results = soup.text.strip().split()
>>> print(results)
[u'question', u'answer']

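soup('style') is shorthand for soup.find_all('style'), so [style.extract() for style in soup('style')] finds every style tag and removes it, together with its inner content, from soup. After that, soup.text contains only question and answer separated by whitespace, so all you need to do is strip and split the string. Note that split() breaks on any whitespace, so this works as long as each entry is a single word.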
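Since the question asks for a function and mentions umlauts, here is a minimal sketch of how the steps above could be wrapped up in Python 3. The function name `extract_words` and the sample variable `card` are made up for illustration. It uses `decompose()` instead of `extract()`, which also removes the tag but does not return it, and `stripped_strings` instead of `text.strip().split()`, so an entry made of several words stays in one piece.

```python
from bs4 import BeautifulSoup


def extract_words(html):
    """Return the visible text fragments of `html`, ignoring <style> content.

    `extract_words` is a hypothetical helper name, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Remove every <style> tag together with the CSS text inside it.
    for style in soup.find_all("style"):
        style.decompose()
    # stripped_strings yields each remaining text node with surrounding
    # whitespace removed and skips nodes that are only whitespace.
    return list(soup.stripped_strings)


# Sample input modelled on the card in the question (shortened CSS).
card = """<style>.card {
 font-family: arial;
 font-size: 20px;
}
</style>qüestion

<hr id=answer>

änswer"""

words = extract_words(card)
print(words)
```

In Python 3 strings are Unicode by default, so the umlauts need no special handling here; `words[0]` and `words[1]` then hold the two entries.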

Wiktor Stribiżew