
So I have a bunch of strings pulled from an Anki deck of mine.

I want to remove all of the substrings that look like "<font color=...>" and so on. So take a sentence like this:

彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。

and turn it into:

彼女は看護婦です。

And I need to do this for a whole list of sentences. I tried using the following code:

import re

s = '彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。'
x = re.sub(r'\<.+\>','',s)
print(x)

and I get the following output:

彼女はです。

When it should be

彼女は看護婦です。

Essentially, it's removing the middle bit as well instead of just taking out each tag. What I'm trying to do is go through 5400 sentences and turn them into sentences without the extra markup in them.

To take a small subsection of the list it would be like turning this:

さあ、最上級の感謝を贈るぞ

その偉大な画家の<font color="#ff0000"><font color="#ff0000">傑作</font></font>が壁にさかさまにかかっているを見て、彼は驚いた。

彼はキリスト教に<font color="#ff0000"><font color="#ff0000">偏見</font></font>を抱いている

人種的偏見のない人はいないという事実は否定できない。

ボクは旅の途中で近くを通りかかったところをシド王子にここまで誘導されたゴロ

生まれたての稚魚みたいにフラフラと…<br>

滝壺まで泳いで行って一気に滝登りだ!

光っている印が神獣ヴァ・ルッタを制御する端末

<font color="#ff0000"><font color="#ff0000">芝生</font></font>が素敵にみえる。

and turning it into:

さあ、最上級の感謝を贈るぞ
    
その偉大な画家の傑作が壁にさかさまにかかっているを見て、彼は驚いた。
    
彼はキリスト教に偏見を抱いている
    
人種的偏見のない人はいないという事実は否定できない。
    
ボクは旅の途中で近くを通りかかったところをシド王子にここまで誘導されたゴロ
    
生まれたての稚魚みたいにフラフラと…
    
滝壺まで泳いで行って一気に滝登りだ!
    
光っている印が神獣ヴァ・ルッタを制御する端末
    
芝生が素敵にみえる。

Sorry, I'm new to coding, so this stuff is still a little difficult for me.

  • Try `.+?` instead of `.+` – alani Aug 06 '20 at 18:08
  • Looks like a web scraper. I'm currently writing my own website -> ebook scraper myself. You might want to look into [beautiful soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), it's an XML / HTML parser library for Python and is designed to be able to handle badly written HTML as well (Edit: Herpa derp. Didn't read [Juan C.'s answer](https://stackoverflow.com/a/63289452/2716305)) – Lightfire228 Aug 06 '20 at 18:30

2 Answers

Your misunderstanding lies in the pattern you're using to match and substitute. The .+ in r'\<.+\>' is greedy, meaning it will match as much as it possibly can. In the sample you've provided, your pattern takes everything between the first < it finds and the last >, including the text you want to keep. You can visualize that behavior in a tool like Regex101 to make it a bit easier to understand.
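
To see the difference directly, here is a small sketch that lists what each pattern matches in the sample string. The variable names greedy and lazy are only for illustration.

```python
import re

s = '彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。'

# Greedy: .+ runs from the first '<' to the last '>' in the string,
# so the whole tagged region comes back as a single match.
greedy = re.findall(r'<.+>', s)

# Lazy: .+? stops at the first '>' it can reach,
# so every tag is matched on its own.
lazy = re.findall(r'<.+?>', s)

print(greedy)
print(lazy)
```

The greedy pattern returns one long match that swallows 看護婦 along with the tags, while the lazy pattern returns the four tags separately.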

Instead, make your pattern "lazy" by adding a ? after the + quantifier, so that .+ becomes .+?:

import re

s = '彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。'
x = re.sub(r'\<.+?\>','',s)
print(x) # 彼女は看護婦です。
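
Since the question asks about a whole list of sentences, the same substitution can be applied in a list comprehension. This is only a sketch: the names TAG, sentences and cleaned are made up for the example, and the pattern <[^>]+> is an alternative spelling that should behave like <.+?> on input like this.

```python
import re

# Compile once and reuse for every sentence.
# [^>]+ means "one or more characters that are not '>'".
TAG = re.compile(r'<[^>]+>')

# A few sentences taken from the question, standing in for the full list.
sentences = [
    'さあ、最上級の感謝を贈るぞ',
    '彼はキリスト教に<font color="#ff0000"><font color="#ff0000">偏見</font></font>を抱いている',
    '生まれたての稚魚みたいにフラフラと…<br>',
]

# Remove every tag from every sentence.
cleaned = [TAG.sub('', sentence) for sentence in sentences]

for line in cleaned:
    print(line)
```

Sentences without any tags pass through unchanged, so the whole deck can be run through the same comprehension.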


However, you really should be using a proper HTML parser for this type of activity. Regex is generally regarded as not being a good tool for working with HTML content. See Juan C's answer to this question for an example on how you might be able to accomplish that.
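
If installing a third-party library is not an option, Python's standard library also ships an HTML parser in html.parser. The following is a rough sketch of that route, not a drop-in replacement for BeautifulSoup; TextExtractor and strip_tags are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text that appears between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text outside of tags.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of markup with all tags dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


s = '彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。'
print(strip_tags(s))
```

Unlike the regex, this also decodes character references such as &amp;amp; into the characters they stand for, which may or may not be what you want for your deck.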

esqew

If you don't mind using another library, you can easily extract the plain text from HTML with BeautifulSoup:

from bs4 import BeautifulSoup

s = '彼女は<font color="#ff0000"><font color="#ff0000">看護婦</font></font>です。'

# 'lxml' needs the lxml package installed; 'html.parser' is the built-in alternative
soup = BeautifulSoup(s, 'lxml')

print(soup.text)

Output:

彼女は看護婦です。

Juan C
  • I think this may very well be the "right" answer - [HTML should be parsed with a proper HTML parser, not Regex.](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – esqew Aug 06 '20 at 18:13