How do I parse only foreign characters from the text in an HTML file with regular expressions?

Question

I'm trying to parse HTML and automatically change the font of any foreign characters, and I'm having some issues. I've tried a few hackish approaches, but none work really well, and I'm wondering if anyone has any ideas. Is there an easy way in Python to match all the foreign characters (specifically, Japanese Kanji/Hiragana/Katakana) with regular expressions? What I've been using is the complement of a set of non-foreign characters ([^A-Za-z0-9 <>'"=]), but this isn't working well, and I'm worried it will match things enclosed in <...>, which I don't want.

score 2 · Answer 1 · edited May 23 '17 at 10:33

I wouldn't use just regular expressions for this. Down that path lies an angry Tony the Pony.

I'd use an HTML parser in conjunction with regular expressions, though. That way you can distinguish the markup from the non-markup.
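As a rough sketch of that idea, the following uses the standard library's html.parser so the regex only ever sees text nodes, never tag names or attributes. The names JapaneseWrapper and wrap_japanese and the "ja" class are made up for illustration, and the character ranges (Hiragana, Katakana, Katakana Phonetic Extensions, CJK Unified Ideographs) are an assumption about what counts as "foreign" here.

```python
import re
from html.parser import HTMLParser

# Hiragana, Katakana, Katakana Phonetic Extensions, CJK Unified Ideographs (Kanji)
JAPANESE_RUN = re.compile('[\u3040-\u30ff\u31f0-\u31ff\u4e00-\u9fff]+')


class JapaneseWrapper(HTMLParser):
    """Re-emit the HTML, wrapping Japanese runs found in text nodes (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it appeared in the source
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        # Only text between tags reaches this method, so attributes are never matched
        self.out.append(JAPANESE_RUN.sub(r'<span class="ja">\g<0></span>', data))

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)

    def handle_charref(self, name):
        self.out.append('&#%s;' % name)

    def handle_comment(self, data):
        self.out.append('<!--%s-->' % data)

    def handle_decl(self, decl):
        self.out.append('<!%s>' % decl)


def wrap_japanese(html):
    parser = JapaneseWrapper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```

Calling wrap_japanese('<p title="日本">Hello 日本語</p>') leaves the title attribute alone and wraps only the text. One caveat: the contents of script and style elements also arrive through handle_data, so a real version would probably want to skip those.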


answered Aug 18 '10 at 16:46

John


You linked to the question. The answer is [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Kevin Vermeer Aug 18 '10 at 16:54
But that takes out some of the fun of finding Tony! ;) – John Aug 18 '10 at 17:45

Chinmay Kanchi · Answer 2 · 2010-08-18T17:30:58.113

Use BeautifulSoup to get the content that you need, then use a variation on this code to match your characters.

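import re

# range() excludes its end point, so add 1 to include the last code point of each block.
kataLetters = range(0x30A0, 0x30FF + 1)      # Katakana
hiraLetters = range(0x3040, 0x309F + 1)      # Hiragana
kataPunctuation = range(0x31F0, 0x31FF + 1)  # Katakana Phonetic Extensions

# list() makes the concatenation work on Python 3, where range() is not a list.
myLetters = list(kataLetters) + list(kataPunctuation) + list(hiraLetters)

# chr() is the Python 3 spelling; on Python 2 use unichr() instead.
myLetters = u''.join([chr(aLetter) for aLetter in myLetters])

myRe = re.compile(u'[' + myLetters + u']+', re.UNICODE)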

Use the Unicode code charts to get the ranges for your characters.
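For instance, Kanji are not covered by the kana ranges. A minimal sketch that adds them, assuming the CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough for your text, might look like this. BLOCKS and build_pattern are illustrative names, and rarer ideographs in the extension blocks would need further ranges from the charts.

```python
import re

# Inclusive (first, last) code points, taken from the Unicode block listings
BLOCKS = {
    'hiragana': (0x3040, 0x309F),
    'katakana': (0x30A0, 0x30FF),
    'katakana_ext': (0x31F0, 0x31FF),  # Katakana Phonetic Extensions
    'kanji': (0x4E00, 0x9FFF),         # CJK Unified Ideographs (assumed sufficient)
}


def build_pattern(blocks):
    # Express each block as a "first-last" range inside one character class
    parts = ['%s-%s' % (chr(first), chr(last)) for first, last in blocks.values()]
    return re.compile('[' + ''.join(parts) + ']+')


japanese = build_pattern(BLOCKS)
matches = japanese.findall('Tokyo is 東京, hiragana is ひらがな.')
print(matches)
```

Writing the class as ranges rather than joining every individual character keeps the pattern short even when the Kanji block adds roughly twenty thousand code points.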
