I want to remove part of the string below (the first 加護亜依, which precedes "Kago Ai"). The string is stored in the variable oldString.

[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY

I'm using the following regex in Python:

p=re.compile(ur"( [\W]+) (?=[A-Za-z ]+–)", re.UNICODE)
newString=p.sub("", oldString)

When I output newString, nothing has been removed.

Paul Thomas
  • `oldString` should also be converted to Unicode. Is it? How do you obtain it? Try `oldString = unicode(oldString, "utf-8")` before declaring `p`. – Wiktor Stribiżew Sep 30 '15 at 10:21
  • What's your expected output? – Mazdak Sep 30 '15 at 10:23
  • @stribizhev I specify `# -*- coding: utf-8 -*-` at the top of the file; from what I've been reading, this should convert it to Unicode. I obtain the string by scraping an HTML page. @Kasramvd the expected output is "[DMSM-8433] Kago Ai – 加護亜依 vs. FRIDAY" – Paul Thomas Sep 30 '15 at 10:50
  • Try this [snippet](https://ideone.com/fN74qX). – Wiktor Stribiżew Sep 30 '15 at 10:57
  • Related: http://stackoverflow.com/questions/15033196/using-javascript-to-check-whether-a-string-contains-japanese-characters-includi/15034560#15034560 – nhahtdh Sep 30 '15 at 10:58
  • @stribizhev that seems to work like a charm, thanks for that! – Paul Thomas Sep 30 '15 at 14:14
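
One plausible explanation for why nothing is removed, sketched here in Python 3 (where every `str` is Unicode) rather than the Python 2 of the question: with Unicode matching, the kanji count as word characters, so `\W` cannot match them. The sample string is the one from the question; the variable names are only illustrative, and the real cause may differ if `oldString` was a byte string.

```python
import re

old_string = "[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY"

# With Unicode matching (the default for str patterns in Python 3),
# CJK ideographs are word characters: \w matches them and \W does not.
kanji_are_word_chars = re.fullmatch(r"\w+", "加護亜依") is not None

# The pattern from the question, unchanged apart from dropping the
# Python 2 `ur` prefix and the now-redundant re.UNICODE flag.
pattern = re.compile(r"( [\W]+) (?=[A-Za-z ]+–)")
new_string = pattern.sub("", old_string)

print(kanji_are_word_chars)      # whether \w+ covers the whole kanji run
print(new_string == old_string)  # whether the substitution left the text untouched
```

If both lines print `True`, the pattern never had a chance to match the Japanese run, which is consistent with the symptom described in the question.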

2 Answers

You can use the following snippet to solve the issue:

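#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

# Avoid naming the variable `str`, which would shadow the built-in type.
text = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
# One or more Japanese characters plus a space, but only when followed by
# Latin letters/spaces and then an en dash.
regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)'
p = re.compile(regex, re.U)
match = p.sub("", text)
# Python 2: encode the Unicode result explicitly before printing.
print match.encode("UTF-8")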

See IDEONE demo

Besides the `# -*- coding: utf-8 -*-` declaration, I have added @nhahtdh's character class to match Japanese characters.

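Note that the match is encoded to a UTF-8 byte string "manually" before printing. Python 2 may otherwise try to encode the Unicode string with a default codec such as ASCII (for example, when the output is piped) and fail, so it needs to be "reminded" that we are working with Unicode all the time.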
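
For readers on Python 3, the same approach can be sketched as follows. This is a port under the assumption that the input is already a `str`; the names `JAPANESE`, `PATTERN` and `strip_leading_japanese_name` are illustrative and not part of the original answer. No `u` prefix, coding declaration or explicit `encode` call is needed here.

```python
import re

# Character class from the answer above: CJK punctuation, hiragana,
# katakana, full-width/half-width forms, and two CJK ideograph blocks.
JAPANESE = (
    "[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff"
    "\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]"
)

# One or more Japanese characters plus a space, but only when followed by
# Latin letters/spaces and then an en dash (the lookahead consumes nothing).
PATTERN = re.compile(JAPANESE + "+ (?=[A-Za-z ]+–)")


def strip_leading_japanese_name(text):
    """Remove the Japanese run that precedes the romanised name."""
    return PATTERN.sub("", text)


text = "[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY"
cleaned = strip_leading_japanese_name(text)
print(cleaned)
```

The second 加護亜依 is left alone because the text after it ("vs. FRIDAY") does not satisfy the lookahead, which requires only Latin letters and spaces up to an en dash.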

Wiktor Stribiżew

I think you should use a regular expression like this one:

([\p{Hiragana}\p{Katakana}\p{Han}]+)

Please also refer to this documentation.

EDIT: I also tested it here.
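
In Python, Unicode script properties like these are not understood by the standard `re` module; a minimal sketch using the third-party `regex` package (installed separately, e.g. with `pip install regex`) might look like the following. Combining the character class with the lookahead from the question is an assumption made here to get the asker's expected output, and the helper name `strip_with_script_properties` is hypothetical.

```python
try:
    import regex  # third-party package, not part of the standard library
except ImportError:
    regex = None  # the example is skipped when the package is not installed

text = "[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY"


def strip_with_script_properties(s):
    """Use Unicode script properties instead of hand-written code point ranges."""
    if regex is None:
        return None
    pattern = regex.compile(
        r"[\p{Hiragana}\p{Katakana}\p{Han}]+ (?=[A-Za-z ]+–)"
    )
    return pattern.sub("", s)


result = strip_with_script_properties(text)
print(result)
```

The property names keep the pattern readable and avoid maintaining code point ranges by hand, at the cost of an extra dependency.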

teoreda
  • Python re doesn't support Unicode properties. Of course, there is [regex](https://pypi.python.org/pypi/regex) package, but you need to mention it in the answer. (Also, I'm not quite sure whether the syntax above would be accepted in regex package) – nhahtdh Sep 30 '15 at 10:39
  • That seems to work with PHP but not with Python; when run through Python, it strips "Kag" and "i" from "Kago Ai" – Paul Thomas Sep 30 '15 at 10:44
  • @nhahtdh I'm using the re package at the moment, didn't realise there was another. I'll have a read through the link – Paul Thomas Sep 30 '15 at 10:52
  • [Test it with proper settings](https://regex101.com/r/oE0oL5/3). Results are very different. – Wiktor Stribiżew Sep 30 '15 at 10:52
  • I agree with you @stribizhev – teoreda Sep 30 '15 at 10:54
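
The behaviour reported in the comments above can be reproduced with the standard library alone. The sketch below assumes a recent Python 3, where `re` rejects `\p` as a bad escape; older versions, including the Python 2 used in the question, treated the unknown escape as a literal `p`, so the character class collapsed into the individual letters of the property names. The second pattern writes that collapsed class out by hand for illustration.

```python
import re

# Recent Python 3 versions reject \p outright instead of silently misreading it.
try:
    re.compile(r"[\p{Hiragana}\p{Katakana}\p{Han}]+")
    rejected = False
except re.error:
    rejected = True

# With \p read as a literal "p", the class contains only the characters
# p, {, } and the letters of "Hiragana", "Katakana" and "Han".
literal_class = re.compile(r"[p{Hiragana}p{Katakana}p{Han}]+")
leftover = literal_class.sub("", "Kago Ai")

print(rejected)
print(repr(leftover))
```

The letters K, a, g and i all occur in the property names, which would explain why exactly "Kag" and "i" disappeared from "Kago Ai".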