
It is easy to split text at non-alphanumeric characters using a regex:

tokens=re.split(r'(?u)\W+',text) #to split at any non-alpha unicode character

and this answer provides a way to split at certain characters. However, what I need is:

  1. split at any Unicode non-alphanumeric character
  2. give the regex the following exceptions:

    • underscores "_"
    • the forward slash "/"
    • ampersand "&" and at sign "@"
    • full stops surrounded by digits (\d+\.\d+)
    • full stops preceded by certain arbitrary strings such as "Mr.", "Dr.", etc.

I can easily detect any of these using a regex, but the question is how to tell the regex to treat them as exceptions when splitting at non-alphanumeric characters.
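For reference, here is a minimal sketch of the baseline behaviour, assuming Python 3 (where `str` patterns are Unicode-aware by default). The `sample` string is a shortened, illustrative excerpt and not the full text. It shows that the underscore already survives, because `\w` includes it, while the other listed characters act as separators:

```python
import re

# Illustrative excerpt only; not the full example text.
sample = "is@a&test example_cool more/fun 43.35"

# \w covers letters, digits and "_", so "_" never triggers a split.
# "@", "&", "/", "." and the space are all \W and therefore split the text.
tokens = re.split(r'(?u)\W+', sample)
print(tokens)
# ['is', 'a', 'test', 'example_cool', 'more', 'fun', '43', '35']
```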


EDIT: Here is an example text I am trying to match:

text="Mr. Jones email jones@gmail.com 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. وبعدها رجعنا إلى المنزل، وقابلنا أصدقاءنا؛ وشربنا الشاي."

and here is its version in unicode (notice the non-alpha characters in Arabic u'\u060c', u'\u061b')

unicode_text=u'Mr. Jones email jones@gmail.com 12.455 12,254.25 says This is@a&test example_cool man+right more/fun 43.35. And so we stopped. And then we started again. \u0648\u0628\u0639\u062f\u0647\u0627 \u0631\u062c\u0639\u0646\u0627 \u0625\u0644\u0649 \u0627\u0644\u0645\u0646\u0632\u0644\u060c \u0648\u0642\u0627\u0628\u0644\u0646\u0627 \u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627\u061b \u0648\u0634\u0631\u0628\u0646\u0627 \u0627\u0644\u0634\u0627\u064a.'

Here is the result of the regex in the answer provided:

re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+',unicode_text)

[u'Mr.', u'Jones', u'email', u'jones@gmail.com', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man+right', u'more/fun', u'43.35.', u'And', u'so', u'we', u'stopped.', u'And', u'then', u'we', u'started', u'again.', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a.']

Notice that the regex did not split at full stops at the end of words, so it would be nice to have something that deals with this.

hmghaly
  • yes, this is what I want – hmghaly Oct 18 '13 at 21:21
  • So what have you tried? This is quite simple except for the last parts. Note that `\w` matches alphanumeric characters and an underscore `_`! So `\W` is exactly the reverse of it. – HamZa Oct 18 '13 at 21:24
  • I tried this: tokens=re.split('(?u)[^\w_@/]|(?<!\d)[,.](?!\d)',string) but it didn't work... – hmghaly Oct 18 '13 at 21:37
  • I'm not sure what you mean by "comparing"... I want the regex to split around any non-alpha character unless this character is [.,] and it is surrounded by things – hmghaly Oct 18 '13 at 21:53
  • When you say "it didn't work" please be specific. What did it match? Anything? Did the script fail with an error? – SethMMorton Oct 19 '13 at 01:17
  • Hi @SethMMorton I made an edit with examples. – hmghaly Oct 19 '13 at 12:17
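
A quick check of the attempt quoted in the comments suggests why it fails. The first alternative `[^\w_@/]` already matches every full stop and comma, so the lookarounds in the second alternative never get to protect them; and since there is no `+`, adjacent separators produce empty strings. The sketch below assumes Python 3 and uses a short, made-up input:

```python
import re

# The pattern from the comment, unchanged.
attempt = r'(?u)[^\w_@/]|(?<!\d)[,.](?!\d)'

# "." is not in the excluded set, so the first alternative splits on it
# everywhere, including between digits and after "Mr".
result = re.split(attempt, 'Mr. Jones 12.455')
print(result)
# ['Mr', '', 'Jones', '12', '455']
```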

2 Answers


The key is to use a negative lookahead. I think this covers all the examples on your list, but let me know if there's something I missed.

In [549]: re.split(r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+', "Mr.Jones says This is@a&test example_cool man+right more/fun 43.35")
Out[549]: ['Mr.Jones', 'says', 'This', 'is@a&test', 'example_cool', 'man+right', 'more/fun', '43.35']

The negative lookahead (?!...) prevents a split from starting at any character listed inside the brackets. Let me know if I understood the question correctly.
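
One caveat worth checking: inside `[...]` each item is a single character (or a class such as `\d`), so `\d+\.\d+` and `Mr\.` are not treated as sequences. A small sketch, assuming Python 3 and a made-up input string, makes the effect visible:

```python
import re

# The bracketed part of the pattern above, on its own. It is simply the
# set of characters  + & / @ . M r  plus any digit.
blocked = re.compile(r'[\+&\/@\d+\.\d+Mr\.]')
for ch in ['+', '&', '/', '@', '.', 'M', 'r', '7']:
    assert blocked.fullmatch(ch)
assert not blocked.fullmatch(',')

pattern = r'(?u)(?![\+&\/@\d+\.\d+Mr\.])\W+'

# Every "." is protected, whatever surrounds it, and "," is not protected.
parts = re.split(pattern, "we stopped. And 12,254.25")
print(parts)
# ['we', 'stopped.', 'And', '12', '254.25']
```

This is consistent with full stops at the end of words staying attached to the word and with `12,254.25` being split at the comma.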

Kyle Hannon
  • Thank you, but it didn't work as desired, please see my edit above. – hmghaly Oct 19 '13 at 12:18
  • What I'm getting from you is that the answer worked for the problem you provided, but now you want it to match Arabic? Non alpha characters in foreign language should be handled by the re library. If the standard definition of non-alpha doesn't match yours, simply extend the methodology I explained. – Kyle Hannon Oct 21 '13 at 02:03

I don't think you want to split e-mail addresses like jones@gmail.com into "jones@gmail" and "com", so I changed your exception requirement from "full stops surrounded by digits" to "full stops followed by an alphanumeric character".

re.split(r'(?u)(?![_/&@.])\W+|(?<!Mr|Dr)\.(?!\w)\W*', unicode_text)

[u'Mr.', u'Jones', u'email', u'jones@gmail.com', u'12.455', u'12', u'254.25', u'says', u'This', u'is@a&test', u'example_cool', u'man', u'right', u'more/fun', u'43.35', u'And', u'so', u'we', u'stopped', u'And', u'then', u'we', u'started', u'again', u'\u0648\u0628\u0639\u062f\u0647\u0627', u'\u0631\u062c\u0639\u0646\u0627', u'\u0625\u0644\u0649', u'\u0627\u0644\u0645\u0646\u0632\u0644', u'\u0648\u0642\u0627\u0628\u0644\u0646\u0627', u'\u0623\u0635\u062f\u0642\u0627\u0621\u0646\u0627', u'\u0648\u0634\u0631\u0628\u0646\u0627', u'\u0627\u0644\u0634\u0627\u064a', u'']
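
The lookbehind `(?<!Mr|Dr)` works because both alternatives have the same length; Python's `re` requires lookbehinds to be fixed-width. One possible way to allow abbreviations of different lengths, and to drop the trailing empty string, is to chain one lookbehind per abbreviation. This is only a sketch under assumptions: Python 3 is assumed, and the names `ABBREVIATIONS`, `SPLITTER` and `tokenize`, as well as the "Prof" entry, are hypothetical and not part of the original pattern.

```python
import re

# Hypothetical list of abbreviations; extend as needed.
ABBREVIATIONS = ("Mr", "Dr", "Prof")

# One negative lookbehind per abbreviation. Each is fixed-width on its own,
# so the abbreviations may differ in length.
NOT_AFTER_ABBREV = "".join("(?<!%s)" % re.escape(a) for a in ABBREVIATIONS)

SPLITTER = re.compile(
    # a run of non-word characters that does not start with / & @ or .
    r"(?![_/&@.])\W+"
    # or a full stop that is not followed by a word character
    # and does not follow one of the abbreviations
    + "|" + NOT_AFTER_ABBREV + r"\.(?!\w)\W*"
)

def tokenize(text):
    """Split text with SPLITTER and discard empty strings."""
    return [tok for tok in SPLITTER.split(text) if tok]

print(tokenize("Prof. Smith met Dr. Jones at 12.45. They left."))
# ['Prof.', 'Smith', 'met', 'Dr.', 'Jones', 'at', '12.45', 'They', 'left']
```

As in the pattern above, "+" is not in the exception list, so `man+right` would still be split.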

Armali