
In most news articles, the first sentence starts with a location followed by a colon or a dash, such as:

KUALA LUMPUR: North Korea and Malaysia on Monday locked horns over the investigation into the killing of leader Kim Jong-Un’s brother, as footage emerged of the moment he was fatally attacked in a Kuala Lumpur airport.

PORTLAND, Maine — FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth.

I am trying to use re to separate out the latter half, which is the main sentence, such as:

North Korea and Malaysia on Monday locked horns over the investigation into the killing of leader Kim Jong-Un’s brother, as footage emerged of the moment he was fatally attacked in a Kuala Lumpur airport.

I use the following regex to separate them:

sep = re.split('-|:|--', sent)

But this doesn't work for everything; the result for the second sentence is:

['PORTLAND, Maine \xe2\x80\x94 FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth.']

Does this have anything to do with Unicode? Or do I need to pass a different kind of hyphen in the regex?

Is there a universal way to do this better?

Thanks.

Sean

2 Answers


As you've guessed, the problem is the Unicode character in the string. The separator in PORTLAND, Maine — FairPoint Communications is an em dash, which has no ASCII equivalent. In a Python 2 byte string it is stored as the three UTF-8 bytes \xe2\x80\x94 rather than as the single code point \u2014, so neither - nor -- in your pattern matches it.

There are a few options that will allow you to do what you want to:

  • declare the source code encoding as UTF-8 (put # -*- coding: utf-8 -*- on either of the first two lines) and add the extra character to your regex.
  • convert the string to ASCII using one of the available libraries (see convert a unicode string).
  • use a Unicode-compatible regex with re (sep = re.split(ur'-|:|--|\u2014', sent)); the ur prefix is Python 2 syntax, and sent should be a unicode string rather than a byte string.
  • or, as advised in the re documentation, use the regex module.
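The third option can be sketched roughly as follows. This is a minimal example that assumes Python 3, where every str is already Unicode; the names DATELINE_SEP and strip_dateline are hypothetical and not part of the question or of any library. It splits only once, so hyphens later in the sentence (as in "Kim Jong-Un") are left alone, but it still assumes that every input starts with a dateline.

```python
import re

# Separators seen after a dateline: "--", hyphen, colon,
# en dash (U+2013) and em dash (U+2014). The \s* on both sides
# swallows the spaces around the separator.
DATELINE_SEP = re.compile(r"\s*(?:--|[-:\u2013\u2014])\s*")


def strip_dateline(sentence):
    """Return the text after the first separator, or the sentence unchanged."""
    # maxsplit=1 stops after the first separator, so later hyphens survive.
    parts = DATELINE_SEP.split(sentence, maxsplit=1)
    return parts[1] if len(parts) == 2 else sentence


print(strip_dateline("KUALA LUMPUR: North Korea and Malaysia on Monday locked horns"))
print(strip_dateline("PORTLAND, Maine \u2014 FairPoint Communications has asked regulators"))
```

A sentence with no dateline but with a hyphenated word would still be cut at that hyphen, so a stricter pattern (for example one that requires the leading text to be upper-case) may be needed for mixed input.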
KMR

Since your second sentence contains a Unicode character, you need to declare the source code encoding before running your code, because Python 2's default source encoding is ASCII. Moreover, you're trying to split the sentence using the wrong character, --. It needs to be — (the em dash, which is a Unicode character).

python ( demo )

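#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The coding line above lets Python 2 accept the literal em dash below.
import re
sent = "PORTLAND, Maine — FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth."
# Split on a hyphen, a colon, or an em dash (U+2014).
sep = re.split('-|:|—', sent)
print(sep)  # the parenthesized form runs on both Python 2 and Python 3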
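To see why the original pattern found nothing to split on, it may help to look at the em dash as bytes. The sketch below assumes Python 3 and uses hypothetical variable names (raw, text, parts); it reproduces the \xe2\x80\x94 sequence from the question and then decodes the bytes to text before splitting.

```python
import re

# Simulate input that arrives as UTF-8 bytes, as in the question's output.
raw = "PORTLAND, Maine \u2014 FairPoint Communications".encode("utf-8")

# The em dash is one code point but three bytes in UTF-8.
assert "\u2014".encode("utf-8") == b"\xe2\x80\x94"
assert b"\xe2\x80\x94" in raw

# Decode to text first, then split on the em dash as a single character.
text = raw.decode("utf-8")
parts = re.split(r"\s*\u2014\s*", text, maxsplit=1)
print(parts)  # ['PORTLAND, Maine', 'FairPoint Communications']
```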
m87