In most news articles, the first sentence starts with a location followed by a dash or a colon, such as:
KUALA LUMPUR: North Korea and Malaysia on Monday locked horns over the investigation into the killing of leader Kim Jong-Un’s brother, as footage emerged of the moment he was fatally attacked in a Kuala Lumpur airport.
PORTLAND, Maine — FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth.
I am trying to use re to separate out the latter half, which is the main sentence, such as:
North Korea and Malaysia on Monday locked horns over the investigation into the killing of leader Kim Jong-Un’s brother, as footage emerged of the moment he was fatally attacked in a Kuala Lumpur airport.
I use the following regex to separate them:
sep = re.split('-|:|--', sent)
But this doesn't work for everything. The result for the second sentence is:
['PORTLAND, Maine \xe2\x80\x94 FairPoint Communications has asked regulators for permission to stop signing up new customers for regulated landline service in Scarborough, Gorham, Waterville, Kennebunk and Cape Elizabeth.']
Does this have anything to do with Unicode? Or do I need to pass a different kind of hyphen in the regex?
Is there a more general way to do this?
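For reference, the bytes \xe2\x80\x94 in the output above are the UTF-8 encoding of an em dash (U+2014), which is a different character from the ASCII hyphen in my pattern. Below is a minimal sketch of what I think a fix might look like, assuming Python 3 str input (in Python 2 the byte string would need to be decoded to unicode first). The names DATELINE_SEP and strip_dateline are hypothetical, and the set of separators is my guess at what datelines use.

```python
import re

# Hypothetical separator pattern: a colon, an em dash (U+2014), an en dash
# (U+2013) or a double hyphen followed by whitespace, or a single hyphen
# with whitespace on both sides (so "Jong-Un" is left alone).
DATELINE_SEP = re.compile(r'\s*(?:--|[:\u2014\u2013])\s+|\s+-\s+')

def strip_dateline(sent):
    """Return the text after the first dateline separator, or sent unchanged."""
    parts = DATELINE_SEP.split(sent, maxsplit=1)
    return parts[1] if len(parts) == 2 else sent

# The bytes from the output above decode to a single em dash character.
em_dash = b'\xe2\x80\x94'.decode('utf-8')

print(strip_dateline('KUALA LUMPUR: North Korea and Malaysia on Monday locked horns.'))
print(strip_dateline('PORTLAND, Maine ' + em_dash + ' FairPoint Communications has asked regulators.'))
```

This only splits on the first separator, so a sentence with no dateline but with a colon followed by a space would still be split in the wrong place. I am not sure how to rule that out in general.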
Thanks.