Python string splitting with more than one separator and non roman characters

Question

I've asked about this here recently, but I have one more example that I can't get to work.

import re

title = "Nad Ziemią / Above Ground – test - filmy i seriale"

if title.find('/') >= 0:
    original_title = (re.split('[-/()]', title)[1])

print(original_title)

The result of this will be:

Above Ground - test

I also need to split on the second dash to get only the movie title:

Above Ground

Is it possible to do it all in one step?

Regards.

Couldn't you just do: `title.split('/ ')[1].split(' -')[0]` ? – Austin, Oct 04 '18 at 17:00

score 2 · Answer 1 · answered Oct 04 '18 at 17:20

Looking into your question further, it turns out that character isn't a normal hyphen (it sits slightly higher). Copy it into your regex and you'll see:

import re

title = "Nad Ziemią / Above Ground – test - filmy i seriale"

if title.find('/') >= 0:
    original_title = (re.split(r'[–\-/()]', title)[1])


print(original_title)

Bonus points if anyone can work out what the character is.
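To answer the bonus question, the standard library's `unicodedata` module can name each character. The following is a small sketch that lists the punctuation in the title; the variable name `names` is only illustrative.

```python
import unicodedata

title = "Nad Ziemią / Above Ground – test - filmy i seriale"

# Map every character that is neither alphanumeric nor whitespace
# to its official Unicode name.
names = {
    ch: unicodedata.name(ch)
    for ch in set(title)
    if not ch.isalnum() and not ch.isspace()
}

# Print in code point order so the output is stable.
for ch in sorted(names):
    print("U+{:04X} {}".format(ord(ch), names[ch]))
```

On the title above this should report HYPHEN-MINUS (U+002D), SOLIDUS (U+002F) and EN DASH (U+2013), so the first dash is an en dash.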

answered Oct 04 '18 at 17:20

Sven Harris

Wow, I need to buy new glasses, I am sorry, thanks. – serengeti Oct 04 '18 at 17:22

score 2 · Answer 2 · answered Oct 04 '18 at 17:21

With regex you can use a positive lookbehind assertion. Find the documentation here :)

import re

title = "Nad Ziemią / Above Ground – test - filmy i seriale"

if title.find('/') >= 0:
    # Raw string avoids the invalid-escape warning for \w; the match keeps surrounding spaces
    original_title = re.search(r'(?<=[-/()])[ \w]+', title)

print(original_title.group(0).strip())

Output:

Above Ground
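If titles may use different dash characters, one option is to wrap the search in a small helper that stops at any of them. This is only a sketch: the function name `extract_original_title` and the choice of dashes (hyphen-minus, en dash, em dash) are assumptions, not something from the question.

```python
import re

# After the first slash, skip whitespace and capture everything up to the
# next slash, parenthesis, hyphen-minus, en dash (U+2013) or em dash (U+2014).
# The set of stop characters is an assumption; extend it as needed.
PATTERN = re.compile(r"/\s*([^/()\-\u2013\u2014]+)")


def extract_original_title(title):
    """Return the text after the first slash, or None if nothing matches."""
    match = PATTERN.search(title)
    return match.group(1).strip() if match else None


print(extract_original_title("Nad Ziemią / Above Ground – test - filmy i seriale"))
```

Note that a title containing its own hyphen (for example a hyphenated name) would be cut at that hyphen, so this only suits data where a dash always ends the title.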

benvc · Accepted Answer · 2018-10-04T20:39:59.207

1

IMPORTANT: The code below works as written in Python 3, but for Python 2.7 (or older versions) you will need to deal with the differences in default encoding. See Unicode HOWTO: Unicode Literals in Python Source Code to determine what might be needed in your specific situation.

This is a little trickier than it first appears, because the string contains non-Roman characters and the first and second dashes are not actually the same character (the first one is an en dash). You can get the result you are looking for without regex if you first encode the string, then split on the en dash's UTF-8 bytes, then split the first piece on the forward slash, and finally decode the result.

title = "Nad Ziemią / Above Ground – test - filmy i seriale"

title.encode().split(b'\xe2\x80\x93')[0].split(b'/')[1].decode().strip()

# OUTPUT
# Above Ground
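In Python 3 the round trip through bytes is optional, because `str` is already Unicode and can be split on the en dash directly. A minimal sketch of that variant follows; the names `EN_DASH` and `before_dash` are illustrative.

```python
title = "Nad Ziemią / Above Ground – test - filmy i seriale"

EN_DASH = "\u2013"  # the first dash in the title

# Keep everything before the en dash, then take the part after the slash.
before_dash = title.split(EN_DASH)[0]
original_title = before_dash.split("/")[1].strip()

print(original_title)
```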

edited Oct 04 '18 at 20:39

answered Oct 04 '18 at 17:27

benvc

Well, now I see there are some problems with those chars in Python 2.7. Tried to use your solution but got: `UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 2: ordinal not in range(128)` – serengeti Oct 04 '18 at 18:13
Oh, had to add: `import sys; reload(sys); sys.setdefaultencoding('utf8')`. Now it works fine. – serengeti Oct 04 '18 at 18:20
@serengeti - really important point. Glossed over the python-2.7 tag on your question, which is a real problem in this case. Python 2.7 default encoding is ASCII and before 2.4 default encoding was Latin-1. There are a number of ways to work around this issue and anyone using older versions of Python should read [Unicode HOWTO: Unicode Literals in Python Source Code](https://docs.python.org/2/howto/unicode.html#unicode-literals-in-python-source-code) to help determine what will work in their specific situation. – benvc Oct 04 '18 at 20:37
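One possible workaround that avoids changing the interpreter-wide default encoding is to use `u''` literals with `\u` escapes, which should behave the same in Python 2.7 and Python 3.3+. This is a sketch under that assumption, not a tested recipe for every setup.

```python
import re

# Escapes keep this source pure ASCII, so no encoding declaration is needed.
# \u0105 is the Polish letter a-with-ogonek, \u2013 is the en dash.
title = u"Nad Ziemi\u0105 / Above Ground \u2013 test - filmy i seriale"

# Split on en dash, slash, parentheses or hyphen; the trailing "-" inside
# the character class is a literal hyphen.
parts = re.split(u"[\u2013/()-]", title)
original_title = parts[1].strip()

print(original_title)
```

Because both the pattern and the title are Unicode strings, the split never falls back to the ASCII codec.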
