Split text based on multiple separators ('\n', '/')

Question

Suppose I have a document like this:

document = ["This is a document\nwhich has to be splitted\nOK/Right?"]

and I would like to split this document, as a first step, wherever a '\n' or '/' occurs.

So the document above should be transformed into the following:

document = ["This is a document", "which has to be splitted", "OK", "Right?"]

How can I do this?

Keep in mind that the text may contain other special characters, and I do not want to remove them for now.

score 2 · Answer 1 · edited May 23 '19 at 15:46

Use the re module to split a string on multiple characters or combinations of characters:

import re
document = ["This is a document\nwhich has to be splitted\nOK/Right?"]
re.split("[\n/]", document[0])

which produces the requested strings:

['This is a document', 'which has to be splitted', 'OK', 'Right?']
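If the set of separators may grow, one possible generalization is to build the pattern from a list. This is only a sketch: the helper name split_on_any and its separators parameter are made up for illustration, and it assumes a non-empty list of separators. re.escape keeps characters that are special in regular expressions (such as '.' or '|') from being read as pattern syntax.

```python
import re

def split_on_any(text, separators):
    # Hypothetical helper: escape each separator so it is matched literally,
    # then join the escaped separators into a single alternation.
    pattern = "|".join(re.escape(sep) for sep in separators)
    return re.split(pattern, text)

document = ["This is a document\nwhich has to be splitted\nOK/Right?"]
print(split_on_any(document[0], ["\n", "/"]))
# ['This is a document', 'which has to be splitted', 'OK', 'Right?']
```

All other characters in the text are left untouched, which matches the requirement in the question.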

edited May 23 '19 at 15:46

Outcast

answered May 23 '19 at 15:37

Comos

score 0 · Answer 2 · edited May 23 '19 at 19:15

This is a case where regular expressions shine. Use Python's re module:

>>> import re
>>> document = ["This is a document\nwhich has to be splitted\nOK/Right?"]
>>> re.split(r"[\n/]", document[0])
['This is a document', 'which has to be splitted', 'OK', 'Right?']

This SO post has the most discussion on this topic
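One detail worth knowing, shown here on a made-up input: when two separators appear back to back, the character class produces an empty string between them. Adding + to the class treats a run of separators as one. Whether that is wanted depends on the data, so take this as a sketch rather than a recommendation.

```python
import re

text = "OK\n/Right?"  # made-up input with two separators in a row

print(re.split(r"[\n/]", text))   # ['OK', '', 'Right?']
print(re.split(r"[\n/]+", text))  # ['OK', 'Right?']
```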

edited May 23 '19 at 19:15

Outcast

answered May 23 '19 at 15:44

Steven Kneiser

score 0 · Answer 3 · answered May 23 '19 at 15:46

You can use re.split():

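import re

def split_document(document):
    # Split every string in the list on '\n' or '/' and collect the pieces
    # in one flat list. A loop avoids hitting the recursion limit on long lists.
    parts = []
    for text in document:
        parts.extend(re.split(r"\n|/", text))
    return parts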
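A compact variant of the same idea, for a list that holds several strings, is a compiled pattern plus a list comprehension. The names SEPARATORS and split_documents are hypothetical, chosen only for this sketch.

```python
import re

SEPARATORS = re.compile(r"[\n/]")  # compiled once, reused for every string

def split_documents(documents):
    # Hypothetical helper: split each string and flatten the pieces into one list.
    return [part for text in documents for part in SEPARATORS.split(text)]

docs = ["This is a document\nwhich has to be splitted", "OK/Right?"]
print(split_documents(docs))
# ['This is a document', 'which has to be splitted', 'OK', 'Right?']
```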

answered May 23 '19 at 15:46

Tianbo Ji

score 0 · Answer 4 · answered May 23 '19 at 16:19

Using re.split() is probably the best solution.

An alternative solution without regular expressions:

document = ["This is a document\nwhich has to be splitted\nOK/Right?"]
# Turn every '/' into '\n', then split on line boundaries.
parts = document[0].replace('/', '\n').splitlines()
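The two approaches are not fully interchangeable. str.splitlines() also breaks on other line boundaries such as '\r', and it does not return a trailing empty string when the text ends with a newline. The comparison below uses a made-up input to show both differences.

```python
import re

text = "a\rb/c\n"  # made-up input: a carriage return inside, a newline at the end

print(text.replace('/', '\n').splitlines())  # ['a', 'b', 'c']
print(re.split(r"[\n/]", text))              # ['a\rb', 'c', '']
```

If only '\n' and '/' should count as separators, re.split() is the safer choice of the two.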

answered May 23 '19 at 16:19

Elias Strehle
