
Suppose I have a document like this:

document = ["This is a document\nwhich has to be splitted\nOK/Right?"]

I would like to split this document (for a start) wherever I encounter '\n' or '/'.

So the document above should be transformed into the following:

document = ["This is a document", "which has to be splitted", "OK", "Right?"]

How can I do this?

Keep in mind that the text may contain other special characters, and I do not want to remove them for now.

Outcast

4 Answers


Use the re module to split a string on multiple characters or combinations of characters:

import re

document = ["This is a document\nwhich has to be splitted\nOK/Right?"]
re.split("[\n/]", document[0])

which produces the requested strings:

['This is a document', 'which has to be splitted', 'OK', 'Right?']
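
One edge case may be worth checking, assuming the real text can contain consecutive delimiters (the sample does not): the single-character class splits at each delimiter separately and leaves empty strings in the result. A minimal sketch, with a made-up input string, of how adding `+` to the pattern changes that:

```python
import re

text = "a\n\nb//c"  # hypothetical input with doubled delimiters

# Each delimiter splits on its own, so empty strings appear between them.
single = re.split(r"[\n/]", text)
print(single)  # ['a', '', 'b', '', 'c']

# With +, a run of delimiters is treated as one separator.
merged = re.split(r"[\n/]+", text)
print(merged)  # ['a', 'b', 'c']
```

Which behaviour is preferable depends on whether empty lines carry meaning in the document.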

Outcast
Comos

This is a case where regular expressions shine. Use Python's re module:

>>> import re
>>> document = ["This is a document\nwhich has to be splitted\nOK/Right?"]
>>> re.split(r"[\n/]", document[0])
['This is a document', 'which has to be splitted', 'OK', 'Right?']

This SO post has the most discussion on this topic
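
If the set of delimiters is likely to grow, the pattern can be built from a list instead of being written by hand. The sketch below is one possible way to do that; the helper name `split_on` and its parameters are illustrative, not part of the re module. `re.escape` keeps characters such as `.` or `|` literal if they are added as delimiters later.

```python
import re

def split_on(text, delimiters):
    # Hypothetical helper: escape each delimiter so it is matched literally,
    # then join the alternatives with | into a single pattern.
    pattern = "|".join(re.escape(d) for d in delimiters)
    return re.split(pattern, text)

document = ["This is a document\nwhich has to be splitted\nOK/Right?"]
parts = split_on(document[0], ["\n", "/"])
print(parts)
# ['This is a document', 'which has to be splitted', 'OK', 'Right?']
```

When one delimiter is a prefix of another (for example `/` and `//`), listing the longer one first is probably the safer order, since alternatives are tried left to right.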

Outcast

You can use re.split():

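import re

def split_document(document):
    # Split every string in the list on '\n' or '/' and collect the
    # pieces in one flat list. A loop avoids hitting the recursion
    # limit on long lists.
    parts = []
    for text in document:
        parts.extend(re.split(r"[\n/]", text))
    return parts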
Tianbo Ji

Using re.split() is probably the best solution.

An alternative solution without regular expressions:

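document = ["This is a document\nwhich has to be splitted\nOK/Right?"]
# Turn every '/' into '\n' so that a single line split handles both.
document[0] = document[0].replace('/', '\n')
# Note: splitlines() also splits on other line boundaries such as '\r' and '\x0c'.
document[0].splitlines()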
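
The same idea can be extended to any number of delimiters: replace each of them with one chosen delimiter, then split once. This is only a sketch; the function name `split_without_re` is made up for illustration. It uses `split()` on the exact delimiter rather than `splitlines()`, so other line-boundary characters in the text are left alone.

```python
def split_without_re(text, delimiters):
    # Hypothetical helper: normalise every delimiter to the first one,
    # then split a single time on that delimiter.
    first, *rest = delimiters
    for d in rest:
        text = text.replace(d, first)
    return text.split(first)

document = ["This is a document\nwhich has to be splitted\nOK/Right?"]
parts = split_without_re(document[0], ["\n", "/"])
print(parts)
# ['This is a document', 'which has to be splitted', 'OK', 'Right?']
```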
Elias Strehle