Splitting text file delimited by special character

Question

I have a text file, test.txt, which has the following data:

content content
more content
content conclusion
==========
content again
more of it
content conclusion
==========
content
content
contend done
==========

I would like to get a list of chunks delimited by ==========.

For the above example, I expect something like this:

foo = ["content content\more content\content conclusion",
       "content again\more of it\content conclusion",
       "content\content\contend done"]

Also, I would appreciate it if someone could share a general process for performing this operation (if any).

Inspired by: Splitting large text file on every blank line

`open(...).read().split('==========')` – furas Jan 04 '17 at 09:50
Try removing all `[\r]\n`s and split on your delimiter. – Nander Speerstra Jan 04 '17 at 09:50

score 1 · Answer 1 · answered Jan 04 '17 at 09:55

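import re

# Sample text; in practice this would come from open('test.txt').read()
y="""content content
more content
content conclusion
==========
content again
more of it
content conclusion
==========
content
content
contend done
=========="""
# Capture the text between the start (or a run of ten '=') and the next
# run of ten '=' (or the end), trimming the surrounding newlines.
x=re.compile(r"(?:^|(?<=={10}))\n*([\s\S]+?)\n*(?=={10}|$)")
print(re.findall(x, y))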

Output:

['content content\nmore content\ncontent conclusion', 'content again\nmore of it\ncontent conclusion', 'content\ncontent\ncontend done']
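
Since the question also asks for a general process, here is a minimal regex-free sketch that walks the input line by line, so the whole file never has to be held in memory as one string. It assumes the delimiter sits on a line of its own; `iter_chunks` is a hypothetical helper name, not something from the question.

```python
import io


def iter_chunks(lines, delimiter="=" * 10):
    """Yield each block of lines found between delimiter lines."""
    current = []
    for line in lines:
        line = line.rstrip("\r\n")  # drop the line ending only
        if line == delimiter:
            if current:  # skip empty blocks between adjacent delimiters
                yield "\n".join(current)
            current = []
        else:
            current.append(line)
    if current:  # text after the last delimiter, if any
        yield "\n".join(current)


# io.StringIO stands in for an open file here (assumed sample data).
sample = io.StringIO(
    "content content\nmore content\n==========\ncontent again\n==========\n"
)
print(list(iter_chunks(sample)))
```

With the real file, something like `with open('test.txt') as f: foo = list(iter_chunks(f))` should give the same kind of list.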

answered Jan 04 '17 at 09:55

vks

This works! Thank you for your time and effort. – Kshitij Saraogi Jan 04 '17 at 10:04
Why downvoted ? – vks Jan 04 '17 at 11:33

Mazdak · Answer 2 · 2017-01-04T10:18:34.603

score 0

You can use a regular expression to split the file contents on runs of 3 or more = characters, then replace the newlines with backslashes:

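import re

file_name = 'test.txt'  # the file from the question

with open(file_name) as f:
    # Split on runs of 3 or more '=', trim each chunk, join its lines with a backslash
    my_list = [chunk.strip().replace('\n', '\\') for chunk in re.split(r'={3,}', f.read())]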

If you know the exact number of equal signs, you can just use the string split method:

N = 5 # this is an example; use the exact number of '=' in your delimiter (10 in the question's file)
with open(file_name) as f:
    my_list = [chunk.strip().replace('\n', '\\') for chunk in f.read().split('=' * N)]

Also note that the backslash is the escape character in Python string literals: it changes the meaning of the character that follows it, so that character is no longer interpreted literally. A backslash-joined string is therefore easy to misread, and its repr shows every backslash doubled.
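
A small illustration of that point, using made-up sample data: the joined string contains a single backslash, but its repr shows two.

```python
joined = "a\nb".replace("\n", "\\")

print(joined)        # a\b
print(repr(joined))  # 'a\\b'
print(len(joined))   # 3
```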

Thus it's better to separate the lines with another delimiter:

N = 5 # this is an example; use the exact number of '=' in your delimiter (10 in the question's file)
with open(file_name) as f:
    my_list = [chunk.strip().replace('\n', '/') for chunk in f.read().split('=' * N)]
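
Empty strings show up in the result whenever two delimiters end up adjacent (for example when `N` is smaller than the actual run of `=`) or when the text ends with a delimiter. A minimal sketch that drops them, assuming any run of 3 or more `=` counts as a delimiter; `split_chunks` and its `line_sep` parameter are hypothetical names used only for illustration.

```python
import re


def split_chunks(text, line_sep="/"):
    """Split text on runs of 3 or more '=' and drop empty chunks."""
    stripped = (chunk.strip() for chunk in re.split(r"={3,}", text))
    # Keep only non-empty chunks, then join each chunk's lines with line_sep.
    return [chunk.replace("\n", line_sep) for chunk in stripped if chunk]


sample = "a\nb\n==========\nc\nd\n==========\n"
print(split_chunks(sample))  # ['a/b', 'c/d']
```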

edited Jan 04 '17 at 10:18

answered Jan 04 '17 at 09:50

Mazdak

I get a different output though : `my_list = ['content content\\more content\\content conclusion\\', '', '\\content again\\more of it\\content conclusion\\', '', '\\content\\content\\contend done\\', '', '\\']` – Kshitij Saraogi Jan 04 '17 at 10:08
@KshitijSaraogi Checkout the update. – Mazdak Jan 04 '17 at 10:19
I will still need to refine the output. `my_list=['content content/more content/content conclusion', '', 'asdasd #92012 blaablaa 30 70/content again/more of it/content conclusion', '', 'asdasd #299 yadayada 60 40/content/content/contend done', '', '']` – Kshitij Saraogi Jan 04 '17 at 10:32
