I have a text file, test.txt, which has the following data:

content content
more content
content conclusion
==========
content again
more of it
content conclusion
==========
content
content
contend done
==========

I would like to get a list of chunks delimited by ==========.

For the above example, I expect something like this:

foo = ["content content\more content\content conclusion",
       "content again\more of it\content conclusion",
       "content\content\contend done"]

Also, I would appreciate it if someone could share a general process for performing this operation (if any).

Inspired by: Splitting large text file on every blank line

Kshitij Saraogi

2 Answers

import re

y = """content content
more content
content conclusion
==========
content again
more of it
content conclusion
==========
content
content
contend done
=========="""
# Capture each block of text that sits between runs of ten '=' characters
x = re.compile(r"(?:^|(?<=={10}))\n*([\s\S]+?)\n*(?=={10}|$)")
print(re.findall(x, y))

Output:

['content content\nmore content\ncontent conclusion', 'content again\nmore of it\ncontent conclusion', 'content\ncontent\ncontend done']
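For readers who find the one-line pattern hard to parse, the same regular expression can be written with re.VERBOSE so that each part carries a comment. This is only a sketch for illustration: the names SAMPLE and CHUNK_RE are made up here, and it assumes the delimiter is exactly ten = characters, as in the question.

```python
import re

SAMPLE = """content content
more content
content conclusion
==========
content again
more of it
content conclusion
==========
content
content
contend done
=========="""

CHUNK_RE = re.compile(
    r"""
    (?:^|(?<=={10}))   # start of the text, or just after a ten-character delimiter
    \n*                # skip newlines before the chunk
    ([\s\S]+?)         # the chunk itself, matched as short as possible
    \n*                # skip newlines after the chunk
    (?=={10}|$)        # stop before the next delimiter or at the end of the text
    """,
    re.VERBOSE,
)

chunks = CHUNK_RE.findall(SAMPLE)
print(chunks)
```

The lazy quantifier in the capture group is what keeps each match from running past the next delimiter.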

vks

You can use a regular expression to split the file contents on runs of three or more = characters, then replace the newlines with backslashes:

import re

with open(file_name) as f:
    my_list = [chunk.strip().replace('\n', '\\') for chunk in re.split(r'={3,}', f.read())]
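Note that when the text ends with a delimiter, as test.txt does, re.split also returns whatever follows the last delimiter, which is empty after stripping. A minimal sketch of filtering that out, using a short in-memory string (made up for illustration) instead of a file:

```python
import re

text = "first\nchunk\n==========\nsecond\nchunk\n==========\n"

# Splitting on runs of three or more "=" leaves an empty piece after the
# final delimiter, so empty pieces are dropped once whitespace is stripped.
parts = [chunk.strip() for chunk in re.split(r"={3,}", text)]
chunks = [part for part in parts if part]

print(parts)
print(chunks)
```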

If you know the exact number of equals signs, you can just use the string split method:

N = 5 # this is an example
with open(file_name) as f:
    my_list = [chunk.strip().replace('\n', '\\') for chunk in f.read().split('=' * N)]
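The value of N matters here: if it is smaller than the actual run of equals signs, each delimiter is cut more than once and empty strings appear between the chunks. The sketch below illustrates this on a tiny in-memory string; split_on is a hypothetical helper introduced only for the comparison.

```python
text = "a\n==========\nb\n==========\n"

def split_on(text, n):
    """Split on exactly n '=' characters and strip each piece."""
    return [piece.strip() for piece in text.split("=" * n)]

too_short = split_on(text, 5)   # each ten-character run is cut twice
exact = split_on(text, 10)      # one cut per delimiter
cleaned = [piece for piece in exact if piece]  # drop the empty tail

print(too_short)
print(exact)
print(cleaned)
```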

Also note that the backslash is the escape character in Python string literals: a backslash combines with the character after it to form an escape sequence (for example \n or \t), so that character loses its literal meaning.

Thus it's better to separate the lines with another delimiter:

N = 5 # this is an example
with open(file_name) as f:
    my_list = [chunk.strip().replace('\n', '/') for chunk in f.read().split('=' * N)]
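As a general process for larger files, the chunks can also be collected line by line instead of reading the whole file at once. The following is a sketch under stated assumptions rather than part of the approach above: iter_chunks is a hypothetical helper, and the delimiter is assumed to be a line consisting of exactly ten = characters.

```python
import io

def iter_chunks(lines, delimiter="=========="):
    """Yield each non-empty block of lines found between delimiter lines."""
    buffer = []
    for line in lines:
        line = line.rstrip("\n")
        if line == delimiter:
            if buffer:  # skip empty blocks between adjacent delimiters
                yield "\n".join(buffer)
            buffer = []
        else:
            buffer.append(line)
    if buffer:  # text after the last delimiter, if any
        yield "\n".join(buffer)

# io.StringIO stands in for an open file; any iterable of lines would work.
demo = io.StringIO(
    "content content\nmore content\n==========\ncontent again\n==========\n"
)
chunks = list(iter_chunks(demo))
print(chunks)
```

Passing an open file object (for example the f from a with open(...) block) in place of demo should behave the same way, since files iterate line by line.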
Mazdak
  • I get a different output though : `my_list = ['content content\\more content\\content conclusion\\', '', '\\content again\\more of it\\content conclusion\\', '', '\\content\\content\\contend done\\', '', '\\']` – Kshitij Saraogi Jan 04 '17 at 10:08
  • @KshitijSaraogi Checkout the update. – Mazdak Jan 04 '17 at 10:19
  • I will still need to refine the output. `my_list=['content content/more content/content conclusion', '', 'asdasd #92012 blaablaa 30 70/content again/more of it/content conclusion', '', 'asdasd #299 yadayada 60 40/content/content/contend done', '', '']` – Kshitij Saraogi Jan 04 '17 at 10:32