Using python to separate a long text file into multiple files based on hyphen line separators?

Question

I am trying to split a single long text file into multiple files. Each section that needs to go into its own file is separated by hyphen lines that look something like this:

     This is section of some sample text
        that says something.
        
        2---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        
        This says something else
        
        3---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    
    Maybe this says something eles
    
    4---------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------

I have started an attempt in Python without much success. I considered using the split function, but most examples I find for it revolve around len rather than regex-style patterns. The code below only generates one large file.

with open ('someName.txt','r') as fo:

    start=1
    cntr=0
    for x in fo.read().split("\n"):
        if x=='---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------':
            start = 1
            cntr += 1
            continue
        with open (str(cntr)+'.txt','a+') as opf:
            if not start:
                x = '\n'+x
            opf.write(x)
            start = 0

Most of your `---` lines have leading whitespace, so `x == '---...-'` won't be true for them. (I'm assuming you added the numbers for the question, but if not, you need to match against them as well.) — chepner, Mar 15 '22 at 00:45
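To illustrate the point in this comment, here is a minimal sketch of why the equality test never fires. The line and separator below are hypothetical (40 hyphens rather than the real length), and the normalisation step assumes a separator is optional digits followed only by hyphens.

```python
# Hypothetical separator line, shaped like the ones in the question.
line = "        2" + "-" * 40
separator = "-" * 40

# False: the leading spaces and the digit are part of the line.
print(line == separator)

# Assumption: a separator is optional digits followed only by hyphens,
# so strip surrounding whitespace and then any leading digits.
core = line.strip().lstrip("0123456789")
print(core == separator)
```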

Alexander · Accepted Answer · 2022-03-17T05:26:01.317

1

You might get better results by switching the conditional from == to in. That way, the condition still passes if the line you are testing has any leading characters. For example, below I changed x=='-----...' to '-----...' in x; the change is at the very end of the long string of hyphens.

with open ('someName.txt','r') as fo:

    start=1
    cntr=0
    for x in fo.read().split("\n"):
        if ('-----------------------------------------------------'
            '-----------------------------------------------------'
            '-----------------------------------------------------'
            '------------------------------------------------') in x:
            start = 1
            cntr += 1
            continue
        with open (str(cntr)+'.txt','a+') as opf:
            if not start:
                x = '\n'+x
            opf.write(x)
            start = 0
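A possible variation on the loop above, offered only as a sketch: because the files are opened in 'a+' mode, running the script twice appends to the output of the first run. One way around that is to group the lines in memory first and write each file once. The function names, the out_dir parameter and the min_hyphens threshold of 10 are my own assumptions, not part of the original code, and unlike the loop above this version skips sections that contain only blank lines.

```python
import os


def split_on_separator_lines(text, min_hyphens=10):
    """Return the sections of text found between separator lines."""
    marker = "-" * min_hyphens  # assumed minimum run of hyphens
    groups, current = [], []
    for line in text.split("\n"):
        if marker in line:      # same substring test as in the loop above
            groups.append(current)
            current = []
        else:
            current.append(line)
    groups.append(current)
    # Join each group and drop groups that hold only blank lines.
    return ["\n".join(g) for g in groups if any(part.strip() for part in g)]


def write_sections(sections, out_dir="."):
    """Write each section to 0.txt, 1.txt, ... overwriting old files."""
    for number, body in enumerate(sections):
        path = os.path.join(out_dir, f"{number}.txt")
        with open(path, "w") as out:
            out.write(body)
```

Something like `write_sections(split_on_separator_lines(open('someName.txt').read()))` would then replace the whole loop.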

An alternative solution would be to use regular expressions. For example...

import re

with open('someName.txt', 'rt') as fo:
    counter = 0
    pattern = re.compile(r'--+')  # this is the regex pattern
    for group in re.split(pattern, fo.read()):
        # the re.split function used in the loop splits text by the pattern
        with open(str(counter)+'.txt','a+') as opf:
            opf.write(group)
        counter += 1
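Note that r'--+' matches any run of two or more hyphens, so it also splits on a "--" inside ordinary text, and it leaves the digit prefix (the "9" or "10") at the end of the preceding section. A stricter pattern could match the whole separator line instead. This is only a sketch: the pattern assumes a separator is a line of optional whitespace, optional digits and at least ten hyphens, and the threshold of ten and the name split_sections are my own choices.

```python
import re

# Assumption: a separator is a whole line made of optional whitespace,
# optional digits, then at least ten hyphens. Ten is an arbitrary cut-off.
SEPARATOR = re.compile(r"^[ \t]*\d*-{10,}[ \t]*$", re.MULTILINE)


def split_sections(text):
    """Split text on separator lines and drop empty pieces."""
    pieces = SEPARATOR.split(text)
    # Trim the newlines left on either side of each removed separator.
    return [piece.strip("\n") for piece in pieces if piece.strip()]


sample = (
    "alpha\n"
    "9" + "-" * 20 + "\n"
    "beta\n"
    "10" + "-" * 19 + "\n"
    "gamma -- not a separator\n"
)
print(split_sections(sample))
```

Because the digits are part of the match, single- and double-digit separators are handled the same way regardless of how many hyphens follow.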

edited Mar 17 '22 at 05:26

answered Mar 15 '22 at 02:24

Alexander



This is working much better! Now the only issues I'm encountering are for the hyphen line separators that are preceded by double or triple digits ie. '9--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- vs 10---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------' Hyphen count then -1 – DataMiner_NLP Mar 15 '22 at 08:32
I'm not sure what you mean... can you provide a larger sample string? – Alexander Mar 15 '22 at 23:06

`Some long rambling text about something 9--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Something else 10-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------` See above where double digit is used,its one less hypen. – DataMiner_NLP Mar 16 '22 at 13:12

Unfortunately this code block is losing its formatting. The format is as in the top example provided, where the hyphen lines are actually below the text, not inline as this makes it appear. – DataMiner_NLP Mar 16 '22 at 13:15
I updated the answer... let me know if you're still having trouble – Alexander Mar 17 '22 at 05:27
