How to efficiently split a text file according to certain characters?

Question

I have recently started learning Python 3, solely to improve efficiency at work, so this may be an extremely basic question.

I know that for strings we can use str.split to split a string into pieces at a given character.

But how might I go about the following?

Say a file bigfile.txt contains lines like these:

some intro lines xxxxxx
sdafiefisfhsaifdijsdjsia
dsafdsifdsiod

\item 12478621376321748324
sdfasfsdfafda

\item 23847328412834723
uduhfavfduhfu
sduhfhaiuesfhseuif
lots and other lines


\item 328347848732
pewprpewposdp
everthing up to and inclued this line
and the blank line too

some end lines dsahudfuha
dsfdsfdsf

What's of interest is each line starting with \item xxxxx together with the lines that follow it, up to (but not including) the next \item xxxxx.

How can I efficiently split bigfile.txt so that I get the following?

bigfile_part1.txt which contains

\item 12478621376321748324
sdfasfsdfafda

bigfile_part2.txt which contains

\item 23847328412834723
uduhfavfduhfu
sduhfhaiuesfhseuif
lots and other lines

bigfile_part3.txt which contains

\item 328347848732
pewprpewposdp
everthing up to and inclued this line
and the blank line too

The intro lines and the end lines should be ignored.

Moreover, how can I apply this function to a batch of files, say

bigfile2.txt
bigfile3.txt
bigfile4.txt

in exactly the same way.

try using [regular expressions](https://docs.python.org/3.8/library/re.html) — Hadrian, Aug 20 '20 at 17:53

tdelaney · Answer 1 · 2020-08-21T15:49:29.353

You can use itertools.groupby to carve up the file. groupby creates subiterators whenever a condition changes. In your case that's whether a line starts with "\item ".
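As a quick illustration of how groupby behaves here, this is a minimal sketch on a hard-coded list of lines. The sample lines are made up for the demo and are not taken from the real file.

```python
import itertools

# Made-up sample lines, only for this demo
lines = [
    "intro\n",
    "\\item 1\n",
    "aaa\n",
    "bbb\n",
    "\\item 2\n",
    "ccc\n",
]

# groupby starts a new (key, group) pair each time the key function's
# result changes from one line to the next
groups = [
    (is_item, list(group))
    for is_item, group in itertools.groupby(
        lines, key=lambda line: line.startswith("\\item ")
    )
]

for is_item, group in groups:
    print(is_item, group)
```

The key alternates between False and True, so each \item line lands in its own group, followed by a group holding the lines that belong to it. The same idea applied to the file itself: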

import itertools

records = []
record = None

# "\\item " is a backslash followed by "item "; the backslash is escaped
# so Python does not treat it as the start of an escape sequence
with open('thefile') as infile:
    for key, subiter in itertools.groupby(
            infile, lambda line: line.startswith("\\item ")):
        if key:
            # in a \item group, which has 1 line
            item_id = next(subiter).split()[1]
            record = {"item_id": item_id}
        else:
            # in the value subgroup
            if record:
                record["values"] = [line.strip() for line in subiter]
                records.append(record)

for record in records:
    print(record)

As for processing multiple files, you could put that into a function to be called once per file. Then it's a question of getting the file list, perhaps with glob.glob("some/path/big*.txt").
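A rough sketch of that idea, which also writes the parts to files as the question asks. The function name split_items and the `<name>_part<N>` output naming are assumptions made for illustration. Like the code above, it does not drop the end lines; they stay in the last part.

```python
import glob
import itertools
import os


def split_items(path):
    """Write each \\item block of `path` to <name>_part<N><ext>; return the new paths."""
    root, ext = os.path.splitext(path)
    parts = []  # one list of lines per \item block
    with open(path) as infile:
        for is_item, group in itertools.groupby(
            infile, key=lambda line: line.startswith("\\item ")
        ):
            if is_item:
                # adjacent \item lines each start their own part
                parts.extend([line] for line in group)
            elif parts:
                # lines after an \item line belong to the latest part
                parts[-1].extend(group)
            # lines before the first \item are skipped

    written = []
    for number, part in enumerate(parts, start=1):
        out_path = f"{root}_part{number}{ext}"
        with open(out_path, "w") as outfile:
            outfile.writelines(part)
        written.append(out_path)
    return written


if __name__ == "__main__":
    # hypothetical location, taken from the pattern mentioned above
    for path in sorted(glob.glob("some/path/big*.txt")):
        print(path, "->", split_items(path))
```

Note that on a second run the pattern big*.txt would also match the generated part files, so a narrower pattern or a separate output directory may be preferable.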

Tried to spot and fix some errors, but your code won't run...sadly... — CasperYC, Aug 21 '20 at 06:40

score 1 · Answer 2 · answered Aug 20 '20 at 18:12

Another approach is to split on blank lines, i.e. runs of two or more newline characters:

import re

# raw string, so the backslash in \item is kept literally
text = r"""some intro lines xxxxxx
sdafiefisfhsaifdijsdjsia
dsafdsifdsiod

\item 12478621376321748324
sdfasfsdfafda
...
"""

# split on runs of two or more newline characters (blank lines)
for i, j in enumerate(re.split(r'\n{2,}', text)):
    # keep only the chunks that begin with \item
    if j.startswith(r"\item"):
        print(f"bigfile{i}.txt", j, sep="\n")  # dump to file here

bigfile1.txt
\item 12478621376321748324
sdfasfsdfafda

bigfile2.txt
\item 23847328412834723
uduhfavfduhfu
sduhfhaiuesfhseuif
lots and other lines

bigfile3.txt
\item 328347848732
pewprpewposdp
everthing up to and inclued this line
and the blank line too
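To actually dump the chunks to files, something along these lines might work. It is only a sketch: the function name split_on_blank_lines and the `<name>_part<N>` naming are assumptions, the whole file is read into memory, and an item that contains a blank line in the middle would be cut short at that blank line.

```python
import re
from pathlib import Path


def split_on_blank_lines(path):
    """Write every blank-line-separated chunk that starts with \\item to its own file."""
    source = Path(path)
    text = source.read_text()
    # keep only the chunks that begin with \item, numbered from 1
    chunks = (
        chunk
        for chunk in re.split(r"\n{2,}", text)
        if chunk.startswith("\\item")
    )
    written = []
    for number, chunk in enumerate(chunks, start=1):
        target = source.with_name(f"{source.stem}_part{number}{source.suffix}")
        # re.split removed the separating newlines, so add one back
        target.write_text(chunk + "\n")
        written.append(target)
    return written
```

Calling split_on_blank_lines("bigfile.txt") would then produce bigfile_part1.txt, bigfile_part2.txt and so on next to the input file. The intro and end lines are dropped because their chunks do not start with \item.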

Sarwagya · Accepted Answer · 2020-08-22T17:57:47.390


Since it's a big file, instead of reading the entire file into a string, let's read it line by line.

import sys


def parseFromFile(filepath):
    parsedListFromFile = []
    with open(filepath) as fp:
        for line in fp:  # reads one line at a time
            if "\\item" in line:
                # a new item starts here; keep \item and drop any text before it
                parsedListFromFile.append("\\item" + line.split("\\item")[-1])
            elif parsedListFromFile:
                # no \item in this line: it continues the current item
                parsedListFromFile[-1] += line
            # lines before the first \item are skipped
    # write each item of parsedListFromFile to its own file
    for index, item in enumerate(parsedListFromFile):
        with open(filepath + str(index) + ".txt", 'w') as out:
            out.write(item + '\n')


def main():
    # assuming you run script like this: pythonsplit.py myfile1.txt myfile2.txt ...
    paths = sys.argv[1:]  # this gets all cli args after pythonsplit.py
    for path in paths:
        parseFromFile(path)  # call function for each file


if __name__ == "__main__":
    main()

* This assumes a line contains at most one \item.
* This doesn't ignore the end lines. You can add an if to skip them, or just manually remove them from the last file.
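One possible way to add that if, sketched under a loud assumption: an item ends at the first blank line after its \item line, and that blank line is kept. This matches the sample in the question but may not hold for other files. The function name split_streaming is made up. Unlike the code above, it only treats lines that start with \item as item starts, and it writes each part while reading rather than building a list of every item in memory.

```python
def split_streaming(filepath):
    """Write each \\item block to its own file while reading line by line."""
    written = []
    out = None  # file object of the item currently being written, if any
    try:
        with open(filepath) as infile:
            for line in infile:
                if line.startswith("\\item"):
                    # a new item starts: close the previous part, open the next
                    if out is not None:
                        out.close()
                    out_path = f"{filepath}{len(written)}.txt"
                    out = open(out_path, "w")
                    written.append(out_path)
                if out is not None:
                    out.write(line)
                    if not line.strip():
                        # ASSUMPTION: a blank line ends the current item, so
                        # later lines are ignored until the next \item
                        out.close()
                        out = None
    finally:
        if out is not None:
            out.close()
    return written
```

The output names follow the same filepath + index + ".txt" pattern as above, so main() could call split_streaming(path) in place of parseFromFile(path).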

edited Aug 22 '20 at 17:57

answered Aug 20 '20 at 18:14

Sarwagya


Thanks. Your code came closest to what I intended. Is there a way to keep `\item` in the final files? Right now they are removed, which is how `str.split` works (I just noticed that in my new lesson). Also, how can I use it as a function in a `python` file so I can run it like `pythonsplit.py myfile1.txt`? – CasperYC Aug 21 '20 at 06:32
Look at the updated code. When [running a Python script from the CLI](https://stackoverflow.com/questions/38734521/python-2-and-python-3-running-in-command-prompt), the script begins from __main__(). The code I wrote just loops through all file paths passed in through the CLI and calls parseFromFile on each one of them :) – Sarwagya Aug 22 '20 at 17:39
*When running a Python script from the CLI, \_\_name\_\_ is set to "\_\_main\_\_", so the if condition calls main(). – Sarwagya Aug 22 '20 at 18:02
Perfect for my purpose! I now have enough to play around with!! Thanks! – CasperYC Aug 24 '20 at 19:21
