Extract Values between two strings in a text file using python

Question

Let's say I have a text file with the following content:

fdsjhgjhg
fdshkjhk
Start
Good Morning
Hello World
End
dashjkhjk
dsfjkhk

Now I need to write Python code that reads the text file and copies the contents between Start and End to another file.

I wrote the following code.

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer = []
keepCurrentSet = True
for line in inFile:
    buffer.append(line)
    if line.startswith("Start"):
        #---- starts a new data set
        if keepCurrentSet:
            outFile.write("".join(buffer))
        #now reset our state
        keepCurrentSet = False
        buffer = []
    elif line.startswith("End"):
        keepCurrentSet = True
inFile.close()
outFile.close()

I'm not getting the expected output; I'm just getting Start. What I want is all the lines between Start and End, excluding the Start and End lines themselves.

Are these text files large? – TerryA Sep 18 '13 at 06:14

inspectorG4dget · Accepted Answer · 2019-05-11T18:58:57.727

57

Just in case you have multiple "Start"s and "End"s in your text file, this will copy the data from all of the blocks together, excluding the "Start" and "End" lines themselves.

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "Start":
            copy = True
            continue
        elif line.strip() == "End":
            copy = False
            continue
        elif copy:
            outfile.write(line)

edited May 11 '19 at 18:58

answered Sep 18 '13 at 06:17

inspectorG4dget

Thanks for your response. I applied the same code to a real scenario and got the following error: D:\Python>Python.exe First.py Traceback (most recent call last): File "First.py", line 3, in <module> for line in infile: File "D:\Python\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4591: character maps to <undefined>. Can you help me out with this? – user2790219 Sep 18 '13 at 07:45
@user2790219: That's not an error with this code. If you could post the text file that you are using, someone might be able to help (I think you should make that a new question) – inspectorG4dget Sep 18 '13 at 14:19
This code will not include the strings "Start" and "End" just what is inside them. How would you include the perimeter strings? – johnnydrama Mar 02 '16 at 19:28
@johnnydrama: simply add the `outfile.write` line within the first two if blocks as well – inspectorG4dget Mar 02 '16 at 22:40
Simple and efficient, good inspiration for what I had to do. GO-GO Gadget Python! Thks, Gadget. – tisc0 Jan 05 '19 at 00:34
hi! If the file is large, you would not want to continue scanning it after "End", so I would suggest replacing `copy = False` below `elif line.strip() == "End":` by a simple `break` statement. – jeannej May 08 '19 at 14:35
That's a good observation. However, the presented code is meant to grab all the data from multiple instances of "Start" and "End". I've updated my answer to explicitly state that assumption – inspectorG4dget May 11 '19 at 19:00
Hello there inspectorG4dget. You said the code will 'excluding all the "Start"s and "End"s'. I tried it and it does just that. How can the code be modified to copy the lines with 'Start' and 'End'? – ASH Aug 12 '19 at 20:33
@asher: Include `outfile.write(line)` within the other if/elif blocks before `continue` – inspectorG4dget Aug 14 '19 at 17:59
By exchanging True and False, you can grab "header" and "footer" while dumping the intermediate content. :-) – PatrickT Apr 27 '20 at 07:41
Why do you need `continue`s? – user5054 Apr 06 '21 at 15:37
@user5054: Actually, I don't. I guess I forgot to remove them when I updated my code – inspectorG4dget Apr 06 '21 at 17:08
What if you want just the first block between "Start" and "End"? – DPdl Feb 09 '22 at 00:00
@DPdl: put a `break` inside the first `elif` – inspectorG4dget Feb 09 '22 at 02:59
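
The follow-up comments above ask for three variations: keeping the marker lines, stopping after the first block, and coping with a decode error. A minimal sketch that folds these into one function follows. The name `copy_between` and its parameters (`include_markers`, `first_only`, `encoding`) are hypothetical and not part of the original answer, and the right `encoding` for any given file is an assumption you would need to verify yourself.

```python
def copy_between(in_path, out_path, start="Start", end="End",
                 include_markers=False, first_only=False, encoding=None):
    """Copy lines found between `start` and `end` marker lines.

    Hypothetical helper: the name and keyword arguments are illustrative only.
    encoding=None keeps Python's platform default; pass e.g. "utf-8" if the
    default codec cannot decode the file.
    """
    with open(in_path, encoding=encoding) as infile, \
         open(out_path, "w", encoding=encoding) as outfile:
        copy = False
        for line in infile:
            stripped = line.strip()
            if stripped == start:
                copy = True
                if include_markers:
                    outfile.write(line)
            elif stripped == end:
                if copy and include_markers:
                    outfile.write(line)
                if copy and first_only:
                    break  # stop scanning once the first block is closed
                copy = False
            elif copy:
                outfile.write(line)
```

For example, `copy_between("data.txt", "result.txt", include_markers=True)` would keep the "Start" and "End" lines, while `first_only=True` avoids scanning the rest of a large file.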

score 8 · Answer 2 · answered Sep 18 '13 at 06:18

If the text files aren't necessarily large, you can get the whole content of the file then use regular expressions:

import re
with open('data.txt') as myfile:
    content = myfile.read()

text = re.search(r'Start\n.*?End', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
    myfile2.write(text)

answered Sep 18 '13 at 06:18

TerryA

Regex is way overkill for this problem. Also, you don't handle the case where one of the lines is `Ender's Game` (the `End` in the regex needs a newline). Further, the usage of `\n` is not cross-platform, as Windows uses `\r\n` for line endings – inspectorG4dget Sep 18 '13 at 06:24
@inspectorG4dget From my experience, regular expressions are never overkill. If you're good with a dialect, it will have predictable behavior. Using them helps to maintain your skills, which is good because they are robust enough to handle nearly every text operation. Still, your answer is elegant and rocks +1. – Jonathan Komar Nov 23 '16 at 09:15
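
The objections raised in the comments above (a line such as `Ender's Game`, and Windows line endings) can be addressed while still using a regular expression. The following is a sketch, not the answer author's code: it anchors both markers to whole lines with `re.MULTILINE`, tolerates an optional `\r`, and captures only the text between the markers. The names `BLOCK_RE` and `extract_blocks` are hypothetical.

```python
import re

# Each marker must occupy a whole line; (.*?) lazily captures what lies between.
BLOCK_RE = re.compile(
    r"^Start[ \t]*\r?\n(.*?)^End[ \t]*\r?$",
    re.DOTALL | re.MULTILINE,
)


def extract_blocks(content):
    """Return the text of every Start/End block in `content` as a list."""
    return BLOCK_RE.findall(content)
```

This still reads the whole file into memory first, so the size caveat from the answer applies unchanged.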

score 4 · Answer 3 · answered Sep 18 '13 at 06:18

I'm not a Python expert, but this code should do the job.

inFile = open("data.txt")
outFile = open("result.txt", "w")
keepCurrentSet = False
for line in inFile:
    if line.startswith("End"):
        keepCurrentSet = False

    if keepCurrentSet:
        outFile.write(line)

    if line.startswith("Start"):
        keepCurrentSet = True
inFile.close()
outFile.close()

falsetru · Answer 4 · 2013-09-18T07:00:16.830

Using itertools.dropwhile, itertools.takewhile, itertools.islice:

import itertools

with open('data.txt') as f, open('result.txt', 'w') as fout:
    it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
    it = itertools.islice(it, 1, None)
    it = itertools.takewhile(lambda line: line.strip() != 'End', it)
    fout.writelines(it)

UPDATE: As inspectorG4dget commented, the code above copies only the first block. To copy multiple blocks, use the following:

import itertools

with open('data.txt', 'r') as f, open('result.txt', 'w') as fout:
    while True:
        it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
        if next(it, None) is None: break
        fout.writelines(itertools.takewhile(lambda line: line.strip() != 'End', it))

Two issues: (1) `\n` is not cross-platform - Windows uses `\r\n`. (2) This doesn't handle multiple blocks at all - it only copies over the first block — inspectorG4dget, Sep 18 '13 at 06:22

score 2 · Answer 5 · answered Sep 18 '13 at 06:19

Move the outFile.write call into the 2nd if:

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer = []
for line in inFile:
    if line.startswith("Start"):
        buffer = ['']
    elif line.startswith("End"):
        outFile.write("".join(buffer))
        buffer = []
    elif buffer:
        buffer.append(line)
inFile.close()
outFile.close()

score 1 · Answer 6 · answered Sep 18 '13 at 06:49

import re

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer1 = ""
for line in inFile:
    buffer1 = buffer1 + line

# No literal spaces around the group; re.DOTALL lets "." span newlines
buffer1 = re.findall(r"(?<=Start\n)(.*?)(?=End)", buffer1, re.DOTALL)
outFile.write("".join(buffer1))
inFile.close()
outFile.close()

answered Sep 18 '13 at 06:49

Gaurav

This will fail in cases where the lines `Starting awesome sentence` and `Ender's Game` exist in the file – inspectorG4dget Sep 18 '13 at 06:59

score 1 · Answer 7 · answered Sep 18 '13 at 06:51

I would handle it like this:

inFile = open("data.txt")
outFile = open("result.txt", "w")

data = inFile.readlines()

outFile.write("".join(data[data.index('Start\n')+1:data.index('End\n')]))
inFile.close()
outFile.close()

answered Sep 18 '13 at 06:51

user2787688

Very inefficient use of memory in the worst case, and doesn't handle multiple blocks – inspectorG4dget Sep 18 '13 at 06:58
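
Both concerns in the comment above (memory use and multiple blocks) can be sketched with a generator that walks the input lazily and yields one block at a time. This is an illustrative alternative rather than a fix endorsed by the answer's author; `iter_blocks` is a hypothetical name.

```python
def iter_blocks(lines, start="Start", end="End"):
    """Yield each block between marker lines as a list of lines.

    `lines` can be any iterable of strings, for example an open file,
    so only the current block is held in memory.
    """
    block = None  # None means "not inside a block"
    for line in lines:
        stripped = line.strip()
        if stripped == start:
            block = []
        elif stripped == end and block is not None:
            yield block
            block = None
        elif block is not None:
            block.append(line)
```

A block that is opened but never closed is silently dropped in this sketch; whether that is the right behavior depends on the data.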

Gangadhar Kadam · Answer 8 · 2021-06-25T20:36:50.253

If you want to keep the start and end lines/keywords while extracting the lines between two strings, write the marker lines to the output as well.

Below is the code snippet that I used to extract SQL statements from a shell script:

def process_lines(in_filename, out_filename, start_kw, end_kw):
    try:
        inp = open(in_filename, 'r', encoding='utf-8', errors='ignore')
        out = open(out_filename, 'w+', encoding='utf-8', errors='ignore')
    except FileNotFoundError as err:
        print(f"File {in_filename} not found", err)
        raise
    except OSError as err:
        print(f"OS error occurred trying to open {in_filename}", err)
        raise
    except Exception as err:
        print(f"Unexpected error opening {in_filename} is",  repr(err))
        raise
    else:
        with inp, out:
            copy = False
            for line in inp:
                # first IF block to handle if the start and end on same line
                if line.lstrip().lower().startswith(start_kw) and line.rstrip().endswith(end_kw):
                    copy = True
                    if copy:  # keep the starts with keyword
                        out.write(line)
                    copy = False
                    continue
                elif line.lstrip().lower().startswith(start_kw):
                    copy = True
                    if copy:  # keep the starts with keyword
                        out.write(line)
                    continue
                elif line.rstrip().endswith(end_kw):
                    if copy:  # keep the ends with keyword
                        out.write(line)
                    copy = False
                    continue
                elif copy:
                    # write
                    out.write(line)


if __name__ == '__main__':
    infile = "/Users/testuser/Downloads/testdir/BTEQ_TEST.sh"
    outfile = f"{infile}.sql"
    statement_start_list = ['database', 'create', 'insert', 'delete', 'update', 'merge', 'delete']
    statement_end = ";"
    process_lines(infile, outfile, tuple(statement_start_list), statement_end)
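
The function above converts the keyword list to a tuple before calling it. That matters because `str.startswith` accepts either a single string or a tuple of strings, and returns True if any prefix in the tuple matches. A short illustration follows; the helper name `is_statement_start` and the sample keywords are made up for this example.

```python
def is_statement_start(line, keywords=("create", "insert", "update")):
    """Return True if `line` begins with any keyword, ignoring case and indentation."""
    # str.startswith accepts a tuple and tests each prefix in turn
    return line.lstrip().lower().startswith(keywords)
```

Passing a list instead of a tuple raises a `TypeError`, which is why the `tuple(statement_start_list)` conversion in the call is needed.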

score 0 · Answer 9 · answered Aug 19 '21 at 11:39

Files are iterators in Python, so this means you don't need to hold a "flag" variable to tell you what lines to write. You can simply use another loop when you reach the start line, and break it when you reach the end line:

with open("data.txt") as in_file, open("result.txt", 'w') as out_file:
    for line in in_file:
        if line.strip() == "Start":
            for line in in_file:
                if line.strip() == "End":
                    break
                out_file.write(line)

answered Aug 19 '21 at 11:39

Tomerikoo

What if it's the same keyword, and I want to extract everything in between the 2nd appearance of that word? – uniquegino Nov 29 '22 at 22:55
@uniquegino In that case you can add a "flag" variable to count the keyword and enter the second loop when the count satisfies your condition – Tomerikoo Nov 30 '22 at 10:22
Yes, the flag does work, thank you. I found a similar idea here that works well: https://sopython.com/canon/92/extract-text-from-a-file-between-two-markers/ – uniquegino Nov 30 '22 at 21:14
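
The counting idea from the comments above can be sketched as follows. This assumes one particular reading of the question: a single marker appears several times, and the wanted lines sit between its n-th occurrence and the next one. The function name `lines_after_nth_marker` and its parameters are hypothetical.

```python
def lines_after_nth_marker(lines, marker, n=2):
    """Return lines between the n-th and (n+1)-th occurrence of `marker`.

    Hypothetical helper; `lines` is any iterable of strings, such as a file.
    """
    seen = 0  # how many marker lines have been read so far
    collected = []
    for line in lines:
        if line.strip() == marker:
            seen += 1
            if seen > n:
                break  # the next occurrence closes the section
            continue
        if seen == n:
            collected.append(line)
    return collected
```

If the marker occurs exactly n times, this sketch returns everything from the n-th occurrence to the end of the input.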
