23

Let's say I have a text file with the content below:

fdsjhgjhg
fdshkjhk
Start
Good Morning
Hello World
End
dashjkhjk
dsfjkhk

Now I need to write Python code that reads the text file and copies the contents between Start and End to another file.

I wrote the following code.

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer = []
keepCurrentSet = True
for line in inFile:
    buffer.append(line)
    if line.startswith("Start"):
        #---- starts a new data set
        if keepCurrentSet:
            outFile.write("".join(buffer))
        #now reset our state
        keepCurrentSet = False
        buffer = []
    elif line.startswith("End"):
        keepCurrentSet = True
inFile.close()
outFile.close()

I'm not getting the desired output; I'm just getting Start. What I want is all the lines between Start and End, excluding Start and End themselves.

user2790219

9 Answers

57

Just in case you have multiple "Start"s and "End"s in your text file, this will copy the data from all of the blocks together, excluding all the "Start"s and "End"s.

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "Start":
            copy = True
            continue
        elif line.strip() == "End":
            copy = False
            continue
        elif copy:
            outfile.write(line)
inspectorG4dget
  • Thanks for your response. I applied the same code to a real scenario and got the following error: D:\Python>Python.exe First.py Traceback (most recent call last): File "First.py", line 3, in <module> for line in infile: File "D:\Python\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4591: character maps to <undefined> Can you help me out with this? – user2790219 Sep 18 '13 at 07:45
  • @user2790219: That's not an error with this code. If you could post the text file that you are using, someone might be able to help (I think you should make that a new question) – inspectorG4dget Sep 18 '13 at 14:19
  • This code will not include the strings "Start" and "End" just what is inside them. How would you include the perimeter strings? – johnnydrama Mar 02 '16 at 19:28
  • @johnnydrama: simply add the `outfile.write` line within the first two if blocks as well – inspectorG4dget Mar 02 '16 at 22:40
  • Simple and efficient, good inspiration for what I had to do. GO-GO Gadget Python! Thks, Gadget. – tisc0 Jan 05 '19 at 00:34
  • hi! If the file is large, you would not want to continue scanning it after "End", so I would suggest replacing `copy = False` below `elif line.strip() == "End":` by a simple `break` statement. – jeannej May 08 '19 at 14:35
  • That's a good observation. However, the presented code is meant to grab all the data from multiple instances of "Start" and "End". I've updated my answer to explicitly state that assumption – inspectorG4dget May 11 '19 at 19:00
  • Hello there inspectorG4dget. You said the code will 'excluding all the "Start"s and "End"s'. I tried it and it does just that. How can the code be modified to copy the lines with 'Start' and 'End'? – ASH Aug 12 '19 at 20:33
  • @asher: Include `outfile.write(line)` within the other if/elif blocks before `continue` – inspectorG4dget Aug 14 '19 at 17:59
  • By exchanging True and False, you can grab "header" and "footer" while dumping the intermediate content. :-) – PatrickT Apr 27 '20 at 07:41
  • Why do you need `continue`s? – user5054 Apr 06 '21 at 15:37
  • @user5054: Actually, I don't. I guess I forgot to remove them when I updated my code – inspectorG4dget Apr 06 '21 at 17:08
  • What if you want to get just the first block in between "Start" and "End"? – DPdl Feb 09 '22 at 00:00
  • @DPdl: put a `break` inside the first `elif` – inspectorG4dget Feb 09 '22 at 02:59
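The variations raised in the comments above (keeping the marker lines, stopping after the first block) can be folded into one function. This is only a minimal sketch: the helper name `copy_between` and the parameters `include_markers` and `first_only` are hypothetical and do not come from the answer.

```python
import io


def copy_between(infile, outfile, start="Start", end="End",
                 include_markers=False, first_only=False):
    """Copy the lines found between `start` and `end` marker lines.

    `infile` is any iterable of lines and `outfile` any object with a
    write() method, so open files and io.StringIO both work.
    (Hypothetical helper; the names are illustrative only.)
    """
    copy = False
    for line in infile:
        stripped = line.strip()
        if stripped == start:
            copy = True
            if include_markers:
                outfile.write(line)
        elif stripped == end:
            if copy:
                if include_markers:
                    outfile.write(line)
                if first_only:
                    break  # stop scanning after the first complete block
            copy = False
        elif copy:
            outfile.write(line)


# In-memory demonstration with data shaped like the question's sample
sample = "fdsjhgjhg\nStart\nGood Morning\nHello World\nEnd\ndashjkhjk\n"
result = io.StringIO()
copy_between(io.StringIO(sample), result)
print(result.getvalue(), end="")
```

With real files, passing an explicit `encoding=` argument to `open` (for example `encoding="utf-8"`, assuming that is how the file was written) may avoid the cp1252 decoding error reported in the comments.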
8

If the text files aren't necessarily large, you can get the whole content of the file then use regular expressions:

import re
with open('data.txt') as myfile:
    content = myfile.read()

text = re.search(r'Start\n.*?End', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
    myfile2.write(text)
TerryA
  • Regex is way overkill for this problem. Also, you don't handle the case where one of the lines is `Ender's Game` (the `End` in the regex needs a newline). Further, the usage of `\n` is not cross-platform, as windows uses `\r\n` for line endings – inspectorG4dget Sep 18 '13 at 06:24
  • @inspectorG4dget From my experience, regular expressions are never overkill. If you're good with a dialect, it will have predictable behavior. Using them helps to maintain your skills, which is good because they are robust enough to handle nearly every text operation. Still, your answer is elegant and rocks +1. – Jonathan Komar Nov 23 '16 at 09:15
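The `Ender's Game` objection from the comments can be addressed by anchoring both markers to whole lines and capturing only the text between them. The following is a sketch of one possible variant, not the answer author's code; the sample string is made up for illustration. It relies on Python's text mode translating `\r\n` to `\n` on read when a file is opened with the default `newline` setting, so the pattern only needs `\n`.

```python
import re

# Sample content shaped like the question's file, plus a line that merely
# begins with "End" (illustrative data, not from the original post)
content = (
    "fdsjhgjhg\n"
    "Start\n"
    "Good Morning\n"
    "Ender's Game\n"
    "Hello World\n"
    "End\n"
    "dashjkhjk\n"
)

# re.MULTILINE makes ^ and $ match at line boundaries, so only a line that is
# exactly "Start" or "End" counts as a marker.
# re.DOTALL lets "." cross newlines; the non-greedy group captures the block body.
pattern = re.compile(r"^Start\n(.*?)^End$", re.MULTILINE | re.DOTALL)

# findall returns one captured body per Start/End pair
blocks = pattern.findall(content)
print("".join(blocks), end="")
```

Because the markers sit outside the capture group, the extracted text excludes the `Start` and `End` lines, which is what the question asked for.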
4

I'm not a Python expert, but this code should do the job.

inFile = open("data.txt")
outFile = open("result.txt", "w")
keepCurrentSet = False
for line in inFile:
    if line.startswith("End"):
        keepCurrentSet = False

    if keepCurrentSet:
        outFile.write(line)

    if line.startswith("Start"):
        keepCurrentSet = True
inFile.close()
outFile.close()
Rafi Kamal
4

Using itertools.dropwhile, itertools.takewhile, itertools.islice:

import itertools

with open('data.txt') as f, open('result.txt', 'w') as fout:
    it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
    it = itertools.islice(it, 1, None)
    it = itertools.takewhile(lambda line: line.strip() != 'End', it)
    fout.writelines(it)

UPDATE: As inspectorG4dget commented, the code above copies only the first block. To copy multiple blocks, use the following:

import itertools

with open('data.txt', 'r') as f, open('result.txt', 'w') as fout:
    while True:
        it = itertools.dropwhile(lambda line: line.strip() != 'Start', f)
        if next(it, None) is None: break
        fout.writelines(itertools.takewhile(lambda line: line.strip() != 'End', it))
falsetru
  • Two issues: (1) `\n` is not cross-platform - Windows uses `\r\n`. (2) This doesn't handle multiple blocks at all - it only copies over the first block – inspectorG4dget Sep 18 '13 at 06:22
2

Move the outFile.write call into the 2nd if:

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer = []
for line in inFile:
    if line.startswith("Start"):
        buffer = ['']
    elif line.startswith("End"):
        outFile.write("".join(buffer))
        buffer = []
    elif buffer:
        buffer.append(line)
inFile.close()
outFile.close()
pts
1
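
import re

inFile = open("data.txt")
outFile = open("result.txt", "w")
buffer1 = ""
for line in inFile:
    buffer1 = buffer1 + line

# re.DOTALL lets "." match newlines, so a match can span several lines.
# The lookbehind/lookahead keep "Start" and "End" out of the result.
buffer1 = re.findall(r"(?<=Start\n)(.*?)(?=End)", buffer1, re.DOTALL)
outFile.write("".join(buffer1))
inFile.close()
outFile.close()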
Gaurav
1

I would handle it like this :

inFile = open("data.txt")
outFile = open("result.txt", "w")

data = inFile.readlines()

outFile.write("".join(data[data.index('Start\n')+1:data.index('End\n')]))
inFile.close()
outFile.close()
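The slicing approach above raises `ValueError` when either marker is missing, and it expects each marker line to end in exactly `\n`. A slightly more forgiving sketch follows; the helper name `lines_between` is hypothetical and the sample list is made up for illustration.

```python
def lines_between(lines, start="Start", end="End"):
    """Return the lines strictly between the first `start` line and the
    next `end` line, or an empty list if either marker is missing.
    (Hypothetical helper, not part of the answer above.)"""
    # Compare on stripped copies so trailing newlines or spaces don't matter
    stripped = [line.strip() for line in lines]
    try:
        first = stripped.index(start) + 1
        last = stripped.index(end, first)  # search only after the start marker
    except ValueError:
        return []
    return lines[first:last]


# The final "End" has no trailing newline, as can happen on a file's last line
data = ["fdshkjhk\n", "Start\n", "Good Morning\n", "Hello World\n", "End"]
print("".join(lines_between(data)), end="")
```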
0

If you want to keep the start and end lines/keywords while extracting the lines between two strings, the code snippet below is what I used to extract SQL statements from a shell script:

def process_lines(in_filename, out_filename, start_kw, end_kw):
    try:
        inp = open(in_filename, 'r', encoding='utf-8', errors='ignore')
        out = open(out_filename, 'w+', encoding='utf-8', errors='ignore')
    except FileNotFoundError as err:
        print(f"File {in_filename} not found", err)
        raise
    except OSError as err:
        print(f"OS error occurred trying to open {in_filename}", err)
        raise
    except Exception as err:
        print(f"Unexpected error opening {in_filename} is",  repr(err))
        raise
    else:
        with inp, out:
            copy = False
            for line in inp:
                # first IF block to handle if the start and end on same line
                if line.lstrip().lower().startswith(start_kw) and line.rstrip().endswith(end_kw):
                    copy = True
                    if copy:  # keep the starts with keyword
                        out.write(line)
                    copy = False
                    continue
                elif line.lstrip().lower().startswith(start_kw):
                    copy = True
                    if copy:  # keep the starts with keyword
                        out.write(line)
                    continue
                elif line.rstrip().endswith(end_kw):
                    if copy:  # keep the ends with keyword
                        out.write(line)
                    copy = False
                    continue
                elif copy:
                    # write
                    out.write(line)


if __name__ == '__main__':
    infile = "/Users/testuser/Downloads/testdir/BTEQ_TEST.sh"
    outfile = f"{infile}.sql"
    statement_start_list = ['database', 'create', 'insert', 'delete', 'update', 'merge']
    statement_end = ";"
    process_lines(infile, outfile, tuple(statement_start_list), statement_end)

Gangadhar Kadam
0

Files are iterators in Python, so this means you don't need to hold a "flag" variable to tell you what lines to write. You can simply use another loop when you reach the start line, and break it when you reach the end line:

with open("data.txt") as in_file, open("result.txt", 'w') as out_file:
    for line in in_file:
        if line.strip() == "Start":
            for line in in_file:
                if line.strip() == "End":
                    break
                out_file.write(line)
Tomerikoo
  • what if it's the same keyword, and I want to extract everything in between the 2nd appearance of that word – uniquegino Nov 29 '22 at 22:55
  • @uniquegino In that case you can add a "flag" variable to count the keyword and enter the second loop when the count satisfies your condition – Tomerikoo Nov 30 '22 at 10:22
  • Yes, the flag does work, thank you. I found a similar idea here that works well: https://sopython.com/canon/92/extract-text-from-a-file-between-two-markers/ – uniquegino Nov 30 '22 at 21:14
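The counting idea from the comments could look something like the sketch below. It assumes one reading of the request: yield the lines between the n-th and the (n+1)-th occurrence of the same keyword. The function name `between_occurrences` and the `MARK` keyword are hypothetical.

```python
def between_occurrences(lines, keyword, n):
    """Yield the lines between the n-th and (n+1)-th line equal to `keyword`
    (1-based). Both loops share one iterator, as in the answer above.
    (Hypothetical helper illustrating the counter idea.)"""
    it = iter(lines)
    count = 0
    for line in it:
        if line.strip() == keyword:
            count += 1
            if count == n:
                # Inner loop continues from where the outer loop stopped
                for inner in it:
                    if inner.strip() == keyword:
                        return
                    yield inner
                return


sample = ["MARK\n", "a\n", "MARK\n", "b\n", "c\n", "MARK\n", "d\n"]
print(list(between_occurrences(sample, "MARK", 2)))
```

Since an open file is also an iterator of lines, the same function should work when passed a file object instead of a list.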