18

I have a text file dwn.txt structured as:

date
downland

user 

date data1 date2
201102 foo bar 200 50
201101 foo bar 300 35

So the first six lines of the file are not needed.

I know I can open the file with

f = open('dwn.txt', 'rb')

How do I "split" this file starting at line 7 to EOF?

mkrieger1
Merlin

11 Answers

41
with open('dwn.txt') as f:
    for i in range(6):
        next(f)
    for line in f:
        process(line)

(In Python 2, use xrange instead of range, and f.next() instead of next(f).)
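One caveat with the loop above: `next(f)` raises `StopIteration` if the file has fewer than six lines. Here is a minimal sketch of a more forgiving variant. The helper name `lines_after_header` and the in-memory sample are assumptions for illustration, not part of the original answer; the default argument to `next()` absorbs a short file.

```python
import io

def lines_after_header(f, header_lines=6):
    """Yield the lines of an open file that follow the first `header_lines` lines."""
    for _ in range(header_lines):
        # The second argument keeps next() from raising StopIteration
        # when the file is shorter than the header.
        next(f, None)
    yield from f

# In-memory stand-in for dwn.txt, following the layout shown in the question.
sample = io.StringIO(
    "date\ndownland\n\nuser \n\ndate data1 date2\n"
    "201102 foo bar 200 50\n201101 foo bar 300 35\n"
)
data_lines = list(lines_after_header(sample))
```

With a real file you would pass the object returned by `open('dwn.txt')` instead of the `StringIO` stand-in.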

mkrieger1
John Machin
  • 2
    @user428862: `process(line)` is pseudocode for "insert your own code here to do whatever you want with `line`". What kind of code is "ur" code? – John Machin Feb 01 '11 at 23:06
11

Itertools answer!

from itertools import islice

with open('foo') as f:
    for line in islice(f, 6, None):
        print(line, end='')  # each line already ends with a newline
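To see what `islice` does here without touching a real file, the following sketch runs it over an in-memory stand-in (the `io.StringIO` sample and the name `rows` are assumptions for illustration). `islice(iterable, 6, None)` consumes and discards the first six items, then yields the rest until the iterator is exhausted.

```python
import io
from itertools import islice

# Six header lines followed by two data lines.
sample = io.StringIO("h1\nh2\nh3\nh4\nh5\nh6\nrow 1\nrow 2\n")

# start=6 skips six lines; stop=None means "run to the end of the file".
rows = [line.rstrip("\n") for line in islice(sample, 6, None)]
```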
Josh Lee
6

Python 3:

with open("file.txt","r") as f:
    for i in range(6):
        f.readline()
    for line in f:
        pass  # process lines 7 through the end here
KiteCoder
  • Basically I think you are pushing the 'cursor' forward 6 times: one for each of "list(range(6)) or [0, 1, 2, 3, 4, 5]". Hence line 7 is next. Then start processing. Clever if I understand correctly. – jouell Sep 07 '19 at 00:12
5
with open('test.txt', 'r') as fo:
    for i in range(6):
        next(fo)
    for line in fo:
        print(line.strip())
systempuntoout
3

In fact, to answer the question precisely as it was written:

How do I "split" this file starting at line 7 to EOF?

you can do the following.

If the file is not big:

with open('dwn.txt', 'rb+') as f:
    for i in range(6):
        print(f.readline())  # show each header line being dropped
    content = f.read()
    f.seek(0, 0)
    f.write(content)
    f.truncate()

If the file is very big:

with open('dwn.txt', 'rb+') as ahead, open('dwn.txt', 'rb+') as back:
    for i in range(6):
        print(ahead.readline())  # show each header line being dropped

    x = 100000  # chunk size in bytes
    chunk = ahead.read(x)
    while chunk:
        print(repr(chunk))
        back.write(chunk)
        chunk = ahead.read(x)
    back.truncate()

The call to truncate() is essential: it puts the end of the file where you asked for it. Without truncate(), a tail as long as the six removed lines would remain at the end of the file.
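As a self-contained check of the small-file variant, here is a sketch that wraps it in a function and runs it on a throwaway file. The function name `drop_first_lines` and the temporary file are assumptions for illustration only.

```python
import os
import tempfile

def drop_first_lines(path, n=6):
    """Remove the first n lines of the file at `path`, rewriting it in place."""
    with open(path, "rb+") as f:
        for _ in range(n):
            f.readline()      # skip one header line
        content = f.read()    # everything after the header
        f.seek(0)
        f.write(content)      # overwrite from the start of the file
        f.truncate()          # cut off the leftover tail

# Try it on a temporary file rather than on real data.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"1\n2\n3\n4\n5\n6\nkeep a\nkeep b\n")

drop_first_lines(demo_path)

with open(demo_path, "rb") as tmp:
    remaining = tmp.read()
os.remove(demo_path)
```

Commenting out the `f.truncate()` line and rerunning shows the leftover tail described above.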


The file must be opened in binary mode to avoid newline problems.

When Python reads '\r\n' in text mode, it converts it to '\n' (that's universal newline support, enabled by default in Python 3 text mode), so the chunk strings contain only '\n' even if the file contained '\r\n'.

If the file comes from a classic Macintosh system, it contains only CR = '\r' newlines before processing, but they will be changed to '\n' or '\r\n' (depending on the platform) when the file is rewritten on a non-Macintosh machine.

If the file comes from Linux, it contains only LF = '\n' newlines, which will be changed to '\r\n' on Windows (I don't know what happens to a Linux file processed on a Macintosh). The reason is that Windows, in text mode, writes '\r\n' for every '\n' it is asked to write. Consequently, more characters would be written than were read, so the offset between the ahead and back file positions would shrink and the rewrite would be corrupted.

HTML sources also contain a mix of newline styles.

That's why it's always preferable to open files in binary mode when they are processed this way.
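The difference between the two modes is easy to observe. The sketch below (temporary file and variable names are assumptions for illustration) writes Windows-style line endings and reads them back in both modes under Python 3:

```python
import os
import tempfile

fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"first\r\nsecond\r\n")   # Windows-style line endings

with open(demo_path, "r") as f:         # text mode: newlines are translated
    as_text = f.read()
with open(demo_path, "rb") as f:        # binary mode: bytes come back untouched
    as_bytes = f.read()
os.remove(demo_path)
```

The text-mode result is shorter than the file on disk, which is exactly the kind of mismatch that breaks an in-place rewrite relying on byte offsets.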

eyquem
2

Alternative version

You can call read() directly if you know the character position pos of the line break that separates the header from the part of interest (for example a \n):

with open('input.txt', 'r') as txt_in:
    txt_in.seek(pos)
    second_half = txt_in.read()

If you are interested in both halves, you could also use the following approach:

with open('input.txt', 'r') as txt_in:
    all_contents = txt_in.read()
first_half = all_contents[:pos]
second_half = all_contents[pos:]
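If pos is not known in advance, one way to obtain it is to read the header lines once and ask the file for its position. This is a sketch under assumptions: the helper name `offset_after_lines` and the temporary file are illustrative, the file is opened in binary mode so that the offset is a plain byte count, and the returned offset points just past the separating line break rather than at it.

```python
import os
import tempfile

def offset_after_lines(path, n=6):
    """Return the byte offset at which line n+1 starts."""
    with open(path, "rb") as f:
        for _ in range(n):
            f.readline()
        return f.tell()

fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"a\nb\nc\nd\ne\nf\ndata 1\ndata 2\n")

pos = offset_after_lines(demo_path)   # six 2-byte lines precede the data

with open(demo_path, "rb") as f:
    f.seek(pos)
    second_half = f.read()
os.remove(demo_path)
```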
Community
strpeter
0

You can read the entire file into an array/list and then just start at the index appropriate to the line you wish to start reading at.

with open('dwn.txt', 'rb') as f:
    fileAsList = f.readlines()
fileAsList[0]  # first line
fileAsList[1]  # second line
fileAsList[6:]  # line 7 through the end
Convolution
0
#!/usr/bin/python

with open('dwn.txt', 'r') as f:
    lines_7_through_end = f.readlines()[6:]

print("Lines 7+:")
for i, line in enumerate(lines_7_through_end, start=7):
    print("    Line %s: %s" % (i, line))

Prints:

Lines 7+:

  Line 7: 201102 foo bar 200 50

  Line 8: 201101 foo bar 300 35

Edit:

To rebuild dwn.txt without the first six lines, do this after the above code:

with open('dwn.txt', 'w') as f:
    for line in lines_7_through_end:
        f.write(line)
Cuga
0

I wrote a script that cuts an Apache access.log file several times a day. It isn't the original topic of the question, but I think it can be useful if you need to store the file cursor position after reading the first six lines.

I needed to set the cursor to the last line parsed during the previous execution. To this end, I used the file.seek() and file.tell() methods, which make it possible to store and restore the cursor position in the file.

My code:

import os

ENCODING = "utf8"
CURRENT_FILE_DIR = os.path.dirname(os.path.abspath(__file__))

# This file is used to store the last cursor position
cursor_position = os.path.join(CURRENT_FILE_DIR, "access_cursor_position.log")

# Log file with new lines
log_file_to_cut = os.path.join(CURRENT_FILE_DIR, "access.log")
cut_file = os.path.join(CURRENT_FILE_DIR, "cut_access", "cut.log")

# Position to resume from; stays 0 on the first run
from_position = 0
try:
    with open(cursor_position, "r", encoding=ENCODING) as f:
        from_position = int(f.read())
except (OSError, ValueError):
    # No usable cursor file yet (e.g. first run): start from the beginning
    pass

# We read log_file_to_cut to put new lines in cut_file
with open(log_file_to_cut, "r", encoding=ENCODING) as f:
    with open(cut_file, "w", encoding=ENCODING) as fw:
        # We set cursor to the last position used (during last run of script)
        f.seek(from_position)
        for line in f:
            fw.write("%s" % (line))

    # We save the last position of cursor for next usage
    with open(cursor_position, "w", encoding=ENCODING) as fw:
        fw.write(str(f.tell()))
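To try the idea without an Apache log, here is a reduced sketch of the same seek()/tell() bookkeeping. The function name `read_new_bytes`, the file names and the temporary directory are assumptions for illustration; it uses binary mode so that the stored position is a plain byte offset.

```python
import os
import shutil
import tempfile

def read_new_bytes(log_path, cursor_path):
    """Return what was appended to log_path since the previous call."""
    try:
        with open(cursor_path, "r") as f:
            start = int(f.read())
    except (OSError, ValueError):
        start = 0                      # first run, or unreadable cursor file
    with open(log_path, "rb") as f:
        f.seek(start)                  # resume where the last run stopped
        new_data = f.read()
        end = f.tell()
    with open(cursor_path, "w") as f:
        f.write(str(end))              # remember the position for next time
    return new_data

workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "access.log")
cursor_path = os.path.join(workdir, "cursor.txt")

with open(log_path, "wb") as f:
    f.write(b"request 1\n")
first = read_new_bytes(log_path, cursor_path)    # sees the first entry

with open(log_path, "ab") as f:
    f.write(b"request 2\n")
second = read_new_bytes(log_path, cursor_path)   # sees only the new entry

shutil.rmtree(workdir)
```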
Samuel Dauzon
-1

Just do f.readline() six times. Ignore the returned value.
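Spelled out, that could look like the sketch below, where an `io.StringIO` object stands in for the opened file (the sample content is an assumption for illustration):

```python
import io

# Stand-in for f = open('dwn.txt'); six header lines, then the data.
f = io.StringIO("1\n2\n3\n4\n5\n6\nseven\neight\n")

for _ in range(6):
    f.readline()    # return value deliberately ignored

rest = f.read()     # the file position now sits at the start of line 7
```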

Spacedman
  • did you tried doing it yourself? how on a freaking earth this answer could have two upvotes? are there some evil perl hackers upvoting or something? – SilentGhost Feb 01 '11 at 15:56
  • I meant f.readline(). .next() is nicer though. You guys win. I lose. – Spacedman Feb 01 '11 at 16:03
  • Although if you .next() and then try .readline() get a ValueError for mixing iteration and read methods. – Spacedman Feb 01 '11 at 16:06
  • 2
    You've downvoted 'readlines()' solutions for valid reasons explained, but why downvote a readline() [times 6] solution? Surely this doesn't read the whole file. Note also my issue with .next() and then .readline(). – Spacedman Feb 01 '11 at 18:22
  • @Spacedman: because readline() is old hat and because of the very issue that you mention – John Machin Feb 01 '11 at 20:26
-1

Solutions with readlines() are not satisfactory in my opinion, because readlines() reads the entire file. The user then has to go over the lines again (in the file or in the resulting list) to process what he wants, whereas the work could have been done without first reading the interesting lines once. Moreover, if the file is big, its whole content sits in memory, while a for line in file loop would have been lighter.

Repeated readline() calls can be written like this:

nb = 6
exec( nb * 'f.readline()\n')

It's a short piece of code, and nb can be adjusted programmatically.
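For comparison, nb stays just as adjustable without exec. The sketch below swaps in a different technique, the "consume" idiom from the itertools recipes, and uses an in-memory stand-in for the file (both are assumptions for illustration, not what this answer proposes):

```python
import io
from itertools import islice

nb = 6
f = io.StringIO("1\n2\n3\n4\n5\n6\nseven\neight\n")  # stand-in for the open file

# islice(f, nb, nb) yields nothing but advances f by nb lines;
# the default None keeps next() from raising StopIteration.
next(islice(f, nb, nb), None)

rest = list(f)  # lines 7 through the end
```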

eyquem
  • are you serious? `exec`. in all fairness! – SilentGhost Feb 01 '11 at 17:49
  • 3
    +1 for not reading the whole file into memory, -100 for using `exec` – John Machin Feb 01 '11 at 17:56
  • What is there against exec()? It's still in Python 3; if it were as bad as xreadlines() was, it would have been deprecated the same way. I never use exec(), but it seemed to me that in this case it could shorten the code instead of writing 6 lines with readline() – eyquem Feb 01 '11 at 18:51
  • 1
    « Solutions with readlines() are not satisfactory in my opinion because readlines() reads the entire file. » Well, it can be discussed. It depends of the file and the objective. If a file is big and that only a few lines are interesting, it isn't a good idea to read the entire file before treat it in a re-reading. But if not big and all the lines put in a list simplify the code or whatever else, it could be acceptable. It depends. I am no more in agreement with myself. – eyquem Feb 01 '11 at 19:04