5

A recent question about splitting a binary file using null characters made me think of a similar text-oriented question.

Given the following file:

Parse me using spaces, please.

Using Raku, I can parse this file using space (or any chosen character) as the input newline character, thus:

my $fh = open('spaced.txt', nl-in => ' ');

while $fh.get -> $line {
    put $line;
}

Or more concisely:

.put for 'spaced.txt'.IO.lines(nl-in => ' ');

Either of which gives the following result:

Parse
me
using
spaces,
please.

Is there something equivalent in Python 3?

The closest I could find required reading an entire file into memory:

for line in f.read().split('\0'):
    print(line)

Update: I found several other older questions and answers that seemed to indicate that this isn't available, but I figured there may have been new developments in this area in the last several years:
Python restrict newline characters for readlines()
Change newline character .readline() seeks

Christopher Bottoms

3 Answers

3

There is no built-in support for reading a file split on a custom character.

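However, opening a file in universal-newlines mode (the "U" flag in Python 2; already the default for text mode in Python 3, where "U" was deprecated and later removed) treats \n, \r and \r\n all as line endings, and the endings actually seen are available through file.newlines. That mode applies to the whole file and cannot be pointed at an arbitrary character.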
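
As a quick check of the point above, the sketch below probes which values Python 3's text layer accepts for its `newline` argument. It wraps an in-memory `io.BytesIO` in `io.TextIOWrapper` instead of opening a real file, and `accepts_newline` is a hypothetical helper name used only for this illustration.

```python
import io

def accepts_newline(value):
    """Return True if io.TextIOWrapper accepts `value` as its newline argument."""
    raw = io.BytesIO(b"Parse me using spaces, please.")
    try:
        io.TextIOWrapper(raw, newline=value)
    except ValueError:
        # Raised for anything other than None, '', '\n', '\r' or '\r\n'.
        return False
    return True

for candidate in (None, "", "\n", "\r", "\r\n", " ", "|"):
    print(repr(candidate), accepts_newline(candidate))
```

Only the standard line endings pass; a space or any other custom character is rejected, which is why a hand-written generator is needed.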

Here is my generator for reading a file without loading everything into memory:

def customReadlines(fileNextBuff, char):
    """
    fileNextBuff: a function returning the next buffer, or "" on EOF
    char: the separator string to split on; it is not included in the yielded elements
    """
    lastLine = ""
    lenChar = len(char)
    while True:
        thisLine = fileNextBuff()  # call the function to fetch the next buffer
        if not thisLine:
            break  # EOF
        # note: a multi-character separator that straddles two buffers is not detected
        fnd = thisLine.find(char)
        while fnd != -1:
            yield lastLine + thisLine[:fnd]
            lastLine = ""
            thisLine = thisLine[fnd + lenChar:]
            fnd = thisLine.find(char)
        lastLine += thisLine
    yield lastLine


### EXAMPLES ###

# open file.bin and print each null-terminated part of the file, reading it in buffers of 256 characters
with open("file.bin", "r") as f:
    for l in customReadlines((lambda: f.read(0x100)), "\0"):
        print(l)

# open the file errors.log and split it on a special string, loading one whole line at a time
with open("errors.log", "r") as f:
    for l in customReadlines(f.readline, "ERROR:"):
        print(l)
        print(" " + '-' * 78)  # some separator
cmdLP
1

Would this one do what you need?

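def newreadline(f, newlinechar='\0'):
    # Read one character at a time until the delimiter or EOF ('' from read).
    c = f.read(1)
    b = [c]
    while c != newlinechar and c != '':
        c = f.read(1)
        b.append(c)
    # The delimiter is kept in the returned string, as readline() keeps the newline.
    return ''.join(b)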

EDIT: added a replacement for readlines():

def newreadlines(f, newlinechar='\0'):
    line = newreadline(f, newlinechar)
    while line:
        yield line
        line = newreadline(f, newlinechar)

so that OP can do the following:

for line in newreadlines(f, newlinechar='\0'):
    print(line)
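
For comparison, the two functions above can be folded into a single generator. This is only a sketch of the same character-by-character idea, not a replacement for the answer's code: `iter_records` is a hypothetical name, and `io.StringIO` stands in for the opened file so the example runs on its own.

```python
import io
from functools import partial

def iter_records(f, sep="\0"):
    """Yield records from text stream f; each record keeps its trailing sep."""
    buf = []
    # iter(callable, sentinel) keeps calling f.read(1) until it returns '' (EOF).
    for c in iter(partial(f.read, 1), ""):
        buf.append(c)
        if c == sep:
            yield "".join(buf)
            buf = []
    if buf:  # last record when the stream does not end with sep
        yield "".join(buf)

sample = io.StringIO("Parse me using spaces, please.")
records = list(iter_records(sample, sep=" "))
print(records)  # ['Parse ', 'me ', 'using ', 'spaces, ', 'please.']
```

As with `newreadline()`, the separator stays attached to each record, so callers who want bare words still need to strip it.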
AGN Gazer
  • This gives me individual characters when used thus: ``` with open('spaced.txt','r') as f: for line in newreadline(f): print(line) ``` Sorry about the formatting. – Christopher Bottoms Aug 04 '17 at 12:29
  • Yes, `newreadline` is intended as a replacement for `readline()` - it reads a single line. You would get the same result from `readline()`: `with open('spaced.txt','r') as f: for line in f.readline(): print(line)` - it will print *individual characters* - not lines! If you want to read *all lines* then you should use your own suggestion: `for line in f.read().split('\0'): print(line)`. – AGN Gazer Aug 04 '17 at 13:18
  • If you want to print all lines using my function (i.e., not loading the entire file into memory) do this: `with open('spaced.txt','r') as f: line = newreadline(f); while line: print(line); line = newreadline(f)`. Alternatively you can create a generator based on my function that would behave as **`readlines()`** – AGN Gazer Aug 04 '17 at 13:20
  • I have edited my answer to include an example of generator: `newreadlines()`. Also I have modified `newreadline()` to keep the new line character in the returned string in order to behave similarly with built-in `readline()`. – AGN Gazer Aug 04 '17 at 13:55
0
def parse(fp, split_char, read_size=16):
    def give_chunks():
        while True:
            stuff = fp.read(read_size)
            if not stuff:
                break
            yield stuff
    leftover = ''
    for chunk in give_chunks():
        *stuff, leftover = (leftover + chunk).split(split_char)
        yield from stuff
    if leftover:
        yield leftover
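
The same chunked approach can be sketched so that it also covers the binary, null-separated case mentioned in the question. The only change is that the initial leftover takes its type from the separator; `split_stream` is a hypothetical name, and the in-memory streams are stand-ins for real files.

```python
import io

def split_stream(fp, sep, read_size=16):
    """Yield pieces of fp separated by sep; works for str or bytes streams."""
    leftover = sep[:0]  # empty value of the same type as sep ('' or b'')
    while True:
        chunk = fp.read(read_size)
        if not chunk:
            break
        # Every piece before the last one is complete; the tail may continue
        # in the next chunk, so it is carried over.
        *pieces, leftover = (leftover + chunk).split(sep)
        yield from pieces
    if leftover:
        yield leftover

words = list(split_stream(io.StringIO("Parse me using spaces, please."), " "))
fields = list(split_stream(io.BytesIO(b"one\0two\0three"), b"\0", read_size=4))
print(words)   # ['Parse', 'me', 'using', 'spaces,', 'please.']
print(fields)  # [b'one', b'two', b'three']
```

Because the carried-over tail is joined to the next chunk before splitting, a piece (or a multi-character separator) that straddles a chunk boundary is still handled.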

If you are OK with splitting on newlines as well as on split_char, the one below works (for example, for reading a text file word by word):

def parse(fobj, split_char):
    for line in fobj:
        yield from line.split(split_char)

In [5]: for word in parse(open('stuff.txt'), ' '):
   ...:     print(word)
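
One thing to watch with the line-based version: the last piece of each line keeps its trailing newline. A minimal sketch of the difference, assuming `\n` line endings; `parse_words` is a hypothetical variant that strips the newline before splitting.

```python
import io

def parse_words(fobj, split_char):
    """Split each line on split_char after dropping the line's trailing newline."""
    for line in fobj:
        yield from line.rstrip("\n").split(split_char)

text = "Parse me\nusing spaces,\nplease.\n"
# Splitting the raw lines, as in the line-based snippet above.
raw = [w for line in io.StringIO(text) for w in line.split(" ")]
clean = list(parse_words(io.StringIO(text), " "))
print(raw)    # ['Parse', 'me\n', 'using', 'spaces,\n', 'please.\n']
print(clean)  # ['Parse', 'me', 'using', 'spaces,', 'please.']
```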
balki
  • Your first snippet fails to join results across lines. Your second snippet silently discards any final leftovers. I'm not sure whether those are the only problems. – user2357112 Aug 03 '17 at 17:44
  • Agreed, the first one does not work in all cases. But for simple cases like reading a text file or a CSV file word by word, it works. Fixed the second one to not discard `leftover`. Thanks for the catch! – balki Aug 03 '17 at 18:02
  • Note: I swapped first and second to put the more correct one on top – balki Aug 16 '17 at 21:19