How to read a file word by word

Question

I have a PPM file that I need to do certain operations on. The file is structured as in the following example. The first line, 'P3', just says what kind of document it is. The second line gives the pixel dimensions of the image, so in this case it's telling us that the image is 480x640. The third line declares the maximum value any color can take. After that come the lines of pixel data. Every group of three integers gives an RGB value for one pixel. So in this example, the first pixel has RGB value 49, 49, 49. The second pixel has RGB value 48, 48, 48, and so on.

P3
480 640
255
49   49   49   48   48   48   47   47   47   46   46   46   45   45   45   42   42   42   38   38   
38   35   35   35   23   23   23   8   8   8   7   7   7   17   17   17   21   21   21   29   29   
29   41   41   41   47   47   47   49   49   49   42   42   42   33   33   33   24   24   24   18   18   
...

Now as you may notice, this particular picture is supposed to be 640 pixels wide which means 640*3 integers will provide the first row of pixels. But here the first row is very, very far from containing 640*3 integers. So the line-breaks in this file are meaningless, hence my problem.

The main way to read files in Python is line by line. But I need to collect these integers into groups of 640*3 and treat each group like a line. How would one do this? I know I could read the file line by line and append every line to some list, but that list would be massive, and I assume it would place an unacceptable burden on a device's memory. Other than that, I'm out of ideas. Help would be appreciated.

@thefourtheye it varies from line to line in no obviously systematic way. — Addem, Oct 16 '14 at 04:29
related: [How to read tokens without reading whole line or file](http://stackoverflow.com/q/20019503/4279) — jfs, Oct 16 '14 at 05:35
related: [What is the most “pythonic” way to iterate over a list in chunks?](http://stackoverflow.com/q/434287/4279) — jfs, Oct 16 '14 at 05:36

score 3 · Answer 1 · edited May 23 '17 at 12:20

To read three space-separated words at a time from a file:

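with open(filename) as file:  # text mode, so each word is a str
    kind, dimensions, max_color = map(next, [file]*3) # read 3 lines
    # Three references to the same generator make zip() group the integers in threes.
    # list() consumes it before the file is closed (zip() is lazy in Python 3).
    rgbs = list(zip(*[(int(word) for line in file for word in line.split())] * 3))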

Output

[(49, 49, 49),
 (48, 48, 48),
 (47, 47, 47),
 (46, 46, 46),
 (45, 45, 45),
 (42, 42, 42),
 ...

See What is the most “pythonic” way to iterate over a list in chunks?

To avoid creating the whole list at once, you could use itertools.izip() in Python 2, or the built-in zip() in Python 3, which is already lazy. Either one lets you read one RGB value at a time.
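As a rough illustration of the lazy variant, here is a minimal Python 3 sketch. The helper name `iter_pixels` and the in-memory sample (standing in for a real file opened with `open()`) are assumptions made for this example, not part of the original answer.

```python
import io
from itertools import islice

def iter_pixels(lines):
    """Lazily yield (r, g, b) tuples from an iterable of text lines."""
    values = (int(word) for line in lines for word in line.split())
    # Passing the same iterator three times makes zip() group values in threes.
    return zip(values, values, values)

# Hypothetical sample data; a real file object would be used the same way.
sample = io.StringIO("P3\n2 2\n255\n49 49 49 48\n48 48 47 47 47 46\n46 46\n")
kind, dimensions, max_color = (next(sample).strip() for _ in range(3))

# Only as many lines are read as are needed for the requested pixels.
first_two = list(islice(iter_pixels(sample), 2))
print(first_two)  # [(49, 49, 49), (48, 48, 48)]
```

Because nothing is collected into a list unless you ask for it, memory use stays small regardless of the image size.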

Anthony · Accepted Answer · 2014-10-16T05:16:29.800

Probably not the most 'pythonic' way but...

Iterate through the lines containing integers.

Keep four counters: color_code_count (counts up to 3), numbers_processed (counts up to 1920, i.e. 640*3), col (0-639), and row (0-479).

For each integer you encounter, add it to a temporary list at index color_code_count. Increment color_code_count and numbers_processed.

Once color_code_count is 3, take your temporary list and create a 3-tuple or triplet (not sure what the term is, but your structure will look like (49, 49, 49) for the first pixel), and add it to a list of 640 columns and 480 rows: insert your (49, 49, 49) into pixels[col][row].

Increment col. Reset color_code_count.
'numbers_processed' will continue to increment until you get to 1920.

Once you hit 1920, you've reached the end of the first row.
Reset numbers_processed and col to zero, increment row by 1.

By this point, you should have 640 3-tuples or triplets in row zero, starting with (49, 49, 49), (48, 48, 48), (47, 47, 47), etc. And you're now starting to insert pixel values in row 1, column 0.

Like I said, probably not the most 'pythonic' way. There are probably better ways of doing this using join and map but I think this might work? This 'solution' if you want to call it that, shouldn't care about number of integers on any line since you're keeping count of how many numbers you expect to run through (1920) before you start a new row.
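The counting scheme described above could be sketched roughly as follows. This is only one possible reading of the steps: the function name `group_into_rows` is made up for the example, the grid is indexed as pixels[row][col] rather than pixels[col][row], and the length of the temporary list plays the role of color_code_count.

```python
def group_into_rows(numbers, width, height):
    """Arrange a flat stream of ints into `height` rows of `width` RGB tuples."""
    pixels = [[None] * width for _ in range(height)]  # pixels[row][col] in this sketch
    triplet = []   # temporary list for the pixel being assembled
    col = row = 0
    for number in numbers:
        if row == height:
            break  # ignore anything beyond the declared dimensions
        triplet.append(number)
        if len(triplet) == 3:      # color_code_count reached 3
            pixels[row][col] = tuple(triplet)
            triplet = []           # reset for the next pixel
            col += 1
            if col == width:       # equivalent to numbers_processed == width * 3
                col = 0
                row += 1
    return pixels

# The line breaks in the input do not matter, only the running counts do.
lines = ["1 2 3 4", "5 6", "7 8 9 10 11 12"]
numbers = (int(word) for line in lines for word in line.split())
grid = group_into_rows(numbers, width=2, height=2)
print(grid)  # [[(1, 2, 3), (4, 5, 6)], [(7, 8, 9), (10, 11, 12)]]
```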

Yep, I'm probably going to implement something like this, thanks! — Addem, Oct 16 '14 at 05:36

score 0 · Answer 3 · answered Oct 16 '14 at 05:10

A possible way to go through each word is to iterate through each line, then .split() it into words.

the_file = open("file.txt", "r")  # the mode must be the string "r"

for line in the_file:
    for word in line.split():
        pass  #-----Your Code-----

From there you can do whatever you want with your "words." You can add if-statements to check if there are numbers in each line with: (Though not very pythonic)

for line in the_file:
    if any(ch.isdigit() for ch in line):  # the line contains at least one digit
        for word in line.split():
            pass  #-----Your Code-----

Or you can test if there is anything in each line: (Much more pythonic)

for line in the_file:
    for word in line.split():
        if word:  # split() with no arguments never yields empty strings or "\n"
            pass  #-----Your Code-----

I would recommend adding each of your new "lines" to a new document.
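One way the "new document" idea might look in practice is sketched below: the numbers are copied to an output stream so that every output line holds exactly one image row. The function name `rewrite_rows` and the in-memory streams are assumptions for illustration, and the sketch assumes the three header lines have already been read or copied separately.

```python
import io

def rewrite_rows(src, dst, values_per_row):
    """Copy numbers from src to dst, writing values_per_row of them per line."""
    row = []
    for line in src:
        for word in line.split():
            row.append(word)
            if len(row) == values_per_row:
                dst.write(" ".join(row) + "\n")
                row = []  # only one row is held in memory at a time
    if row:  # leftover values that do not fill a whole row
        dst.write(" ".join(row) + "\n")

# In-memory stand-ins for the input and output files.
src = io.StringIO("1 2 3 4\n5 6\n7 8 9 10 11 12\n")
dst = io.StringIO()
rewrite_rows(src, dst, values_per_row=6)  # 6 = width 2 * 3 color values
print(dst.getvalue())
```

After this pass, the new file can be read line by line, with each line corresponding to one row of pixels.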

score 0 · Answer 4 · answered Oct 16 '14 at 05:32

I am a C programmer. Sorry if this code looks like C Style:

f = open("pixel.ppm", "r")
kind = f.readline()  # e.g. "P3"; named kind to avoid shadowing the built-in type()
height, width = f.readline().split()
height, width = int(height), int(width)
max_color = int(f.readline())
colors = []
count = 0
col_count = 0
line = []
while(col_count < height):
    count = 0
    i = 0
    row =[]
    while(count < width * 3):
        temp = f.readline().strip()
        if(temp == ""):
            col_count = height
            break
        temp = temp.split()
        line.extend(temp)
        i = 0
        while(i + 2 < len(line)):
            row.append({'r':int(line[i]),'g':int(line[i+1]),'b':int(line[i+2])})
            i = i+3
            count = count +3
            if(count >= width *3):
                break
        if(i < len(line)):
            line = line[i:len(line)]
        else:
            line = []
    col_count += 1
    colors.append(row)
for row in colors:
    for rgb in row:
        print(rgb)
    print("\n")

You can tweak this according to your needs. I tested it on this file:

P4
3 4
256
4 5 6 4 7 3
2 7 9 4
2 4
6 8 0 
3 4 5 6 7 8 9 0 
2 3 5 6 7 9 2 
2 4 5 7 2 
2

score 0 · Answer 5 · answered Oct 16 '14 at 05:46

This seems to do the trick:

from re import findall

def _split_list(lst, i):
    return lst[:i], lst[i:]

def iter_ppm_rows(path):
    with open(path) as f:
        ftype = f.readline().strip()
        h, w = (int(s) for s in f.readline().split())  # split() tolerates repeated spaces
        maxcolor = int(f.readline())

        rlen = w * 3
        row = []
        next_row = []

        for line in f:
            # raw string; matching only digits also catches a final number with no trailing whitespace
            line_ints = [int(i) for i in findall(r'\d+', line)]

            if not row:
                row, next_row = _split_list(line_ints, rlen)
            else:
                rest_of_row, next_row = _split_list(line_ints, rlen - len(row))
                row += rest_of_row

            if len(row) == rlen:
                yield row
                row = next_row
                next_row = []

It isn't very pretty, but it allows for varying whitespace between numbers in the file, as well as varying line lengths.

I tested it on a file that looked like the following:

P3
120 160
255
0   1   2   3   4   5   6   7   
8   9   10   11   12   13   
14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   
[...]
9993   9994   9995   9996   9997   9998   9999

That file used random line lengths, but printed numbers in order so it was easy to tell at what value the rows began and stopped. Note that its dimensions are different than in the question's example file.

Using the following test code...

for row in iter_ppm_rows('mock_ppm.txt'): 
    print(len(row), row[0], row[-1])

...the result was the following, which suggests that no data is skipped and that rows of the right size are returned.

480 0 479
480 480 959
480 960 1439
480 1440 1919
480 1920 2399
480 2400 2879
480 2880 3359
480 3360 3839
480 3840 4319
480 4320 4799
480 4800 5279
480 5280 5759
480 5760 6239
480 6240 6719
480 6720 7199
480 7200 7679
480 7680 8159
480 8160 8639
480 8640 9119
480 9120 9599

As can be seen, trailing data at the end of the file that can't represent a complete row was not yielded. That was expected, but you'd likely want to account for it somehow.
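One possible way to account for the trailing data is sketched below. It is a simplified variant rather than the function from this answer: the name `iter_rows` and the `keep_partial` flag are hypothetical, and it takes any iterable of lines instead of a path. The inner `while` loop also covers a single line that is long enough to complete more than one row.

```python
import re

def iter_rows(lines, row_len, keep_partial=False):
    """Yield lists of row_len ints; optionally yield a final short row."""
    row = []
    for line in lines:
        row.extend(int(s) for s in re.findall(r"\d+", line))
        while len(row) >= row_len:  # a long line may complete several rows
            yield row[:row_len]
            row = row[row_len:]
    if row and keep_partial:
        yield row  # trailing values that do not fill a whole row

lines = ["0 1 2 3 4 5 6 7", "8 9 10"]
complete_only = list(iter_rows(lines, 4))
with_partial = list(iter_rows(lines, 4, keep_partial=True))
print(complete_only)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(with_partial)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
```

Whether a short final row should be yielded, padded, or treated as an error depends on how strictly the file is expected to match its declared dimensions.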
