
A friend of mine needs to read a lot of data (about 18000 data sets) that is all formatted annoyingly. Specifically, the data is supposed to be 8 columns and ~8000 rows, but instead it is delivered as rows of 7 columns, with the last entry spilling into the first column of the next row.

In addition, every ~30 rows there is a row with only 4 columns. This is because some upstream program is reshaping a 200 x 280 array into the 7x8120 array.

My question is this: How can we read the data into an 8x7000 array? My usual arsenal of np.loadtxt and np.genfromtxt fails when the number of columns is uneven.

Keep in mind that performance is a factor, since this has to be done for ~18000 data files.

Here is a link to a typical data file: http://users-phys.au.dk/hha07/hk_L1.ref

Sven Marnach
HansHarhoff
  • To clarify: every 28 rows there's a 4-column row because of the continued "overflowing" of the eighth column into each next row. Right? Every block of 28*7 + 4 has 200 items, which is evenly divisible by 8. – Eduardo Ivanec Mar 22 '12 at 13:22
    An example would be very useful. – Adam Matan Mar 22 '12 at 13:29
    How about fixing the upstream program to output nice HDF5 files, or at least something less insane than this? – Sven Marnach Mar 22 '12 at 14:24

3 Answers


An even easier approach I just thought of:

import numpy

with open("hk_L1.ref") as f:
    data = numpy.array(f.read().split(), dtype=float).reshape(7000, 8)

This reads the data as a one-dimensional array first, completely ignoring all new-line characters, and then we reshape it to the desired shape.
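If the exact number of rows differs from file to file, a variant I'd suggest (not part of the original answer) is to let NumPy infer the row count and only pin the column count to 8:

import numpy

with open("hk_L1.ref") as f:
    # -1 lets reshape infer the number of rows; only the 8 columns are fixed
    data = numpy.array(f.read().split(), dtype=float).reshape(-1, 8)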

While I think that the task will be I/O-bound anyway, this approach should use little processor time if it matters.
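Since this has to be repeated for roughly 18000 files, the outer loop could look something like the sketch below; the directory and file pattern are made up for illustration:

import glob
import numpy

arrays = {}
for path in glob.glob("refs/*.ref"):  # hypothetical location of the ~18000 files
    with open(path) as f:
        # same idea as above: flatten on whitespace, then reshape to 8 columns
        arrays[path] = numpy.array(f.read().split(), dtype=float).reshape(-1, 8)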

Sven Marnach

Provided I understood you correctly (see my comment) you can split your input into tokens, then process it in blocks of eight regardless of line breaks:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

with open('filename.ref') as f:
    tokens = f.read().split()

rows = []
for idx, token in enumerate(tokens):
    if idx % 8 == 0:
        # this is a new row, use a new list.
        row = []
        rows.append(row)
    row.append(token)

# rows is now a list of lists with the desired data.

This runs in under 0.2 seconds on my computer as is.
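If a NumPy array is wanted at the end (the question mentions np.loadtxt and np.genfromtxt, so I'm assuming NumPy is available), the list of lists can be converted afterwards, e.g.:

import numpy

# one row per block of eight tokens; assumes the total token count is a multiple of 8
data = numpy.array(rows, dtype=float)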

Edit: used @SvenMarnach's suggestion.

Eduardo Ivanec
  • Why are you using `shlex` here? A simple `str.split()` would do the trick. – Sven Marnach Mar 22 '12 at 13:43
  • @SvenMarnach: I thought this was easier as it presents an endless stream of tokens from the file. With `line.split()` I would have to iterate over the file myself, keeping track of the current offset due to the 8-columns-in-7 issue the OP described. Either that or write a generator, I guess, but that's pretty much what I use shlex here for. If I'm not following please let me know! – Eduardo Ivanec Mar 22 '12 at 13:50
  • @SvenMarnach: you're right of course. I thought using it made sense because it acts as a generator (should cut memory usage by half) but it obviously doesn't, that speeds it up by a factor of 20! Thanks. – Eduardo Ivanec Mar 22 '12 at 13:59

How about this?

data = []
curRow = []
dataPerRow = 8
for row in FILE.readlines():
    for item in row.split():
        if len(curRow) == dataPerRow:
            data.append(curRow)
            curRow = []
        curRow.append(item)

data.append(curRow)

(assuming FILE is the open file being read in). You then have a list of lists, which can be used for whatever.
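If the same thing has to be done for many files, the snippet could be wrapped in a small function; this is only a sketch along the same lines, not part of the original answer:

def read_blocks(path, dataPerRow=8):
    # hypothetical helper wrapping the loop above
    data = []
    curRow = []
    with open(path) as FILE:
        for row in FILE:
            for item in row.split():
                if len(curRow) == dataPerRow:
                    data.append(curRow)
                    curRow = []
                curRow.append(item)
    data.append(curRow)
    return data

rows = read_blocks("hk_L1.ref")  # filename taken from the question, used as an example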

Jacob