4

I'm new to python and I'm trying to figure out how to load a data file that contains blocks of data on a per timestep basis, such as like this:

TIME:,0
Q01 : A:,-10.7436,0.000536907,-0.00963283,0.00102934
Q02 : B:,0,0.0168694,-0.000413983,0.00345921
Q03 : C:,0.0566665
Q04 : D:,0.074456
Q05 : E:,0.077456
Q06 : F:,0.0744835
Q07 : G:,0.140448
Q08 : H:,-0.123968
Q09 : I:,0
Q10 : J:,0.00204377,0.0109621,-0.0539183,0.000708574
Q11 : K:,-2.86115e-17,0.00947104,0.0145645,1.05458e-16,-1.90972e-17,-0.00947859
Q12 : L:,-0.0036781,0.00161254
Q13 : M:,-0.00941257,0.000249692,-0.0046302,-0.00162387,0.000981709,-0.0135982,-0.0223496,-0.00872062,0.00548815,0.0114075,.........,-0.00196206
Q14 : N:,3797, 66558
Q15 : O:,0.0579981
Q16 : P:,0
Q17 : Q:,625

TIME:,0.1
Q01 : A:,-10.563,0.000636907,-0.00963283,0.00102934
Q02 : B:,0,0.01665694
Q03 : C:,0.786,-0.000666,0.6555
Q04 : D:,0.87,0.96
Q05 : E:,0.077456
Q06 : F:,0.07447835
Q07 : G:,0.140448
Q08 : H:,-0.123968
Q09 : I:,0
Q10 : J:,0.00204377,0.0109621,-0.0539183,0.000708574
Q11 : K:,-2.86115e-17,0.00947104,0.0145645,1.05458e-16,-1.90972e-17,-0.00947859
Q12 : L:,-0.0036781,0.00161254
Q13 : M:,-0.00941257,0.000249692,-0.0046302,-0.00162387,0.000981709,-0.0135982,-0.0223496,-0.00872062,0.00548815,0.0114075,.........,-0.00196206
Q14 : N:,3797, 66558
Q15 : O:,0.0579981
Q16 : P:,0,2,4
Q17 : Q:,786

Each block contains a number of variables that may have very different numbers of columns of data in it. The number of columns per variable may change in each timestep block, but the number of variables per block is the same in every timestep and it is always known how many variables were exported. There is no information on the number of blocks of data (timesteps) in the data file.

When the data has been read, it should be loaded in a format of variable per timestep:

Time:  |  A:                                           |  B:
0      |  -10.7436,0.000536907,-0.00963283,0.00102934  |  ........
0.1    |  -10.563,0.000636907,-0.00963283,0.00102934   |  ........
0.2    |  ......                                       |  ........

If the number of columns of data was the same every timestep and the same for every variable , this would be a very simple problem.

I guess I need to read the file line by line, in two loops, one per block and then once inside each block and then store the inputs in an array (append?). The changing number of columns per line has me a little stumped at the minute since I'm not very familiar with python and numpy yet.

If someone could point me in the right direction, such as what functions I should be using to do this relatively efficiently, that would be great.

jpmorr
  • 500
  • 5
  • 25
  • take a look here: http://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python – dot.Py May 20 '16 at 19:52
  • @Dot_Py Thanks, but that solution seems to depend on knowing the maximum number of columns, which I have no idea of. The last answer there had a loop, but didn't seem to work. – jpmorr May 20 '16 at 19:57
  • Oh wait... you dont know the number of columns? I think this is not a good practice... So you should re-think/re-design your database scheme to add more rows if necessary... not columns. – dot.Py May 20 '16 at 20:02
  • but you can loop through the rows and count the number of "," occurrencies to get the maximum number of columns (`num_max_cols`)in your whole .txt, then you can name your columns using "`range(num_max_cols)`" – dot.Py May 20 '16 at 20:04
  • @Dot_Py I would love to change the output, but it's coming from a simulation solver that is spitting out values of number of items in distinct zones. Since things are moving, the numbers change on a per timestep basis. – jpmorr May 20 '16 at 20:05
  • 1
    I disagree with what @Dot_Py is saying, there's no problem with not knowing the columns that are going to be present in the input. I think the answer to your matter is in populating a dictionary object. I've wrote an answer showing how I'd do it – Lucas Franceschi May 20 '16 at 20:17

4 Answers4

2

File test.csv:

1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4

Handling data:

my_cols = ["A", "B", "C", "D", "E"]
pd.read_csv("test.csv", names=my_cols, engine='python')

Output:

   A  B   C   D   E
0  1  2   3 NaN NaN
1  1  2   3   4 NaN
2  1  2   3   4   5
3  1  2 NaN NaN NaN
4  1  2   3   4 NaN

Or you can use names parameter.

For example:

1,2,1
2,3,4,2,3
1,2,3,3
1,2,3,4,5,6

If you read it, you'll receive the following error:

>>> pd.read_csv(r'D:/Temp/test.csv')
Traceback (most recent call last):
...
Expected 5 fields in line 4, saw 6

But if you pass names parameters, you'll get result:

>>> pd.read_csv(r'D:/Temp/test.csv', names=list('ABCDEF'))

Output:

   A  B  C   D   E   F
0  1  2  1 NaN NaN NaN
1  2  3  4   2   3 NaN
2  1  2  3   3 NaN NaN
3  1  2  3   4   5   6

Hope it helps.

dot.Py
  • 5,007
  • 5
  • 31
  • 52
  • 1
    Thanks again, but this doesn't quite work for my data - I will never know how many columns of data will exist per block, all I will know is how many variables (lines) are in each timestep block. – jpmorr May 20 '16 at 20:02
2

A very non-polished way to accomplish that is by reading your text file and creating a dict structure as you sweep through. Here's an example that might achieve your goal (based on the input you've provided):

time = 0
output = {}
with open('path_to_file','r') as input_file:
    for line in input_file:
        line = line.strip('\n')
        if 'TIME' in line:
            time = line.split(',')[1]
            output[time] = {}
        else:
            col_name = line.split(':')[1].strip()
            col_value = line.split(':')[2].strip(',') 
            output[time][col_name] = col_value

this will deliver a output object which is a dictionary with the following structure:

output = {
'0': {'A': '-10.7436,0.000536907,-0.00963283,0.00102934',
      'B': '0,0.0168694,-0.000413983,0.00345921',
      ...
      'Q': '625'},

'0.1': {'A': '-10.563,0.000636907,-0.00963283,0.00102934',
        'B': '0,0.01665694',
        ...
        'Q': '786'}
}

Which I think matches what you are looking for. To access one value inside this dictionary you should use value = output['0.1']['A'] which would yield '-10.563,0.000636907,-0.00963283,0.00102934'

Lucas Franceschi
  • 398
  • 6
  • 12
  • Yes, this is mostly working as I expected. There's a silly \n token stuck in the time dict but than can be sorted out. I hadn't though about dictionaries since I will be doing lots of array operations, but I assume I can simply convert a call to the dictionary to a numpy array if required. – jpmorr May 20 '16 at 20:22
  • Ah, yes. I've edited the answer to include a line that sorts this problem out. Notice that in the last line of the code I've suggested you know exactly where you are (time, column name and column value). You could easily change the last line to put this data in any other structure if you want. – Lucas Franceschi May 20 '16 at 20:28
  • Hmmm, I'm getting this error when I try it on a data file: Traceback (most recent call last): File "", line 14, in res[TIME][parts[1].lstrip()] = parts[2].strip(',') IndexError: list index out of range. I need some time to check out why but thanks for the help so far! – jpmorr May 20 '16 at 20:33
2
import pandas as pd
res = {}
TIME = None

# by default lazy line read
for line in open('file.txt'):
    parts = line.strip().split(':')
    map(str.strip, parts)
    if len(parts) and parts[0] == 'TIME':
        TIME = parts[1].strip(',')
        res[TIME] = {}
        print('New time section start {}'.format(TIME))
        # here you can stop and work with data from previou period
        continue

    if len(parts) <= 1:
        continue
    res[TIME][parts[1].lstrip()] = parts[2].strip(',').split(',')

df = pd.DataFrame.from_dict(res, 'columns')
# for example for TIME 0
dfZero = df['0']
print(dfZero)


df = pd.DataFrame.from_dict(res, 'index')

dfA = df['A']
print(dfA)

enter image description here

VelikiiNehochuha
  • 3,775
  • 2
  • 15
  • 32
  • Thanks, This seems to be working quite nicely without any errors. I need to test it with some proper data files to see how it holds up. – jpmorr May 20 '16 at 20:37
  • What if I want to access variable A for all timesteps? The data frame only seems to let me access all variables for a single timestep? – jpmorr May 20 '16 at 20:47
  • @jpmorr you can just channge df = pd.DataFrame.from_dict(res, 'index') or df = pd.DataFrame.from_dict(res, 'columns'), please see example. – VelikiiNehochuha May 20 '16 at 21:06
1

This reader is similar to @Lucas's - each block is a dictionary saved in meta dictionary keyed by time. It could have been a list instead.

blocks = {}
with open('stack37354745.txt') as f:
    for line in f:
        line = line.strip()
        if len(line)==0: continue  # blank line
        d = line.split(':')
        if len(d)==2 and d[0]=='TIME':  # new block
            time = float(d[1].strip(','))
            blocks[time] = data = {}
        else:
           key = d[1].strip()  # e.g. A, B, C
           value = d[2].strip(',').split(',')
           value = np.array(value, dtype=float) # assume valid numeric list
           data[key] = value

Values can be fetched, displayed and reorganized with iterations like:

for time in blocks:
    b = blocks[time]
    print('TIME: %s'%time)
    for k in b:
        print('%4s: %s'%(k,b[k]))

produces:

TIME: 0.0
   C: [ 0.0566665]
   G: [ 0.140448]
   A: [ -1.07436000e+01   5.36907000e-04  -9.63283000e-03   1.02934000e-03]
   ...
   K: [ -2.86115000e-17   9.47104000e-03   1.45645000e-02   1.05458000e-16
  -1.90972000e-17  -9.47859000e-03]

TIME: 0.1
   C: [  7.86000000e-01  -6.66000000e-04   6.55500000e-01]
   G: [ 0.140448]
   A: [ -1.05630000e+01   6.36907000e-04  -9.63283000e-03   1.02934000e-03]
   ...
   K: [ -2.86115000e-17   9.47104000e-03   1.45645000e-02   1.05458000e-16
  -1.90972000e-17  -9.47859000e-03]

(I removed the .... from one of the data lines)

Or in a quasi table format

fmt = '%10s | %s | %s | %s'
print(fmt%('Time','B','D','E'))
for time in blocks:
    b =  blocks[time]
    # print(list(b.keys()))
    print(fmt%(time, b['B'], b['D'],b['E']))

producing:

      Time | B | D | E
       0.0 | [ 0.          0.0168694  -0.00041398  0.00345921] | [ 0.074456] | [ 0.077456]
       0.1 | [ 0.          0.01665694] | [ 0.87  0.96] | [ 0.077456]

Since variables like B can have different lengths, it is hard to collect values across time as some sort of 2d array.

In general it is easiest to focus first on loading the file into some sort of Python structure. That kind of action almost has to be written in Python, iterating line by line (unless you let pandas do it for you).

Once that's done you can reorganize the date in many different ways to suit your needs. With something this variable it doesn't make sense to aim for rectangular numpy arrays right a way.

hpaulj
  • 221,503
  • 14
  • 230
  • 353