Python: Extracting floats from files in a complex directory tree - Are loops the answer?

Question

I have just started doing my first research project, and I have just begun programming (approximately 2 weeks ago). Excuse me if my questions are naive. I might be using python very inefficiently. I am eager to improve here.

I have experimental data that I want to analyse. My goal is to create a python script that takes the data as input, and that for output gives me graphs, where certain parameters contained in text files (within the experimental data folders) are plotted and fitted to certain equations. This script should be as generalizable as possible so that I can use it for other experiments.

I'm using the Anaconda, Python 2.7, package, which means I have access to various libraries/modules related to science and mathematics.

I am stuck at trying to use For and While loops (for the first time).

The data files are structured like this (I am using regex brackets here):

.../data/B_foo[1-7]/[1-6]/D_foo/E_foo/text.txt

What I want to do is to cycle through all the 7 top directories and each of their 6 subdirectories (named 1,2,3...6). Furthermore, within each of these 6 subdirectories, a text file can be found (always with the same filename, text.txt), which contains the data I want to access.

The 'text.txt' files are structured something like this:

 1     91.146    4.571   0.064   1.393   939.134     14.765

 2     88.171    5.760   0.454   0.029   25227.999   137.883

 3     88.231    4.919   0.232   0.026   34994.013   247.058

 4      ...       ...     ...     ...      ...         ...

The table continues down. Every other row is empty. I want to extract information from 13 rows starting from the 8th line, and I'm only interested in the 2nd, 3rd and 5th columns. I want to put them into lists 'parameter_a' and 'parameter_b' and 'parameter_c', respectively. I want to do this from each of these 'text.txt' files (of which there is a total of 7*6 = 42), and append them to three large lists (each with a total of 7*6*13 = 546 items when everything is done).

This is my attempt:

First, I made a list, 'list_B_foo', containing the seven different 'B_foo' directories (this part of the script is not shown). Then I made this:

parameter_a = []
parameter_b = []
parameter_c = []
j = 7 # The script starts reading 'text.txt' after the j:th line.
k = 35 # The script stops reading 'text.txt' after the k:th line.
x = 0
while x < 7:
    for i in range(1, 7):
        path = str(list_B_foo[x]) + '/%s/D_foo/E_foo/text.txt' % i
        m = open(path, 'r')
        line = m.readlines()
        while j < k:
            line = line[j]
            info = line.split()
            print 'info:', info
            parameter_a.append(float(info[1]))
            parameter_b.append(float(info[2]))
            parameter_c.append(float(info[5]))
            j = j + 2
    x = x + 1

parameter_a_vect = np.array(parameter_a)
parameter_b_vect = np.array(parameter_b)
parameter_c_vect = np.array(parameter_c)

print 'a_vect:', parameter_a_vect
print 'b_vect:', parameter_b_vect
print 'c_vect:', parameter_c_vect

I have tried to fiddle around with the indentation without getting it to work (I get either syntax errors or indentation errors). Currently, I get this output:

info: ['1', '90.647', '4.349', '0.252', '0.033', '93067.188', '196.142']
info: ['.']
Traceback (most recent call last):
  File "script.py", line 104, in <module>
    parameter_a.append(float(info[1]))
IndexError: list index out of range

I don't understand why I get the "list index out of range" message. If anyone knows why this is the case, I would be happy to hear you out.

How do I solve this problem? Is my approach completely wrong?

EDIT: I went for a pure while-loop solution, taking RebelWithoutAPulse and CamJohnson26's suggestions into account. This is how I solved it:

parameter_a=[]
parameter_b=[]
parameter_c=[] 
k=35 # The script stops reading 'text.txt' after the k:th line.
x=0
while x < 7:
    y=1
    while y < 7:
        j=7 
        path = str(list_B_foo[x]) + '/%s/pdata/999/dcon2dpeaks.txt' % (y)
        m = open(path, 'r')
        lines = m.readlines()
        while j < k:
            line = lines[j]
            info = line.split()
            parameter_a.append(float(info[1]))
            parameter_b.append(float(info[2]))
            parameter_c.append(float(info[5]))
            j = j+2
        y = y+1 
    x = x+1

Meta: I am not sure if I should give the answer to the person who answered the quickest and who helped me finish my task, or to the person whose answer I learned the most from. I am sure this is a common issue that I can find an answer to by reading the rules or going to Stackexchange Meta. Until I've read up on the recommendations, I will hold off on marking the question as answered by either of you two.

score 1 · Answer 1 · answered Jun 20 '16 at 20:05

Looks like you are overwriting the line array with the first line of the file. You call line = m.readlines(), which sets line equal to an array of lines. You then set line = line[j], so now the line variable is no longer an array, it's a string equal to

1     91.146    4.571   0.064   1.393   939.134     14.765

The first pass through the loop works fine, but the next pass treats line as a sequence of characters and takes the character at index j, which is just a period, and assigns it back to line. That explains why the info variable only has one element on the second pass through the loop.
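A minimal sketch of that rebinding, assuming the row layout shown in the question (one leading space before the row number). The header lines are placeholders made up for illustration, and the code is written so that it runs on both Python 2.7 and Python 3:

```python
# Stand-in for m.readlines(): seven placeholder header lines, then one data row.
# The row is copied from the question; the header text is made up.
sample_row = " 1     91.146    4.571   0.064   1.393   939.134     14.765\n"
line = ["header\n"] * 7 + [sample_row, "\n"]

j = 7
# First pass: `line` is a list, so line[j] is a whole row and split() gives 7 fields.
line = line[j]
first_info = line.split()
j = j + 2

# Second pass: `line` is now a string, so line[j] is a single character.
line = line[j]
second_info = line.split()

print(first_info)
print(second_info)
```

With this row layout the character at index 9 is the decimal point of the second column, which matches the `['.']` seen in the traceback output.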

To solve this, just use 2 line variables instead of one. Call one lines and the other line.

    lines = m.readlines()
    while j < k:
        line = lines[j]
        info = line.split()

May be other errors too but that should get you started.

Thank you, this helped. Now I get no errors. Strangely enough, the script only seems to go through the first 'text.txt' (at .../data/B_foo1/1/D_foo/E_foo/text.txt), and if i print the lengths of the lists/arrays like this: print 'Length of c_vect:', len(parameter_c_vect) I get this: Length of c_vect: 13 I will look at this more tomorrow. Thank you for your help! Very clear answer. — Lucubrator, Jun 20 '16 at 20:20
Yeah easy fix, just need to reset your j variable on each pass. move the j=7 line to directly below the while statement. Glad to help, if you could mark as answer I'd appreciate it! — CamJohnson26, Jun 20 '16 at 20:24
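As a rough sketch of why the counter has to be reset, here is the same while-loop structure run over in-memory stand-ins for the files. The function name `collect_second_column` and the numbers in `fake_files` are hypothetical and only serve the illustration:

```python
def collect_second_column(files, start, stop):
    """Collect the 2nd column from every other line of each list of lines."""
    values = []
    x = 0
    while x < len(files):
        lines = files[x]
        j = start  # reset for every file; without this only the first file is read
        while j < stop:
            values.append(float(lines[j].split()[1]))
            j = j + 2
        x = x + 1
    return values


# Two fake "files": data on even indices, empty lines in between (made-up values).
fake_files = [
    ["1 10.0\n", "\n", "2 11.0\n", "\n"],
    ["1 20.0\n", "\n", "2 21.0\n", "\n"],
]
result = collect_second_column(fake_files, 0, 4)
print(result)
```

If `j = start` sat above the outer loop instead, `j` would already equal `stop` when the second file is reached, so the inner loop would be skipped for every file after the first.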

score 1 · Accepted Answer · edited May 23 '17 at 12:10

Welcome to Stack Overflow!

The error is due to a name collision that you have inadvertently created. Note the output before the exception occurs:

info: ['1', '90.647', '4.349', '0.252', '0.033', '93067.188', '196.142']
info: ['.']
Traceback (most recent call last):
...

The expression info[1] fails because there is no element at index 1 in a list containing only '.' (Python lists are indexed from 0, so a one-element list only has index 0).

This happens in your nested loop,

while j < k:

where you rebind the very name, line, that holds the list of lines you just read:

line = m.readlines()
while j < k:
    line = line[j]
    info = line.split()
    ...

So what happens is that on the first run of the loop, you read the lines of the file into the list line, then you take one line from that list, assign it to line again, and continue with the loop. At this point line contains a string.

On the next run, indexing line reads the character at position j of that string instead of a whole row, and the code malfunctions.

You could fix this with different naming.
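One possible way to do that, sketched with a for loop and two distinct names (`lines` for the list, `fields` for one split row). The helper `extract_columns` is hypothetical, the column indices 1, 2 and 5 follow the question's code, and the rows are taken from the question's sample:

```python
def extract_columns(lines, start, stop, step=2):
    """Return three lists of floats taken from indices 1, 2 and 5 of the selected rows."""
    col_a, col_b, col_c = [], [], []
    for index in range(start, stop, step):
        fields = lines[index].split()  # `lines` itself is never rebound
        col_a.append(float(fields[1]))
        col_b.append(float(fields[2]))
        col_c.append(float(fields[5]))
    return col_a, col_b, col_c


# In-memory stand-in for readlines(): data rows separated by empty lines.
rows = [
    " 1     91.146    4.571   0.064   1.393   939.134     14.765\n",
    "\n",
    " 2     88.171    5.760   0.454   0.029   25227.999   137.883\n",
    "\n",
]
a, b, c = extract_columns(rows, 0, 4)
print(a)
print(b)
print(c)
```

Using `range(start, stop, step)` also removes the manual counter, so there is no `j` to forget to reset between files.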

P.S. I would suggest using the with ... as ... syntax when working with files. It is briefly described here. This is called a context manager, and it takes care of closing the file for you once the block ends.
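A small sketch of the with-statement, assuming a throwaway file created in a temporary directory so that the example runs on its own. The helper `read_lines` and the file contents are made up for illustration:

```python
import os
import shutil
import tempfile


def read_lines(path):
    # The file is closed automatically when the with-block exits,
    # even if an exception is raised inside it.
    with open(path, 'r') as handle:
        return handle.readlines()


# Create a temporary file to read back (contents are illustrative only).
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, 'text.txt')
with open(tmp_path, 'w') as handle:
    handle.write(" 1     91.146    4.571\n\n 2     88.171    5.760\n")

lines = read_lines(tmp_path)
print(len(lines))

# Remove the temporary directory again.
shutil.rmtree(tmp_dir)
```

In the question's script, `open(path, 'r')` followed by `readlines()` could be swapped for a call like `read_lines(path)`, so that none of the 42 files is left open.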

P.P.S. I would also suggest reading the naming conventions

Thank you for your help and for really taking your time to answer. Much appreciated. Very good suggestions, taking into account my level of expertise and knowledge. I have taken a quick look at the naming conventions; learning them seems to be a good choice and I will definitely go back to them when I have the time. PS: Read my edit in the original question. — Lucubrator, Jun 21 '16 at 08:44
