
I do a lot of data analysis in Perl, and I am trying to replicate this work in Python using pandas, NumPy, matplotlib, etc.

The general workflow goes as follows:

1) glob all the files in a directory

2) parse the files because they have metadata

3) use regex to isolate relevant lines in a given file (They usually begin with a tag such as 'LOOPS')

4) split the lines that match the tag and load data into hashes

5) do some data analysis

6) make some plots

Here is a sample of what I typically do in perl:

print"Reading File:\n";                              # gets data
foreach my $vol ($SmallV, $LargeV) {
  my $base_name = "${NF}flav_${vol}/BlockedWflow_low_${vol}_[0-9].[0-9]_-0.25_$Mass{$vol}.";
  my @files = <$base_name*>;                         # globs for file names
  foreach my $f (@files) {                           # loops through matching files
    print"... $f\n";
    my @split = split(/_/, $f);
    my $beta = $split[4];
    if (!grep{$_ eq $beta} @{$Beta{$vol}}) {         # constructs Beta hash
      push(@{$Beta{$vol}}, $split[4]);
    }
    open(IN, "<", "$f") or die "cannot open < $f: $!"; # reads in the file
    chomp(my @in = <IN>);
    close IN;
    my @lines = grep{$_=~/^LOOPS/} @in;       # greps for lines with the header LOOPS
    foreach my $l (@lines) {                  # loops through matched lines
      my @split = split(/\s+/, $l);           # splits matched lines
      push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]);# reads data into hash
      if (!grep{$_ eq $split[1]} @smearingt) {# fills the smearing time array
        push(@smearingt, $split[1]);
      }
      if (!grep{$_ eq $split[4]} @{$block{$vol}}) {# fills the number of blockings
        push(@{$block{$vol}}, $split[4]);
      }
    }
  }
  foreach my $beta (@{$Beta{$vol}}) {
    foreach my $loop (0,1,2,3,4) {         # loops over observables
      foreach my $b (@{$block{$vol}}) {    # beta values
        foreach my $t (@smearingt) {       # and smearing times
          $avg{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::avg(@{$val{$vol}{$beta}{$t}{$loop}{$b}});     # to find statistics
          $err{$vol}{$beta}{$t}{$loop}{$b} = stat_mod::stdev(@{$val{$vol}{$beta}{$t}{$loop}{$b}});
        }
      }
    }
  }
}
print"File Read in Complete!\n";

My hope is to load this data into a hierarchically indexed data structure, with the keys of the Perl hash becoming the indices of my Python data structure. Every example of pandas data structures I have come across so far has been highly contrived: the whole structure (indices and values) is assigned manually in one command and then manipulated to demonstrate all the features of the data structure. Unfortunately I cannot assign the data all at once, because I don't know in advance what masses, betas, sizes, etc. are in the data to be analyzed. Am I doing this the wrong way? Does anyone know a better way of doing this? The data files are immutable, and I will have to parse them using regexes, which I understand how to do. What I need help with is putting the data into an appropriate data structure so that I can take averages, compute standard deviations, perform mathematical operations, and plot the data.

Typical data has a header that is an unknown number of lines long but the stuff I care about looks like this:

Alpha 0.5 0.5 0.4
Alpha 0.5 0.5 0.4
LOOPS 0 0 0 2 0.5 1.7800178
LOOPS 0 1 0 2 0.5 0.84488326
LOOPS 0 2 0 2 0.5 0.98365135  
LOOPS 0 3 0 2 0.5 1.1638834
LOOPS 0 4 0 2 0.5 1.0438407
LOOPS 0 5 0 2 0.5 0.19081102
POLYA NHYP 0 2 0.5 -0.0200002 0.119196 -0.0788721 -0.170488 
BLOCKING COMPLETED
Blocking time 1.474 seconds
WFLOW 0.01 1.57689 2.30146 0.000230146 0.000230146 0.00170773 -0.0336667
WFLOW 0.02 1.66552 2.28275 0.000913101 0.00136591 0.00640552 -0.0271222
WFLOW 0.03 1.75 2.25841 0.00203257 0.00335839 0.0135 -0.0205722
WFLOW 0.04 1.83017 2.22891 0.00356625 0.00613473 0.0224607 -0.0141664
WFLOW 0.05 1.90594 2.19478 0.00548695 0.00960351 0.0328218 -0.00803792
WFLOW 0.06 1.9773 2.15659 0.00776372 0.0136606 0.0441807 -0.00229793
WFLOW 0.07 2.0443 2.1149 0.010363 0.018195 0.0561953 0.00296648

What I think I want (I say "think" because I am new to Python, and an expert may know a better data structure) is a hierarchically indexed Series that would look like this:

volume   mass   beta   observable   t   value

1224     0.0    5.6    0            0   1.234
                                    1   1.490
                                    2   1.222
                       1            0   1.234
                                    1   1.234
2448     0.0    5.7    0            1   1.234

and so on like this: http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-hierarchical
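For what it's worth, a Series like that can be assembled incrementally: accumulate each index level in a plain Python list while parsing, then build the MultiIndex at the end, so nothing needs to be known in advance. A minimal sketch with made-up values:

```python
import pandas as pd

# accumulate index levels and values while parsing; the index values
# need not be known ahead of time
vols, betas, obs, ts, vals = [], [], [], [], []
for row in [(1224, 5.6, 0, 0, 1.234),
            (1224, 5.6, 0, 1, 1.490),
            (2448, 5.7, 0, 1, 1.222)]:       # stand-in for the parsing loop
    vols.append(row[0])
    betas.append(row[1])
    obs.append(row[2])
    ts.append(row[3])
    vals.append(row[4])

index = pd.MultiIndex.from_arrays([vols, betas, obs, ts],
                                  names=['volume', 'beta', 'observable', 't'])
s = pd.Series(vals, index=index)

# group-wise statistics then come almost for free, e.g.:
means = s.groupby(level=['volume', 'beta', 'observable']).mean()
```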

For those of you who don't understand the Perl:

The meat and potatoes of what I need is this:

push(@{$val{$vol}{$beta}{$split[1]}{$split[2]}{$split[4]}}, $split[6]);# reads data into hash

What I have here is a hash called 'val'. It is a hash of arrays; in Python-speak this would be a dict of lists. Each thing that looks like '{$something}' is a key into 'val', and I am appending the value stored in $split[6] to the end of the array selected by all five keys. This is the fundamental issue with my data: there are a lot of keys for each quantity that I am interested in.
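The closest Python analogue of that autovivified hash-of-arrays is a collections.defaultdict(list) keyed by a tuple: the five nested hash keys collapse into one tuple key, which is exactly the shape a pandas MultiIndex can be built from later. A sketch using the same field positions as the Perl (the example values are invented):

```python
from collections import defaultdict

val = defaultdict(list)  # dict of lists; appending autovivifies, as in Perl

def record(vol, beta, fields):
    # fields is a LOOPS line already split on whitespace;
    # fields[1], fields[2], fields[4] are keys, fields[6] is the value
    val[(vol, beta, fields[1], fields[2], fields[4])].append(float(fields[6]))

record('1224', '5.6', 'LOOPS 0 3 0 2 0.5 1.1638834'.split())
```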

==========

UPDATE

I have come up with the following code which results in this error:

Traceback (most recent call last):
  File "wflow_2lattice_matching.py", line 39, in <module>
    index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
NameError: name 'MultiIndex' is not defined

Code:

#!/usr/bin/python

from pandas import Series, DataFrame
import pandas as pd
import glob
import re
import numpy

flavor = 4
mass = 0.0

vol = []
b = []
m_t = []
w_t = []
val = []

#tup_vol = (1224, 1632, 2448)
tup_vol = 1224, 1632
for v in tup_vol:
  filelist = glob.glob(str(flavor)+'flav_'+str(v)+'/BlockedWflow_low_'+str(v)+'_*_0.0.*')
  for filename in filelist:
    print 'Reading filename:  '+filename
    f = open(filename, 'r')
    junk, start, vv, beta, junk, mass, mont_t = re.split('_', filename)
    ftext = f.readlines()
    for line in ftext:
      if re.match('^WFLOW.*', line):
        line=line.strip()
        junk, smear_t, junk, junk, wilson_flow, junk, junk, junk = re.split('\s+', line)
        vol.append(v)
        b.append(beta)
        m_t.append(mont_t)
        w_t.append(smear_t)
        val.append(wilson_flow)
zipped = zip(vol, b, m_t, w_t)   # the accumulated list is b, not the loop variable beta
index = MultiIndex.from_tuples(zipped, names=['volume', 'beta', 'montecarlo_time, smearing_time'])
data = Series(val, index=index)
deltap
  • Can you give an example of the data format you're dealing with? – BrenBarn Nov 13 '12 at 02:22
  • 1
    The goggles, they do nothing!! – synthesizerpatel Nov 13 '12 at 02:29
  • I added sample data in an edit because it wouldn't fit in a comment. – deltap Nov 13 '12 at 02:32
  • Okay, and what is it you want? Unfortunately your perl code is not that much help to a Pythonista like me. Given the sample data that you have, what sort of relationships are you hoping to extract, and what computations do you want to do? I assume the numbers that constitute your actual data are those contained in the lines beginning with "LOOPS" and "WFLOW", but how are they connected? Are those supposed to be two columns of a resulting data structure, or what? – BrenBarn Nov 13 '12 at 02:50
  • Yes, I just made the necessary edit. I'm new to the site as well...sorry for the junk post in the comments before. – deltap Nov 13 '12 at 03:02
  • I'm having a heck of time figuring out how you got that result from that data. Is that the data *after* you've already computed your averages and so on? If so, it might be good to back up a bit. You have some sort of format in your file that's not obviously tabular. You showed the final output, but are you at some intermediate point creating a tabular structure from that? Pandas deals with tabular data. If your data can be massaged into tabular form, then you can calculate your averages within Pandas. Can you provide a verbal description of the computations you want to do on your data? – BrenBarn Nov 13 '12 at 03:08
  • Your description still makes reference to terms like "beta" that do not appear in your sample data. Can you give a description something like "I take all the lines that start with LOOP, and for each one I create an array, and for each of those arrays. . ."? I still can't understand from your post what the relationship is among the various numbers in your sample data. – BrenBarn Nov 13 '12 at 03:19
  • @BrenBarn The data that I showed there is an input file. You are right it is not tabular. Each file contains metadata information in the file name such as the mass and beta which are keys in my hash. Then I look for lines in the file that start with LOOPS for instance. These lines add additional keys to the hash, in this case three of the entries on the line end up as keys and one of the elements ends up as the value. Once I get all the files read in I find the average and standard deviation of each hash element (remember each hash element is an array of numbers) and store this in a hash. – deltap Nov 13 '12 at 03:29
  • I'm getting the impression that the example output you gave, then, doesn't correspond to the input you gave (because if the sample input is all from one file then it would only have one beta, but your example output shows more than one beta). Can you give example input and the desired output that would come *from that input alone*? – BrenBarn Nov 13 '12 at 03:48
  • @BrenBarn There are hundreds of files. Some files have the same mass, beta, volume, others have different mass, beta, volumes. The mass, volume, beta are all run parameters. Each line that starts with LOOPS corresponds to a measurement. The hash keys coming from the file name have information about the run. The hash keys coming from the lines inside the files have information about what was being measured. – deltap Nov 13 '12 at 04:30
  • @DeltaP: Okay, but you will have to provide an actual example in order for me (and probably others) to understand how you're parsing the data. Can you come up with a toy example with inputs and corresponding outputs that illustrate the process? – BrenBarn Nov 13 '12 at 04:49

3 Answers


You are getting the following:

NameError: name 'MultiIndex' is not defined

because you are not importing MultiIndex directly when you import Series and DataFrame.

You have -

from pandas import Series, DataFrame

You need -

from pandas import Series, DataFrame, MultiIndex

or you can instead refer to it as pd.MultiIndex, since you are importing pandas as pd.
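For example, with the qualified name (index values invented for illustration; note that the names list in your traceback also has a misplaced quote, so 'montecarlo_time, smearing_time' is a single string there):

```python
import pandas as pd

zipped = list(zip([1224, 1224], ['5.6', '5.6'], [100, 200], [0.01, 0.02]))
index = pd.MultiIndex.from_tuples(
    zipped, names=['volume', 'beta', 'montecarlo_time', 'smearing_time'])
data = pd.Series([1.57689, 1.66552], index=index)
```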

Ryan O'Neill

Hopefully this helps you get started?

import sys, os

def regex_match(line) :
  return 'LOOPS' in line

my_hash = {}

for fd in os.listdir(sys.argv[1]):            # for each file in this directory
  for line in open(sys.argv[1] + '/' + fd):   # get each line of the file
    if regex_match(line):                     # if it's a line I want
      fields = line.rstrip('\n').split()      # split out the data (the sample data is space-separated)
      my_hash[fields[1]] = fields[2]          # store the data

for key in my_hash : # data science can go here?
  do_something(key, my_hash[key] * 12)

# plots

P.S. Make the first line

#!/usr/bin/python

(or whatever `which python` returns) so the script can run as an executable.

Austin
  • Thank you for the response. This is reading the data into a regular hash. Is there a way to read it into the data structure I want? – deltap Nov 13 '12 at 02:51
  • @DeltaP: What is the data structure you want? – BrenBarn Nov 13 '12 at 02:53
  • @BrenBarn, he is interested in this tool: http://pandas.pydata.org/, but I don't know enough Perl to readily parse out what he wants. DeltaP, can you describe the data structure more precisely in English? What's wrong with just casting the Python hash to a pandas structure? – Austin Nov 13 '12 at 02:57
  • @Austin Is casting the Python hash into a pandas structure the standard way of populating a hash with data parsed from text files? If it is then I will attempt that. I was hoping there would be a direct way to enter the data into the pandas structure as I parse the files. – deltap Nov 13 '12 at 03:04
  • @DeltaP: It depends on the format of your data. In pandas, it is generally inefficient to gradually add one row at a time to a data structure. In general, in pandas you either read your data directly from a tabular format (e.g., CSV) or you populate it from a more basic Python type like a list or dict. – BrenBarn Nov 13 '12 at 03:13
  • @Austin I have added what I hope is a reasonable translation of the relevant perl for people who are not familiar with it. – deltap Nov 13 '12 at 03:22
  • @BrenBarn Do you know how to handle the fact there are so many keys in the construction of the eventual data structure? Should I make a list for each key and all the values and then form a hierarchical indexed series out of those lists and values? – deltap Nov 13 '12 at 03:22
  • @DeltaP: The problem is that I don't understand how those keys are being derived from the sample data you have provided. It's fine to have many keys to select the data, but I don't get where those keys are coming from. – BrenBarn Nov 13 '12 at 03:24
  • @BrenBarn some of the keys are in the file name, some of the keys are coming off the line that was matched because it started with LOOPS. – deltap Nov 13 '12 at 03:32

To glob your files, use the built-in glob module in Python.

To read your files after globbing them, use the read_csv function, which you can import with from pandas.io.parsers import read_csv.

As for the MultiIndex feature of the pandas DataFrame you instantiate after read_csv, you can use it to organize your data and slice it any way you want.
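read_csv is not limited to comma-separated files. For whitespace-delimited lines like the WFLOW rows, something along these lines might work (the column names here are guesses, not the real quantities):

```python
import io

import pandas as pd

# stand-in for a file already filtered down to its WFLOW lines
text = """WFLOW 0.01 1.57689 2.30146
WFLOW 0.02 1.66552 2.28275
"""
df = pd.read_csv(io.StringIO(text), sep=r'\s+', header=None,
                 names=['tag', 't', 'c1', 'c2'])
```

In practice you would first filter the file down to the tagged lines you want, since read_csv expects a uniform table.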

3 pertinent links for your reference.

Calvin Cheng
  • My data is read from a plain text file not a csv. You seem to know how this works and maybe could explain what I am doing wrong in the snippet below the Update in my post. Thanks. – deltap Nov 13 '12 at 21:10
  • path.py is even easier to use than glob. – Paulo Scardine Nov 15 '12 at 04:52