
I am trying to extract some information from a set of files sent to me by a collaborator. Each file contains some Python code that defines a sequence of arrays. They look something like this:

#PHASE = 0
x = np.array([1,2,...])
y = np.array([3,4,...])
z = np.array([5,6,...])

#PHASE = 30
x = np.array(1,4,...)
y = np.array(2,5,...)
z = np.array(3,6,...)

#PHASE = 40
...

And so on. There are 12 files in total, each with 7 phase sets. My goal is to convert each phase into its own file which can then be read by ascii.read() as a Table object for manipulation in a different section of code.

My current method is extremely inefficient, both in terms of resources and time/energy required to assemble. It goes something like this: Start with a function

def makeTable(a, b, c):
    output = Table()
    output['x'] = a
    output['y'] = b
    output['z'] = c
    return output

Then for each phase, I have manually copy-pasted the relevant part of the text file into a cell and appended a line of code

fileName_phase = makeTable(a,b,c)

Repeat ad nauseam. It would take 84 iterations of this to process all the data, and naturally each would need some minor adjustments to match the specific fileName and phase.

Finally, at the end of my code, I have a few lines of code set up to ascii.write each of the tables into .dat files for later manipulation.
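
To be concrete, that last step is just astropy's ascii.write applied to each table. A minimal stand-alone sketch with a made-up table (not my actual code, just the shape of it):

import numpy as np
from astropy.io import ascii
from astropy.table import Table

# a made-up table standing in for one file/phase combination
demo = Table()
demo['x'] = np.array([1, 2, 3])
demo['y'] = np.array([4, 5, 6])
demo['z'] = np.array([7, 8, 9])

# one call like this per table, so ascii.read() can load it again later
ascii.write(demo, 'fileName_phase.dat')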

This entire method is extremely exhausting to set up. If it's the only way to handle the data, I'll do it. I'm hoping I can find a quicker way to set it up, however. Is there one you can suggest?

Izzy
  • To avoid such downvotes in the future, make sure you share the code you've tried and explain why it's not working. Otherwise it's hard to help, and above all it sounds like you haven't done anything and just want us to write code for you. Please share your code and I'll remove the DV and try to help; I'm sure once we see your code we'll be able to tell you how to modify it to avoid duplicating it nearly 100 times. – Julien Mar 09 '17 at 05:32
  • Sorry for the snappy response to your reply. Maybe I'm just not used to communicating with coders about code. I'm a physics student, I've only taken one class on coding in the last 10 years... Is this more clear? – Izzy Mar 09 '17 at 05:59
  • Is it np.array([1,2,..]) instead of np.array(1,2, ....)? Because that makes more sense. – Ajay Brahmakshatriya Mar 09 '17 at 06:41
  • Yes. Sorry. Shorthand code. – Izzy Mar 09 '17 at 17:27

3 Answers


If efficiency and code reuse instead of copy-pasting is the goal, I think that classes might provide a good way. I'm going to sleep now, but I'll edit later. Here are my thoughts: create a class called FileWithArrays and use a parser to read the lines and store them inside the FileWithArrays object you create. Once that's done, you can add a method to turn the object into a table.

P.S. A good idea for the parser is to store all the lines in a list and parse them one by one, using list.pop() to shrink the list as you go (a rough sketch of the whole idea is below). Hope it helps; tomorrow I'll look at it more if this isn't enough. Try to rewrite/reformat the question if I misunderstood anything, it's not very easy to read.
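
A rough sketch of what I have in mind (the class layout, the use of astropy's Table, and the assumption that every right-hand side literally looks like np.array([...]) are my own guesses, not a finished design):

import numpy as np
from astropy.table import Table

class FileWithArrays:
    """Holds the arrays of every phase found in one file."""

    def __init__(self, path):
        self.phases = {}              # phase label -> {'x': array, 'y': array, 'z': array}
        self._parse(path)

    def _parse(self, path):
        with open(path) as f:
            lines = f.read().splitlines()
        current = None
        while lines:                  # pop() shrinks the list as we go
            line = lines.pop(0).strip()
            if not line:
                continue
            if line.startswith('#PHASE'):
                current = line.split('=')[1].strip()
                self.phases[current] = {}
            elif current is not None and '=' in line:
                var, arr = [s.strip() for s in line.split('=', 1)]
                # assumes the right-hand side looks exactly like np.array([...])
                self.phases[current][var] = np.array(
                    [float(v) for v in arr[10:-2].split(',')])

    def table(self, phase):
        """Turn one phase's arrays into an astropy Table."""
        output = Table()
        for var, arr in self.phases[phase].items():
            output[var] = arr
        return output

You would then build one FileWithArrays per input file, loop over its phases, and hand each table to ascii.write.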

Mikael
  • My (superficial, non-majors) class on Python coding didn't include much discussion of classes. Can you point me towards a good resource so I can teach myself? – Izzy Mar 09 '17 at 06:05
  • Maybe this will help: http://stackoverflow.com/questions/3724110/practical-example-of-polymorphism ... If I understand the thoughts in this answer, he is proposing creating a subclass for each slightly different case; then the subclass contains the special case and the superclass handles the general flow and parsing. – tripleee Mar 09 '17 at 06:10

I will suggest a way which will be scorned by many but will get your work done.

So apologies to everyone.

The prerequisite for this method is that you absolutely trust the correctness of the input files, which I guess you do (after all, they come from your collaborator).

So the key point here is that the text in the file is code, which means it can be executed.

So you can do something like this:

import re
import numpy as np # this is for the actual code in the files. You might have to install numpy library for this to work.
file = open("xyz.txt")
content = file.read()

Now that you have all the content, you have to separate it by phase. For this we will use the re.split function.

phase_data = re.split(r"#PHASE = .*\n", content)

Now we have the content of each phase as a separate string in a list. (The chunk before the first #PHASE header is empty, which is why the loop below skips blank entries.)

Now comes the part where we execute it:

for phase in phase_data:
    if len(phase.strip()) == 0:
        continue
    exec(phase)
    table = makeTable(x, y, z) # the x, y and z are defined by the exec. 
    # do whatever you want with the table.

I will reiterate that you have to absolutely trust the contents of the file, since you are executing them as code.

But your task seems like a scripting one, and I believe this will get your work done.

P.S.: The other, "safer" alternative to exec is to use a sandboxing library which takes the string and executes it without affecting the parent scope.
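
Putting it all together for all 12 files might look roughly like this. The glob pattern, the phase-label handling and the output file naming are just my guesses; passing an explicit namespace dict to exec keeps the defined x, y, z contained (and makes it work inside a function too), but note that it is not a security sandbox:

import glob
import re
import numpy as np
from astropy.io import ascii
from astropy.table import Table

def makeTable(a, b, c):
    output = Table()
    output['x'] = a
    output['y'] = b
    output['z'] = c
    return output

for path in glob.glob("data/*.txt"):          # wherever the 12 files live
    with open(path) as f:
        content = f.read()

    # phase labels ('0', '30', ...) and the code chunks that follow each header
    labels = re.findall(r"#PHASE = (\S+)", content)
    chunks = [c for c in re.split(r"#PHASE = .*\n", content) if c.strip()]

    for label, chunk in zip(labels, chunks):
        ns = {"np": np}                       # run each chunk in its own namespace
        exec(chunk, ns)
        table = makeTable(ns["x"], ns["y"], ns["z"])
        ascii.write(table, "{}_phase{}.dat".format(path.rsplit(".", 1)[0], label))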

Ajay Brahmakshatriya

To avoid the safety issue of using exec as suggested by @Ajay Brahmakshatriya, but keeping his first processing step, you can create your own minimal 'phase parser', something like:

VARS = 'xyz'

def makeTable(phase):
    # 'phase' is one chunk produced by re.split, i.e. a multi-line string
    lines = [l for l in phase.splitlines() if l.strip()]
    assert len(lines) >= 3
    output = Table()
    for i in range(3):
        line = [s.strip() for s in lines[i].split('=')]
        assert len(line) == 2
        var, arr = line
        assert var == VARS[i]
        assert arr[:10] == 'np.array([' and arr[-2:] == '])'
        output[var] = np.fromstring(arr[10:-2], sep=',')
    return output

and then call

table = makeTable(phase)

instead of

exec(phase)
table = makeTable(x, y, z)

You could also skip all these assert statements without compromising safety; if the file is corrupted or not formatted as expected, the error that gets thrown might just be harder to understand...
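
For completeness, a rough driver tying this parser to the re.split step from the other answer (assuming the makeTable above and the numpy/astropy Table imports are already in scope; the input and output file names are placeholders):

import re
from astropy.io import ascii

with open("xyz.txt") as f:                          # placeholder input file
    content = f.read()

phases = [p for p in re.split(r"#PHASE = .*\n", content) if p.strip()]
for n, phase in enumerate(phases):
    table = makeTable(phase)                        # the parser defined above
    ascii.write(table, "xyz_phase{}.dat".format(n)) # placeholder output naming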

Julien