Get a table from a print output (pandas)

Question

I ran a program called codeml, as implemented in the Python package ete3.

Here is the printed output of the model generated by codeml:

>>> print(model)
 Evolutionary Model fb.cluster_03502:
        log likelihood       : -35570.938479
        number of parameters : 23
        sites inference      : None
        sites classes        : None
        branches             : 
        mark: #0  , omega: None      , node_ids: 8   , name: ROOT
        mark: #1  , omega: 789.5325  , node_ids: 9   , name: EDGE
        mark: #2  , omega: 0.005     , node_ids: 4   , name: Sp1
        mark: #3  , omega: 0.0109    , node_ids: 6   , name: Seq1
        mark: #4  , omega: 0.0064    , node_ids: 5   , name: Sp2
        mark: #5  , omega: 865.5116  , node_ids: 10  , name: EDGE
        mark: #6  , omega: 0.005     , node_ids: 7   , name: Seq2
        mark: #7  , omega: 0.0038    , node_ids: 11  , name: EDGE
        mark: #8  , omega: 0.067     , node_ids: 2   , name: Sp3
        mark: #9  , omega: 999.0     , node_ids: 12  , name: EDGE
        mark: #10 , omega: 0.1165    , node_ids: 3   , name: Sp4
        mark: #11 , omega: 0.1178    , node_ids: 1   , name: Sp5

But since this is only printed text, I need to get this information into a table such as:

Omega       node_ids       name 
None        8              ROOT
789.5325    9              EDGE
0.005       4              Sp1
0.0109      6              Seq1
0.0064      5              Sp2
865.5116    10             EDGE
0.005       7              Seq2
0.0038      11             EDGE
0.067       2              Sp3
999.0       12             EDGE
0.1165      3              Sp4
0.1178      1              Sp5

I need this because I have to parse the information.

Do you have an idea how to handle printed output?

Thanks for your help.

Probably the best way to go is to save the output in a temp file (considering the size), then parse it? — , Nov 05 '19 at 09:08

Jonathan Scholbach · Answer 1 · 2019-11-05T09:21:17.423

There are two problems with implicit assumptions in your question:

Why print?

Why do you print the model in the first place? Printing is not a good way to access the internals of the model programmatically: the output is meant to be read by humans, and the __str__() method used for printing may omit some of the model's information. Instead, find out how the Evolutionary Model is structured, turn that structure into a dictionary, and create a dataframe from it, for example with pandas.DataFrame.from_dict.

Start by looking at model.__dict__ (an attribute, not a method; vars(model) returns the same mapping) and at repr(model).

If you can have a look at the code that defines Evolutionary Model, you can of course look up the structure of Evolutionary Model directly and turn it into a dictionary.
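A minimal sketch of that inspection step. FakeModel is a hypothetical stand-in, not the ete3 class, and its attribute names (lnL, np, branches) are assumptions made for illustration; the real object can be inspected the same way to find its actual attributes.

```python
class FakeModel:
    """Hypothetical stand-in for the real model object (not the ete3 class)."""

    def __init__(self):
        # Attribute names are assumptions made for this illustration.
        self.lnL = -35570.938479
        self.np = 23
        self.branches = {8: {"mark": "#0", "name": "ROOT"}}

    def __str__(self):
        # A human-readable summary is free to leave attributes out.
        return f"Evolutionary Model: log likelihood {self.lnL}"


model = FakeModel()

# What print(model) shows: only what __str__ chooses to include.
print(str(model))

# vars(obj) returns the instance's __dict__: attribute names mapped to values.
attributes = vars(model)
print(sorted(attributes))
```

Here the printed summary never mentions branches, while vars(model) lists it, which is the kind of omission to watch for.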

Why dataframe?

If you just want to "parse" the model, that is, to gain programmatic access to its attributes, putting everything into a dataframe is a lot of extra work. Just access the attributes directly, for instance model.branches if you want the value of the model's branches attribute.
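For example, if the branch data turns out to be a mapping from node id to a dictionary of values, it can be filtered without any dataframe. The layout below (keys omega and name) is an assumption for illustration, not ete3's documented structure; the values are taken from the printout in the question.

```python
# Hypothetical layout of model.branches: node id -> per-branch values.
branches = {
    8: {"omega": None, "name": "ROOT"},
    9: {"omega": 789.5325, "name": "EDGE"},
    4: {"omega": 0.005, "name": "Sp1"},
    12: {"omega": 999.0, "name": "EDGE"},
}

# Collect the node ids whose omega is defined and greater than 1.
high_omega = [
    node_id
    for node_id, values in branches.items()
    if values["omega"] is not None and values["omega"] > 1
]
print(high_omega)
```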

Ok I see, thank you for your time, it helped a lot :) – chippycentra Nov 05 '19 at 09:41

Georg M. · Accepted Answer · 2019-11-05T10:00:03.037


I took a look at the underlying code in model.py

It seems that you can use s = model.__str__() to obtain a string of this print-out. From there you can parse the string using standard string operations. I don't know the exact form of your string, but your code could look something like this:

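import pandas as pd

s = str(model)  # equivalent to model.__str__(): the text that print(model) shows
lines = s.split('\n')

lst = []
first_idx = 6  # Skip the six summary lines before the first branch line.
# Column names are the text before ':' in each comma-separated field.
names = [field[:field.index(':')].strip() for field in lines[first_idx].split(',')]

for line in lines[first_idx:]:
    if line.strip():  # Ignore empty or whitespace-only lines.
        # Keep the text after ':' and drop the padding and the '#' of the mark.
        row = [field[field.index(':')+1:].strip().strip("#") for field in line.split(',')]
        lst.append(row)

df = pd.DataFrame(lst, columns=names)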

There are prettier ways to do this, but it gets the job done.
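One such alternative, as a sketch rather than a definitive implementation: match each branch line with a regular expression and convert the types along the way. The sample text is abbreviated from the question (in practice it would come from str(model)), and the pattern assumes the four fields always appear in the order shown there.

```python
import re

# Sample text abbreviated from the question; in practice use s = str(model).
s = """ Evolutionary Model fb.cluster_03502:
        log likelihood       : -35570.938479
        branches             :
        mark: #0  , omega: None      , node_ids: 8   , name: ROOT
        mark: #1  , omega: 789.5325  , node_ids: 9   , name: EDGE
        mark: #2  , omega: 0.005     , node_ids: 4   , name: Sp1
"""

# One branch line: four "key: value" fields separated by commas.
BRANCH = re.compile(
    r"mark:\s*#(?P<mark>\d+)\s*,"
    r"\s*omega:\s*(?P<omega>\S+)\s*,"
    r"\s*node_ids:\s*(?P<node_ids>\d+)\s*,"
    r"\s*name:\s*(?P<name>.+?)\s*$"
)

rows = []
for line in s.splitlines():
    match = BRANCH.search(line)
    if match is None:
        continue  # Summary lines and blank lines do not match.
    row = match.groupdict()
    row["mark"] = int(row["mark"])
    row["node_ids"] = int(row["node_ids"])
    # The root has no omega; keep it as None instead of the string "None".
    row["omega"] = None if row["omega"] == "None" else float(row["omega"])
    rows.append(row)

for row in rows:
    print(row)
```

Because the pattern only matches branch lines, no hard-coded line offset is needed. The resulting list of dictionaries can be passed to pd.DataFrame(rows) if a table is still wanted.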

edited Nov 05 '19 at 10:00

answered Nov 05 '19 at 09:14


It worked like a charm, thank you! :) (PS: it should be lines = s.split('\n')) – chippycentra Nov 05 '19 at 09:40
Hmm, I have a small issue here: the names are much longer than in the example, and it seems that the method deletes part of the name. For instance, I had the name Platygaster_orseoliae in the model and I get Platygaster_orseol in the generated data frame... – chippycentra Nov 05 '19 at 09:46
I will clean it up a bit. :) – Georg M. Nov 05 '19 at 09:53

Lambda · Answer 3 · 2019-11-05T09:33:25.867


You can wrap the printed text in StringIO, read it with pandas.read_csv, and then clean each cell with applymap:

from io import StringIO
import pandas as pd

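# str(model) is the text that print(model) shows; skip its 6 summary lines.
df = pd.read_csv(StringIO(str(model)), skiprows=6, names=['mark', 'omega', 'node_ids', 'name'])
# Each cell looks like " omega: 0.005 "; keep the stripped text after the colon.
df = df.applymap(lambda x: x.split(":")[1].strip())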

Output:

    mark    omega       node_ids    name
0   #0      None        8           ROOT
1   #1      789.5325    9           EDGE
2   #2      0.005       4           Sp1
3   #3      0.0109      6           Seq1
4   #4      0.0064      5           Sp2
5   #5      865.5116    10          EDGE
6   #6      0.005       7           Seq2
7   #7      0.0038      11          EDGE
8   #8      0.067       2           Sp3
9   #9      999.0       12          EDGE
10  #10     0.1165      3           Sp4
11  #11     0.1178      1           Sp5

edited Nov 05 '19 at 09:33

answered Nov 05 '19 at 09:22


I get the error: Traceback (most recent call last): File "", line 1, in <module> TypeError: initial_value must be str or None, not Model – chippycentra Nov 05 '19 at 09:25
I think you can consider @jonathan.scholbach's answer, change my code `...StringIO(model)...` to `...StringIO(model.__repr__())...` – Lambda Nov 05 '19 at 09:29
