Parse prettyprinted tabular data with pandas

Question

What is the best way to copy a table that contains different delimeters, spaces in column names etc. The function pd.read_clipboard() cannot manage this task on its own.

Example 1:

| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
| 31-36        | 3 | 5  | 5  | 2 |
| 36-41        | 2 | 4  | 1  | 7 |
| 41-46        | 0 | 1  | 3  | 2 |
| 46-51        | 0 | 0  | 2  | 3 |

Expected result:

 Age Category  A  B   C   D    
 21-26         2  2   4   1 
 26-31         7  11  12  5 
 31-36         3  5   5   2 
 36-41         2  4   1   7 
 41-46         0  1   3   2 
 46-51         0  0   2   3

EDIT:

Example 2:

+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+

Expected result:

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

I look for a universal approach that can be applied to the most common table types.

Does this help? [how to copy paste df from so into python](https://stackoverflow.com/questions/31610889/how-to-copy-paste-dataframe-from-stack-overflow-into-python) — ldz, Dec 06 '19 at 10:52
I've seen others use `StringIO` but I'm also curious. Thanks for the good question — Umar.H, Dec 06 '19 at 11:31
@Datanovice `pd.read_clipboard('|').dropna(how='all',axis=1).iloc[1:].reset_index(drop=True)` ? — anky, Dec 09 '19 at 15:48
thanks @anky_91 the code isn't really an issue, I was wondering if there was a better method for dealing with String tables such as this without the need for writing code as above, but as jorijnsmit pointed out it's more for a visual representation. — Umar.H, Dec 09 '19 at 16:06

gosuto · Answer 1 · 2019-12-09T17:48:41.810

The reason this is so complicated is that these type of ASCII tables or not really designed with data transfer in mind. Their true function is to depict the data in a visually pleasing way.

This doesn't mean it is not possible to use it to transfer into pandas! Let's start with .read_clipboard():

df = pd.read_clipboard(sep='|').iloc[1:,1:-1]

Instead of using a comma as the (default) separator we define | to be the separator.

The .iloc[1:,1:-1] gets rid of the first row (-----------) and the first and last columns: because of the trailing | at the beginning and end of each line pandas sees an 'empty' column there.

Now all that is left is to strip whitespace from the column names and values:

stripped_columns = []
for column_name in df.columns:
    df[column_name] = df[column_name].str.strip()
    stripped_columns.append(column_name.strip())
df.columns = stripped_columns

And if you want Age Category to be your index:

df.set_index('Age Category', inplace=True)

Last pass I would make would be to make sure all your columns are now actually holding numbers and not strings:

df = df.astype('int')

Resulting in:

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 21-26 to 46-51
Data columns (total 4 columns):
A    6 non-null int64
B    6 non-null int64
C    6 non-null int64
D    6 non-null int64
dtypes: int64(4)
memory usage: 400.0+ bytes

I am not sure what your reason is for reading it from the clipboard. A bit more elegant solution might be to paste it into a .csv file and use the more advanced features .read_csv() has to offer. The necessary transformations however would remain the same.

why is the `----` so common in all tables if can't be read directly by either SQL or Python without the need for additional processing? — Umar.H, Dec 09 '19 at 15:50
I would say because the true purpose of it is not to transfer data from one environment to another but to give a visual depiction of it. — gosuto, Dec 09 '19 at 15:58

score 7 · Answer 2 · answered Dec 13 '19 at 11:39

Here is another potential solution using re.sub and io.StringIO :

from io import StringIO
import re

text1 = """
| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
| 31-36        | 3 | 5  | 5  | 2 |
| 36-41        | 2 | 4  | 1  | 7 |
| 41-46        | 0 | 1  | 3  | 2 |
| 46-51        | 0 | 0  | 2  | 3 |
"""

text2= """
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+
"""

df1 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', '  ', text1)), sep='\s{2,}', engine='python')
df2 = pd.read_csv(StringIO(re.sub(r'[|+]|-{2,}', '  ', text2)), sep='\s{2,}', engine='python')

[out]

df1

  Age Category  A   B   C  D
0        21-26  2   2   4  1
1        26-31  7  11  12  5
2        31-36  3   5   5  2
3        36-41  2   4   1  7
4        41-46  0   1   3  2
5        46-51  0   0   2  3

df2

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

score 4 · Accepted Answer · answered Dec 10 '19 at 07:33

One option is to bite the bullet and just preprocess your data. This isn't all that bad, there's only so many cases pd.read_csv can handle in its arguments, and if you want to be exhaustive with the cases you handle you'll eventually end up turning to regex.

To handle most of the common cases of prettyprinted tables, I'd just write a loop to filter out/replace characters in lines, then read in the output using a relatively simpler read_csv call.

import os 

def load(filename):
    with open(filename) as fin, open('temp.txt', 'w') as fout:
        for line in fin:
            if not line.strip()[:2] in {'|-', '+-'}: # filter step
                fout.write(line.strip().strip('|').replace('|', ',')+'\n')

    df = pd.read_csv('temp.txt', sep=r'\s*,\s*', engine='python')
    os.unlink('temp.txt') # cleanup

    return df

df1 = load('data1.txt')
df2 = load('data2.txt')

df1

  Age Category  A   B   C
0        21-26  2   2   4
1        26-31  7  11  12
2        31-36  3   5   5
3        36-41  2   4   1
4        41-46  0   1   3
5        46-51  0   0   2

df2

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

@Datanovice No worries. Did you mean to award the bounty to the other answer? I don't mind either way, just curious. I know this isn't a perfect answer, then again I don't believe this question has one. — cs95, Dec 10 '19 at 19:19
my sincere apologies if I caused any offence, my intention was to have more of an explanation your answer is the correct one but I wanted to know more about why these sort of tables exists or if there was a method to read them that I was ignorant of. — Umar.H, Dec 10 '19 at 22:05
@Datanovice Re:why these tables exist they're likely SQL or Spark table dumps, or they're [`tabulate` prettyprinted tables](https://stackoverflow.com/questions/18528533/pretty-printing-a-pandas-dataframe). — cs95, Dec 11 '19 at 06:06

Jim Eisenberg · Answer 4 · 2019-12-11T09:02:50.237

For this type of table, you can simply use:

df = pd.read_clipboard(sep='|')

Minimal cleanup is then needed:

df = df.drop(0)
df = df.drop(['Unnamed: 0','Unnamed: 6'], axis=1)

As for the "writing such a spreadsheet" question... I don't see how anything could be more convenient than the plain presentation, but here's bad code for it, given the above cleaned df:

df1 = df.append(pd.DataFrame({i:['-'*len(i)] for i in df.columns})).sort_index() #adding the separator to column titles
df2 = pd.DataFrame({str(i)+'|':['|']*len(df1) for i in range(len(df1.columns))})
df3 = df1.join(df2)
col_order = [j for i in [[df1.columns[x], df2.columns[x]] for x in range(len(df1.columns))] for j in i]
df3.index = ['|']*len(df3.index)

Then:

df3[col_order]

    Age Category  0|   A  1|   B   2|   C   3|   D  4|
|  --------------  |  ---  |  ----  |  ----  |  ---  |
|   21-26          |   2   |   2    |   4    |   1   |
|   26-31          |   7   |   11   |   12   |   5   |
|   31-36          |   3   |   5    |   5    |   2   |
|   36-41          |   2   |   4    |   1    |   7   |
|   41-46          |   0   |   1    |   3    |   2   |
|   46-51          |   0   |   0    |   2    |   3   |

(edited)

Parse prettyprinted tabular data with pandas

4 Answers4

Linked