
I think I messed up trying to save a Pandas Series that contained a bunch of Pandas DataFrames. It turns out that each DataFrame was saved as if I had called df.to_string() on it.

From my observations so far, my strings have extra spacing in some places, as well as extra \ characters when the DataFrame has too many columns to be displayed on one row.

Here is a "more appropriate" DataFrame:

import pandas as pd

df = pd.DataFrame(columns=["really long name that goes on for a while", "another really long string", "c"]*6, 
                  data=[["some really long data",2,3]*6,[4,5,6]*6,[7,8,9]*6])

The strings that I have and wish to turn into a DataFrame look like this:

# str(df)

'  really long name that goes on for a while  another really long string  c  \\\n0                     some really long data                           2  3   \n1                                         4                           5  6   \n2                                         7                           8  9   \n\n  really long name that goes on for a while  another really long string  c  \\\n0                     some really long data                           2  3   \n1                                         4                           5  6   \n2                                         7                           8  9   \n\n  really long name that goes on for a while  another really long string  c  \\\n0                     some really long data                           2  3   \n1                                         4                           5  6   \n2                                         7                           8  9   \n\n  really long name that goes on for a while  another really long string  c  \\\n0                     some really long data                           2  3   \n1                                         4                           5  6   \n2                                         7                           8  9   \n\n  really long name that goes on for a while  another really long string  c  \\\n0                     some really long data                           2  3   \n1                                         4                           5  6   \n2                                         7                           8  9   \n\n  really long name that goes on for a while  another really long string  c  \n0                     some really long data                           2  3  \n1                                         4                           5  6  \n2                                         7                           8  9  '

How would I revert a string like this back to a DataFrame?

Thanks

AsheKetchum

3 Answers


New answer

In response to your new, edited question, the best answer I have is to use to_csv instead of to_string. to_string doesn't really support this use case as well as to_csv (and I don't see how I can save you from doing a bunch of conversions to and from StringIO instances...).

import pandas as pd
from io import StringIO

df = pd.DataFrame(columns=["really long name that goes on for a while", "another really long string", "c"]*6, 
                  data=[["some really long data",2,3]*6,[4,5,6]*6,[7,8,9]*6])
s = StringIO()
df.to_csv(s)
# To get the string, use `s.getvalue()`. Note that after to_csv the stream
# position is at the end, so reading `s` directly would return nothing;
# getvalue() grabs the whole buffer regardless of position.

pd.read_csv(StringIO(s.getvalue()))
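Two small caveats worth noting with this round trip (my additions, not part of the original recipe): to_csv writes the index as an unnamed first column, so pass index_col=0 on the way back in to avoid a spurious "Unnamed: 0" column; and read_csv de-duplicates repeated column names, so this example frame's repeated headers come back as c, c.1, c.2, and so on.

# Restores the original index instead of adding an "Unnamed: 0" column
pd.read_csv(StringIO(s.getvalue()), index_col=0)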

I hope this update helps; I'll leave my old answer below for continuity.


Old answer

In a very cool twist, the answer to this will also help you read a commonly pasted format of DataFrame output on Stack Overflow. Consider that we can read a df from a string like so:

data = """    0   20   30   40   50
 1  5  NaN   3    5   NaN
 2  2   3    4   NaN   4
 3  6   1    3    1   NaN"""

import pandas as pd
from io import StringIO
data = StringIO(data)
df = pd.read_csv(data, sep="\\s+")

This results in the following df:

(image omitted: the resulting DataFrame, with columns 0, 20, 30, 40, 50 and rows indexed 1 through 3)

You can read the output of to_string the same way:

pd.read_csv(StringIO(df.to_string()), sep="\\s+")

And the resulting df is the same.
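One caveat that comes up in the comments below: sep="\\s+" splits on every run of whitespace, so it mangles string cells that contain spaces. A quick sketch of the failure and a partial workaround, splitting on two-or-more spaces instead (which matches the padding to_string puts between columns):

import pandas as pd
from io import StringIO

s = pd.DataFrame({"a": ["some really long data"], "b": [2]}).to_string()
# Single-space splitting tears the multi-word value apart
# (the extra tokens get absorbed into the index):
pd.read_csv(StringIO(s), sep="\\s+")
# Splitting on runs of 2+ whitespace keeps the cell intact
# (regex separators need the python engine):
pd.read_csv(StringIO(s), sep=r"\s{2,}", engine="python")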

Charles Landau
  • does this work if there are strings that contain spaces? – AsheKetchum Jan 23 '19 at 23:21
  • My current problem is that I have a bunch of strings (not the DataFrames), and I wish to recover the DataFrames. `str(df)` was just an example of what my strings look like. Sorry for the confusion – AsheKetchum Jan 23 '19 at 23:44
  • @AsheKetchum how are the strings delimited? If unescaped delimiter can be found in the data then it seems a bit unreasonable to expect to be able to parse it without more constraints – Charles Landau Jan 23 '19 at 23:57
  • I'm still trying to figure out what the delimiter is. I tried `[i for i in str(df).split("\n")]` which kinda shows the line by line composition. – AsheKetchum Jan 24 '19 at 00:02
  • That's the newline character, the delimiter splits columns (e.g. comma in a csv), and what I'm saying is that whoever exported those strings appears to have done so without their delimiters – Charles Landau Jan 24 '19 at 00:09
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/187214/discussion-between-asheketchum-and-charles-landau). – AsheKetchum Jan 24 '19 at 00:13

Try this. I've updated it to include logic that automatically calculates the number of rows: I extract the largest value of your original DataFrame's index (the row number), which sits inside the big string.

If we begin with a dataframe converted to a string using the example you gave:

import pandas as pd

df = pd.DataFrame(columns=["really long name that goes on for a while", "another really long string", "c"]*6, 
                  data=[["some really long data",2,3]*6,[4,5,6]*6,[7,8,9]*6])

string = str(df)
  1. First, let's extract the column names:
import re
import numpy as np

lst = re.split('\n', string)
# The row just before the first blank line is the last row of the first chunk;
# its first character is the highest index label (so this assumes < 10 rows)
num_rows = int(lst[lst.index('') -1][0]) + 1
col_names = []
lst = [i for i in lst if i != '']

for i in range(0,len(lst), num_rows + 1):
    col_names.append(lst[i])

new_col_names = []
for i in col_names:
    new_col_names.append(re.split('  ', i))

final_col_names = []
for i in new_col_names:
    final_col_names += i

final_col_names = [i for i in final_col_names if i != '']
final_col_names = [i for i in final_col_names if i != '\\']
  2. Then, let's get the data:
for i in col_names:
    lst.remove(i)

new_lst = [re.split(r'\s{2,}', i) for i in lst]
new_lst = [i[1:-1] for i in new_lst]

newer_lst = []
for i in range(num_rows):
    sub_lst = []
    # Row i's pieces sit at every num_rows-th data line across the chunks,
    # so iterate over the data lines (not the column names)
    for j in range(i, len(new_lst), num_rows):
        sub_lst += new_lst[j]
    newer_lst.append(sub_lst)

reshaped = np.reshape(newer_lst, (num_rows,len(final_col_names)))
  3. Finally, we can create the reconstructed dataframe using the data and column names:
fixed_df = pd.DataFrame(data=reshaped, columns = final_col_names)

My code performs some loops, so if your original DataFrame had hundreds of thousands of rows, this approach could take a while.
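One more caveat on the num_rows line in step 1: it reads only the first character of the last row label, so it silently breaks once the frame has ten or more rows. A slightly more robust drop-in replacement (my tweak, not part of the original answer) parses the whole label, though it still assumes a default RangeIndex starting at 0:

# Take the entire leading row label, not just its first character
num_rows = int(lst[lst.index('') - 1].split()[0]) + 1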

James Dellinger
  • Thanks! I wish I had an easy way to get the number of rows :) I do have the number of columns though, which with some counting and division gives me the number of rows. – AsheKetchum Jan 24 '19 at 01:17
  • No problem. I just realized that it's possible to extract the number of rows, and updated my code to do this automatically: `num_rows = int(lst[lst.index('') -1][0]) + 1` – James Dellinger Jan 24 '19 at 01:26

So, I'm not really sure how helpful this question was or will be to anyone, but I wrote a function (and a helper) to try to bring back the data that I mistakenly stored as DataFrames nested in a pd.Series.

Here are the functions:

import re
import pandas as pd
from io import StringIO

def insertNan(substring):
    rows = substring.split('\n')
    headers = re.sub(r"  \s+", "  ", rows[0].replace("\\", "").strip()).split("  ")
    # The [2] below is a placeholder for the index column's width. (Look in str(df);
    # it may appear like "\\\n1".) Notice that if your tables get past 100 rows,
    # the 2 needs to be 3, or be determined otherwise.
    boundaries = [0] + [2] + [rows[0].find(header)+len(header) for header in headers]
    values = []
    for i, row in enumerate(rows):
        values.append(row)
        # First row is just column headers. If no headers then don't use these functions
        if i==0:
            continue
        for j, bound in enumerate(boundaries[:-1]):
            value = row[bound:boundaries[j+1]].strip()
            if not value:
                newstring = list(values[i])
                newstring[boundaries[j+1]-3:boundaries[j+1]] = "NaN"
                values[i] = ''.join(newstring)
            if "  " in value:
                start = values[i].find(value)
                newvalue = re.sub(r" \s+", " ", value)
                values[i] = values[i][:start]+newvalue+values[i][start+len(value):]
    return '\n'.join(values)

def from_string(string):
    string = string.replace("\\", "")
    chunks = [insertNan(i).strip() for i in string.split("\n\n")]
    frames = [pd.read_csv(StringIO(chunk), sep=" \\s+", engine='python') 
              for chunk in chunks]
    return pd.concat(frames, axis=1)

# Read the file and loop through the series. These two lines might have to be modified.
corrupted_results = pd.read_excel(fileio, squeeze=True)
results = [from_string(result) for result in corrupted_results.values]

That pretty much brought me back to the pd.Series (results) that I started with.

Except some overly long text entries got cut off with "...".
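That truncation is most likely pandas' display.max_colwidth option (50 characters by default), which str(df) respects, so the full text was already lost at save time. Raising it before saving would have avoided this:

pd.set_option('display.max_colwidth', None)  # older pandas versions want -1 instead of None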

So to conclude, saving data as DataFrames nested in a pd.Series is probably just a bad idea. I have now decided to save one concatenated DataFrame instead, made by concatenating the DataFrames with an added "name" column, which lets me split them back apart later with .groupby if needed.
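For what it's worth, a minimal sketch of that scheme (the frames and keys below are hypothetical stand-ins for the real DataFrames and their labels):

import pandas as pd

# Hypothetical stand-ins for the original DataFrames and their labels
frames = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]
keys = ["first", "second"]

# Tag each frame with a "name" column and stack everything into one DataFrame
combined = pd.concat(
    [f.assign(name=k) for f, k in zip(frames, keys)],
    ignore_index=True,
)

# Later, split the frames back apart with .groupby
recovered = {k: g.drop(columns="name") for k, g in combined.groupby("name")}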

As a side note, if the DataFrames saved in the pd.Series have no header, the functions I have provided will probably not work unless modified.

Special thanks to ColdSpeed, Charles Landau, and JamesD for their time, help and kindness!

AsheKetchum