6

I am trying to read some excel files in pandas. In some files, the table of interest is not perfectly formatted, i.e. multiple rows are formatted as a single row but each such row has multiple lines. So the data appears fine when you view the excel file. Also when parsing it using pandas, there is indeed a newline character (\n) at the end of each such line.

The problem is that when I read it with read_excel() function, it converts it into a DataFrame which does not consider this line break as a separate row but puts it into one row with \n in it. I would like to write a code that treats/converts each such row with N lines as N rows (using the line-breaks as an indicator for new row).

Is there a way to do it either while parsing the file or post-processing the dataframe in Python?

Here I provide a very simplified version of my dummy excel-file and some code to explain the problem.

Sample Excel-File:

Name                | Price
-------------------------------
Coca Cola           |     46.66
-------------------------------
Google              |   1204.44
Facebook            |    177.58
-------------------------------
Berkshire Hathaway  | 306513.75

I simply use Pandas' read_excel in Python:

dataframe_parsed = pandas.read_excel(file_name)
print(dataframe_parsed.head())

I get the following DataFrame as output:

                 Name            Price
0           Coca Cola            46.66
1    Google\nFacebook  1204.44\n177.58
2  Berkshire Hathaway        306513.75

The desired output is:

                 Name           Price
0           Coca Cola           46.66
1              Google         1204.44
2            Facebook          177.58
3  Berkshire Hathaway       306513.75

Any help will be highly appreciated.

Frida Schenker
  • 1,219
  • 1
  • 9
  • 14

2 Answers2

5

After split you can check with unnesting

yourdf=unnesting(df.apply(lambda x : x.str.split(r'\\n')),['Name','Price'])
yourdf
Out[50]: 
                 Name      Price
0           Coca Cola      46.66
1              Google    1204.44
1            Facebook     177.58
2  Berkshire Hathaway  306513.75

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')

Since you mentioned above does not work

df.apply(lambda x : x.str.split(r'\\n')).stack().apply(pd.Series).stack().unstack(level=1).reset_index(drop=True)
Out[57]: 
                 Name      Price
0           Coca Cola      46.66
1              Google    1204.44
2            Facebook     177.58
3  Berkshire Hathaway  306513.75
BENY
  • 317,841
  • 20
  • 164
  • 234
  • @FridaSchenker you have empty row , which will leading to the error, if you sure your data format is same like what you show to us, my code will work – BENY Apr 10 '19 at 16:34
  • Thanks for your help. But sorry, it is not working. I checked again and there is no empty row in my dataframe. Your second piece of code just converts the second column into NaN. So still no solution. – Frida Schenker Apr 10 '19 at 16:49
  • @FridaSchenker sry, I can not using that link . maybe you can come up a better sample data can reproduced the error code when you try my solution – BENY Apr 10 '19 at 22:42
1

Thanks WenBen for your help. But I couldn't get your code to produce the desired output. However, using your unnesting link, I came up with a solution, with some help from the answer from @user3483203 on that page. I post the solution here, just in case it helps someone facing a similar problem:

import pandas as pd
import numpy as np

def main():
    # Make a simple dummy dataframe for testing
    my_dataframe = pd.DataFrame({'ColA':["a1", "a2\na3", "a4\n a5 space"],'ColB':["b1", "b2\nb3","b4\nb5"]})
    print("DataFrame before:\n", my_dataframe.head())

    my_dataframe_after = myUnnesting(my_dataframe)
    print("DataFrame after:\n", my_dataframe_after.head())

def myUnnesting(dataframe):
    new_dataframe = pd.DataFrame()
    for column in dataframe:
        # Convert each column into an array of lists and concatenate these lists into a single array 
        col_vals = np.concatenate(np.array(dataframe[column].str.split("\n")))
        new_dataframe[column] = col_vals 

    return new_dataframe

if __name__ == "__main__":
    main()

The Output:

DataFrame before:
             ColA    ColB
0             a1      b1
1         a2\na3  b2\nb3
2  a4\n a5 space  b4\nb5
DataFrame after:
         ColA ColB
0         a1   b1
1         a2   b2
2         a3   b3
3         a4   b4
4   a5 space   b5

Of course this solution assumes that for a given row, there is an equal number of \n in each column. This assumption works perfectly fine for the data which I am processing. However, if someone reading this would like to post a more generic solution which also works with other cases, it would be highly appreciated. Thanks.

Frida Schenker
  • 1,219
  • 1
  • 9
  • 14