
I have N dataframes with different numbers of columns, and I want to get one dataframe with two columns, x and y, where x holds the data from the input dataframes' columns and y is the corresponding column name. I have many such dataframes to concatenate (N is of the order of 10^2), so efficiency is a priority. A numpy approach rather than a pandas one is also welcome.

For example,

df1:
   one two
0    1   a
1    2   b
2    3   c
3    4   d
4    5   e

df2: 
  three four
0   NaN     
1  None    f
2          g
3     6    7

Final Output Dataframe:
    x      y
0   1    one
1   2    one
2   3    one
3   4    one
4   5    one
5   a    two
6   b    two
7   c    two
8   d    two
9   e    two
10  6  three
11  f   four
12  g   four
13  7   four

Note: I'm ignoring empty strings, NaNs and Nones in the final dataframe.

user281989

1 Answer

IIUC you can use melt() on each dataframe before concatenating, then drop the NaN/None/empty values:

final = (pd.concat([df1.melt(), df2.melt()])
           .dropna(subset=['value'])                       # drop NaNs and Nones
           .loc[lambda d: d['value'] != '']                # drop empty strings
           .rename(columns={'variable': 'y', 'value': 'x'})
           .reindex(['x', 'y'], axis=1).reset_index(drop=True))
print(final)
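
If you want to stay closer to numpy (as the question allows), here is a rough sketch that builds the two columns directly from the underlying arrays; `melt_many` is just an illustrative helper name, not an existing API:

import numpy as np
import pandas as pd

def melt_many(dfs):
    # Flatten every column of every dataframe into x (values) and y (column names).
    xs, ys = [], []
    for df in dfs:
        for col in df.columns:
            vals = df[col].to_numpy()
            xs.append(vals)
            ys.append(np.repeat(col, len(vals)))
    out = pd.DataFrame({'x': np.concatenate(xs), 'y': np.concatenate(ys)})
    # Ignore empty strings, NaNs and Nones, per the question's note.
    return out[out['x'].notna() & (out['x'] != '')].reset_index(drop=True)

final = melt_many([df1, df2])
print(final)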
anky
  • This is much better than what I had. I have to do the following because I am reading the dataframes from files, but it takes a long time. Looking for a faster solution. Any ideas? `final=(pd.concat([pd.read_csv('data/' + i).melt() for i in csvs]).rename(columns={'variable':'y','value':'x'}).reindex(['x','y'],axis=1))` – user281989 Jun 21 '19 at 23:17
  • @user281989 break it down into pieces and check whether the time is spent reading the files (pandas can take time to read large files) or in the reshaping code. You can also check [this](https://stackoverflow.com/questions/25508510/fastest-way-to-parse-large-csv-files-in-pandas) – anky Jun 22 '19 at 04:15
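
For the follow-up about speed, a quick sketch of the split anky suggests (reusing the `csvs` list from the comment above, which is assumed to already exist) is to time the reading and the reshaping separately:

import time
import pandas as pd

t0 = time.perf_counter()
frames = [pd.read_csv('data/' + i) for i in csvs]   # file reading only
t1 = time.perf_counter()
final = (pd.concat([f.melt() for f in frames])      # reshaping only
           .rename(columns={'variable': 'y', 'value': 'x'})
           .reindex(['x', 'y'], axis=1))
t2 = time.perf_counter()
print(f'read: {t1 - t0:.2f}s  reshape: {t2 - t1:.2f}s')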