
I have several data frames (say three) that each contain some of the rows of another data frame; the pieces can sometimes overlap. The columns are the same in all three dfs. I want to create a final data frame that contains all the rows from the three pieces. I also need a column in the final df that records which of the three dfs each row came from.

Example below:

Original data frame:

import numpy as np
import pandas as pd

original_df = pd.DataFrame(np.array([[1,1],[2,2],[3,3],[4,4],[5,5],[6,6]]), columns = ['label1','label2'])

Three dfs containing different pieces of the original df:

columns = ['label1', 'label2']
a = original_df.loc[0:1, columns]
b = original_df.loc[2:2, columns]
c = original_df.loc[3:, columns]

I want to get the following data frame:

final_df = pd.DataFrame([[1,1,'a'],[2,2,'a'],[3,3,'b'],[4,4,'c'],
[5,5,'c'],[6,6,'c']], columns = ['label1','label2', 'from which df this row'])

Or simply use integers to mark which df each row came from:

final_df = pd.DataFrame(np.array([[1,1,1],[2,2,1],[3,3,2],[4,4,3],\
[5,5,3],[6,6,3]]), columns = ['label1','label2', 'from which df this row'])

Thank you in advance!

glebrwl

3 Answers


See this related post

IIUC, you can use pd.concat with the keys and names arguments:

pd.concat(
    [a, b, c], keys=['a', 'b', 'c'],
    names=['from which df this row']
).reset_index(0)

  from which df this row  label1  label2
0                      a       1       1
1                      a       2       2
2                      b       3       3
3                      c       4       4
4                      c       5       5
5                      c       6       6
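
The question also mentions integer labels. A minimal sketch of that variant, assuming the same `keys`/`names` approach and rebuilding the example frames so the snippet runs on its own (the mapping 1 = a, 2 = b, 3 = c is just the one used in the question):

```python
import pandas as pd

# Rebuild the example data from the question.
original_df = pd.DataFrame(
    {"label1": [1, 2, 3, 4, 5, 6], "label2": [1, 2, 3, 4, 5, 6]}
)
a = original_df.loc[0:1]
b = original_df.loc[2:2]
c = original_df.loc[3:]

# Integer keys instead of names: 1 -> a, 2 -> b, 3 -> c.
final_df = pd.concat(
    [a, b, c],
    keys=[1, 2, 3],
    names=["from which df this row"],
).reset_index(0)  # move the key level out of the index into a column

print(final_df)
```

Because the keys are integers, the new column should come out with an integer dtype rather than strings.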

However, I'd recommend that you store those dataframe pieces in a dictionary.

parts = {
    'a': original_df.loc[0:1],
    'b': original_df.loc[2:2],
    'c': original_df.loc[3:]
}

pd.concat(parts, names=['from which df this row']).reset_index(0)

  from which df this row  label1  label2
0                      a       1       1
1                      a       2       2
2                      b       3       3
3                      c       4       4
4                      c       5       5
5                      c       6       6

And as long as the pieces are stored in a dictionary, you can also use assign like this:

pd.concat(d.assign(**{'from which df this row': k}) for k, d in parts.items())

   label1  label2 from which df this row
0       1       1                      a
1       2       2                      a
2       3       3                      b
3       4       4                      c
4       5       5                      c
5       6       6                      c

Keep in mind that I used the double-splat ** because your column name contains spaces, so it can't be written as a keyword argument. If the column name were a valid Python identifier, we could do

pd.concat(d.assign(WhichDF=k) for k, d in parts.items())

   label1  label2 WhichDF
0       1       1       a
1       2       2       a
2       3       3       b
3       4       4       c
4       5       5       c
5       6       6       c
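
The question notes that the pieces can overlap, in which case the concatenations above repeat a row once per piece. One possible way to collapse those repeats is sketched below. It assumes the original index uniquely identifies a row, and the overlapping slices and the comma-joined label format are made up for illustration:

```python
import pandas as pd

original_df = pd.DataFrame(
    {"label1": [1, 2, 3, 4, 5, 6], "label2": [1, 2, 3, 4, 5, 6]}
)

# Hypothetical overlapping pieces: row 1 appears in both 'a' and 'b'.
parts = {
    "a": original_df.loc[0:1],
    "b": original_df.loc[1:2],
    "c": original_df.loc[3:],
}

# Stack the pieces with a label column; row 1 shows up twice here.
stacked = pd.concat(d.assign(WhichDF=k) for k, d in parts.items())

# Join the labels of every piece that contains a given original index value.
sources = stacked.groupby(level=0)["WhichDF"].agg(",".join)

# One row per original index value, with the combined labels attached.
merged = original_df.loc[sources.index].assign(WhichDF=sources)

print(merged)
```

If the index is not a reliable row identifier, grouping on the data columns instead would be the analogous choice.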
piRSquared

Just create a list and concatenate it at the end:

list_df = []
list_df.append(df1)
list_df.append(df2)
list_df.append(df3)
df = pd.concat(list_df)
s_khan92

Perhaps this can work / add value for you :)

import pandas as pd

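# from your post; .copy() lets us add a column without a SettingWithCopyWarning
columns = ['label1', 'label2']
a = original_df.loc[0:1, columns].copy()
b = original_df.loc[2:2, columns].copy()
c = original_df.loc[3:, columns].copy()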

# create new column to label the datasets
a['label'] = 'a'
b['label'] = 'b'
c['label'] = 'c'

# add each df to a list
combined_l = []
combined_l.append(a)
combined_l.append(b)
combined_l.append(c)

# concat all dfs into 1
df = pd.concat(combined_l)
Anthony R