concatenate dataframes with variable row sizes

Question

I have two csv files which have different row numbers.

test1.csv
num,sam
1,1.2
2,1.13
3,0.99

test2.csv
num,sam
1,1.2
2,1.1
3,0.99
4,1.02

I would like to read the sam columns and append them to an empty dataframe. Thing is that, when I read test1.csv, I extract the base file name, test1 and want to append the sam column based on the `column header in the empty dataframe.

big_df = pd.DataFrame(columns =['test1','test2'])
pwd = os.getcwd()
for file in os.listdir(pwd):
     filename = os.fsdecode(file)
     if filename.endswith(".csv"):
         prog = filename.split('.')[0] # test1 test2
         df = pd.read_csv(filename, usecols=['sam'])
         # The read dataframe has one column
         # Move/append that column to the big_df where column == prog
         big_df[prog] = df
print(big_df)

But big_df misses the fourth row of test2.csv.

   test1  test2
0   1.20   1.20
1   1.13   1.1
2   0.99   0.99

I expect to see

   test1  test2
0   1.20   1.20
1   1.13   1.1
2   0.99   0.99
3   NaN    1.02

How can I fix that?

Read the dataframes then merge them as you please: [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) — Mr. T, Feb 04 '22 at 12:03
@Mr.T: I understand the concept, but with `big_df.merge(df, how='outer')` I get the error `No common columns to perform merge on`. So, that is because of different columns in `big_df` and `df`. — mahmood, Feb 04 '22 at 12:50
Can you specify if the 'num' columns should be taken into account? More specifically, what should happen when one of them is not a continuous sequence starting at 1? — hpchavaz, Feb 04 '22 at 15:36

score 2 · Answer 1 · answered Feb 04 '22 at 12:19

2

Using pandas.concat and a simple dictionary comprehension:

files = ['test1.csv', 'test2.csv']
df = pd.concat({f.rsplit('.', 1)[0]: pd.read_csv(f).set_index('num')['sam']
                for f in files}, axis=1)

output:

     test1  test2
num              
1     1.20   1.20
2     1.13   1.10
3     0.99   0.99
4      NaN   1.02

answered Feb 04 '22 at 12:19

mozway

194,879
13
39
75

That is possible. But I was trying to read the files in a loop as the final code deals with large number of files. So, statically specifying the file names is not a preferred one. – mahmood Feb 04 '22 at 12:27
1

@mahmood you can use other ways to feed in the files. `glob` for instance or a function/generator – mozway Feb 04 '22 at 12:28
@mozway, would n't it be more memory efficient to use a generator instead of a dict , as: `(pd.read_csv(f).set_index('num')['sam'].rename(f.split('.')[0])`. If there are a large number of files it seems it could make a difference. – hpchavaz Feb 04 '22 at 17:57
@hpchavaz I haven't tested but I imagine that it should. You can also pass the labels as parameters to `concat`. – mozway Feb 04 '22 at 18:54

score 1 · Accepted Answer · answered Feb 04 '22 at 12:06

You could approach it differently and use concat instead of creating an empty data frame in the first place. Might be also a bit more efficient. In code that reads like

def get_columns():
    for file in os.listdir(pwd):
        filename = os.fsdecode(file)
        if filename.endswith(".csv"):
            prog = filename.split('.')[0] # test1 test2
            yield pd.read_csv(filename, usecols=['sam'])['sam'].rename(prog)
big_df = pd.concat(get_columns(), axis=1)

Otherwise, you could use merge with outer as mentioned in the comment.

The question is slightly ambiguous. The answer is correct if the 'num' columns have no meaning. On the other hand, if they are to be taken into account, and one of the 'num' columns is not a continuous sequence starting at 1, the result could be different from what is desired. — hpchavaz, Feb 04 '22 at 15:38

score 0 · Answer 3 · answered Feb 04 '22 at 12:18

read bothe csv's first using pd.read_csv("filename",sep=',')

df1
   num   sam
0    1  1.20
1    2  1.13
2    3  0.99

df2
   num   sam
0    1  1.20
1    2  1.10
2    3  0.99
3    4  1.02

the do the following

df2.drop('num',axis=1,inplace=True)
df2.columns=['test2']
df2['test1']=df1['sam']

output:

   test2  test1
0   1.20   1.20
1   1.10   1.13
2   0.99   0.99
3   1.02    NaN

concatenate dataframes with variable row sizes

3 Answers3