Concatenate several CSV files using two columns

Question

I have several CSV files: one with blood pressure readings for patients, and others with heart rate, WBC, etc. for the same patients and the same measurement hours. Please see the following example.

First csv:

    subject_id  hour_id  value  label
    1           1        96     blood pressure
    1           1        94     blood pressure
    1           1        93     blood pressure
    2           2        99     blood pressure

Second csv:

    subject_id  hour_id  value  label
    1           1        80     Heart rate
    2           2        89     Heart rate
    2           2        81     Heart rate

Third csv:

    subject_id  hour_id  value  label
    1           1        1      WBC
    2           2        10     WBC
    2           2        12     WBC

Fourth csv:

    subject_id  hour_id  value  label
    1           1        123    glucose
    2           2        111    glucose
    2           2        113    glucose

Desired output:

    subject_id  hour_id  blood_pressure  heart rate  WBC  glucose
    1           1        96              80          1    123
    2           2        99              89          10   120

I tried:

    df = pd.read_csv('D:\\....', low_memory=False, error_bad_lines=False)
    df2 = pd.read_csv('D:\\Users', low_memory=False, error_bad_lines=False)
    merged = pd.concat([df, df2,df3,df4], axis=1, keys=['subject_id', 'hour_mesaure'])
    print(merged)

But it gives me:

    subject_id  hour_id  blood_pressure
    1           1        96
    2           2        99

    subject_id  hour_id  value  label
    1           1        80     Heart rate
    2           2        89     Heart rate

and it continues with the remaining files in sequence.

Any help will be appreciated.

why python 2.7 & python 3.x - do you have plans to run the code on python 2.7? — balderman, Sep 09 '21 at 20:38
no just want to be visible to all persons interested in python — Nora Mahmoud, Sep 09 '21 at 20:39
2.7 should not be in use unless you have a very good reason to use it. I will remove this tag. — balderman, Sep 09 '21 at 20:40
I think you want something more like merge, not concat. See the accepted answer [here](https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes). You would just merge on subject_id. That should produce what you want. — brobertsp, Sep 09 '21 at 21:15
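
The merge suggestion in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: it merges on both `subject_id` and `hour_id` (the comment mentions only `subject_id`), it keeps just the first reading per patient and hour, and the helper name `to_wide` and the in-memory frames (built from the sample rows in the question) are made up for the example.

```python
from functools import reduce

import pandas as pd

KEYS = ["subject_id", "hour_id"]


def to_wide(frame):
    """Keep the first reading per key and name the value column after the label."""
    name = frame["label"].iloc[0]
    return (
        frame.drop_duplicates(subset=KEYS)  # keep="first" is the default
        .rename(columns={"value": name})
        .drop(columns="label")
    )


# Stand-ins for the frames read from the CSV files
bp = pd.DataFrame({"subject_id": [1, 1, 1, 2], "hour_id": [1, 1, 1, 2],
                   "value": [96, 94, 93, 99], "label": "blood pressure"})
hr = pd.DataFrame({"subject_id": [1, 2, 2], "hour_id": [1, 2, 2],
                   "value": [80, 89, 81], "label": "Heart rate"})

# Merge the per-measurement frames pairwise on the two key columns
merged = reduce(
    lambda left, right: left.merge(right, on=KEYS, how="outer"),
    [to_wide(bp), to_wide(hr)],
)
print(merged)
```

More frames can be appended to the list passed to `reduce`; `how="outer"` keeps patients that are missing from some files.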

score 1 · Answer 1 · answered Sep 09 '21 at 20:22

You basically need to pivot the data after the concat.

You can proceed like this:

    merged = pd.concat([df, df2, df3, df4])

After this, you need to pivot the data:

    merged.pivot(index=['subject_id', 'hour_id'], columns=['label'], values=['value'])

Senpaivg

I tried this solution, but it gives me this error: "ValueError: Index contains duplicate entries, cannot reshape" – Nora Mahmoud Sep 09 '21 at 20:38
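
This error arises because `pivot` requires each (`subject_id`, `hour_id`, `label`) combination to occur only once, while the sample files contain repeated readings. One possible workaround, sketched here on in-memory frames built from the sample rows in the question, is `pivot_table` with `aggfunc="first"`, which keeps the first reading in each group. Keeping the first reading is an assumption based on the desired output in the question; `"mean"` or `"max"` may suit other cases better.

```python
import pandas as pd

# Stand-ins for the frames read from the CSV files
bp = pd.DataFrame({"subject_id": [1, 1, 1, 2], "hour_id": [1, 1, 1, 2],
                   "value": [96, 94, 93, 99], "label": "blood pressure"})
hr = pd.DataFrame({"subject_id": [1, 2, 2], "hour_id": [1, 2, 2],
                   "value": [80, 89, 81], "label": "Heart rate"})

# Stack the files into one long table
long_df = pd.concat([bp, hr], ignore_index=True)

# pivot_table aggregates duplicates instead of raising; "first" keeps
# the first reading per (subject_id, hour_id, label)
wide = (
    long_df.pivot_table(index=["subject_id", "hour_id"], columns="label",
                        values="value", aggfunc="first")
    .reset_index()
    .rename_axis(columns=None)
)
print(wide)
```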

balderman · Answer 2 · 2021-09-09T21:06:11.800

This approach uses no external libraries.
The idea is to collect the data into a dict, then iterate over the dict to build the output.
1.csv & 2.csv contain the BP & HR data.
Extend the list [1, 2] to add more input files.
Rows are split on whitespace, so a multi-word label such as "blood pressure" is cut to its first word in the header.

The output is comma-separated, but you can change the separator if you like.

    from collections import defaultdict

    # Maps (subject_id, hour_id) to a list of (value, label) pairs
    data = defaultdict(list)

    for x in [1, 2]:
        with open(f'{x}.csv') as f:
            # Drop blank lines and surrounding whitespace
            lines = [l.strip() for l in f.readlines() if l.strip()]
            for idx, line in enumerate(lines):
                if idx > 0:  # skip the header row
                    parts = line.split()
                    data[(parts[0], parts[1])].append((parts[2], parts[3]))

    with open('merged.csv','w') as f:
        for idx, (k, v) in enumerate(data.items()):
            if idx == 0:
                # Build the header from the labels of the first key
                headers = ['subject_id', 'hour_id']
                headers.extend(x[1] for x in v)
                f.write(','.join(headers) + '\n')
            fields = [k[0], k[1]]
            fields.extend(x[0] for x in v)
            f.write(','.join(fields) + '\n')

Output:

    subject_id,hour_id,blood,Heart
    1,1,96,80
    2,2,99,89

The csv files and the python script should be in the same folder. Give it a try - it works. Save the code as python script (Example: 'csv_merger.py' ). Run it and you should see the output. — balderman, Sep 09 '21 at 20:59
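
The whitespace split in the script above truncates multi-word labels, which is why the header shows `blood` and `Heart`. If the files are genuinely comma-separated (an assumption; the question shows them whitespace-aligned), a variant built on the standard `csv` module keeps the full labels. This is a sketch, not balderman's code: the function names `collect` and `write_wide` are made up, the files are replaced by in-memory stand-ins, and only the first reading per patient, hour, and label is kept.

```python
import csv
import io
from collections import defaultdict


def collect(sources):
    """Gather long-format rows into {(subject_id, hour_id): {label: value}}."""
    wide = defaultdict(dict)
    labels = []  # remembers the order in which labels first appear
    for src in sources:
        for row in csv.DictReader(src):
            key = (row["subject_id"], row["hour_id"])
            if row["label"] not in labels:
                labels.append(row["label"])
            # setdefault keeps the first reading and ignores later repeats
            wide[key].setdefault(row["label"], row["value"])
    return labels, wide


def write_wide(labels, wide, out):
    """Write one row per (subject_id, hour_id) with one column per label."""
    writer = csv.writer(out)
    writer.writerow(["subject_id", "hour_id", *labels])
    for (subject, hour), values in wide.items():
        writer.writerow([subject, hour, *(values.get(name, "") for name in labels)])


# In-memory stand-ins for the real files (sample rows from the question)
bp_csv = io.StringIO(
    "subject_id,hour_id,value,label\n"
    "1,1,96,blood pressure\n1,1,94,blood pressure\n2,2,99,blood pressure\n"
)
hr_csv = io.StringIO(
    "subject_id,hour_id,value,label\n"
    "1,1,80,Heart rate\n2,2,89,Heart rate\n2,2,81,Heart rate\n"
)

labels, wide = collect([bp_csv, hr_csv])
out = io.StringIO()
write_wide(labels, wide, out)
print(out.getvalue())
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline="")`, and `out` by a file opened for writing.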

Sabil · Answer 3 · 2021-09-10T08:55:58.443

0

You can try this:

    import pandas as pd

    df1 = pd.read_csv('1.csv')
    df2 = pd.read_csv('2.csv')
    df3 = pd.read_csv('3.csv')
    df4 = pd.read_csv('4.csv')

    dfs = [df1, df2, df3, df4]

    # Stack the files into one long table, then spread the labels into columns
    df = pd.concat(dfs)
    df = (
        df.pivot(index=['subject_id', 'hour_id'], columns='label', values='value')
        .reset_index()
        .rename_axis(index=None, columns=None)
    )

    print(df)

Output:

       subject_id  hour_id  Heart rate  WBC  blood pressure  glucose
    0           1        1          80    1              96      123
    1           2        2          89   10              99      120

Online Live Demo Link: https://replit.com/@tssovi/test#main.py

edited Sep 10 '21 at 08:55

answered Sep 09 '21 at 21:11

Sabil

I tried this, but it also gives me the following error: "ValueError: Index contains duplicate entries, cannot reshape" – Nora Mahmoud Sep 10 '21 at 08:04
I just ran the code and it shows the same result that I added in the answer. Could you please try again or share the code that you tried? – Sabil Sep 10 '21 at 08:15
Then there should be some other issue. – Sabil Sep 10 '21 at 08:53
@NoraMahmoud I just updated the answer and added a live demo link. Could you please try that? – Sabil Sep 10 '21 at 09:01
I now know where my problem comes from. – Nora Mahmoud Sep 10 '21 at 10:46
I updated the question with the problem I have. – Nora Mahmoud Sep 10 '21 at 10:47
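
If every repeated reading should be kept rather than collapsed to one, one option is to number the repeats within each (`subject_id`, `hour_id`, `label`) group and include that number in the index before reshaping. The sketch below assumes the concatenated long-format frame has the four columns from the question; the `reading` column name and the in-memory data are made up for illustration.

```python
import pandas as pd

# Stand-in for the concatenated long-format data
long_df = pd.DataFrame({
    "subject_id": [1, 1, 1, 2, 2],
    "hour_id": [1, 1, 1, 2, 2],
    "value": [96, 94, 80, 99, 89],
    "label": ["blood pressure", "blood pressure", "Heart rate",
              "blood pressure", "Heart rate"],
})

keys = ["subject_id", "hour_id"]

# Diagnose: True means pivot would raise "Index contains duplicate entries"
has_duplicates = long_df.duplicated(subset=keys + ["label"]).any()

# Number the repeated readings 0, 1, 2, ... within each group
long_df["reading"] = long_df.groupby(keys + ["label"]).cumcount()

# With the counter in the index every entry is unique, so unstack succeeds
wide = (
    long_df.set_index(keys + ["reading", "label"])["value"]
    .unstack("label")
    .reset_index()
    .rename_axis(columns=None)
)
print(wide)
```

Measurements with fewer repeats than others show up as `NaN` in the extra rows.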
