How to get two dataframes from one csv with multiple index columns

Question

I have a CSV file like that:

Time [s],Channel 0-Analog, Time [s],Reset-Digital, Time [s],Channel 1-Digital, Time [s],Channel 2-Digital, Time [s],Channel 3-Digital
-0.002204166666667, 2048.000000000000000, -0.002204166666667, 1, -0.002204166666667, 0, -0.002204166666667, 1, -0.002204166666667, 1
-0.002204000000000, 2048.000000000000000, -0.001124000000000, 0, -0.001504666666667, 1, -0.001448500000000, 0, -0.000199666666667, 0
-0.002203833333333, 2048.000000000000000, -0.000000000000000, 1, 0.000301666666667, 0, 0.000841666666667, 1, 0.000056333333333, 1
-0.002203666666667, 2048.000000000000000, 0.000550833333333, 0, 0.000932000000000, 1, 0.003178666666667, 0, 0.002361000000000, 0
-0.002203500000000, 2048.000000000000000, 0.003259333333333, 1, 0.002538166666667, 0, 0.005142333333333, 1, 0.004062000000000, 1
-0.002203333333333, 2048.000000000000000, 0.005602833333333, 0, ...

And want to have a single data frame with just one time "line".

The idea was to create two data frames and merge them to one with resp to column Time [s]. So I created that sequence.

df1 = pd.read_csv('untitled.csv',usecols=[2,3])
df2 = pd.read_csv('untitled.csv',usecols=[4,5])

merged = pd.merge(df1,df2,on=r'Time [s]')

But it did not work. KeyError: 'Time [s]'

/**************************************************************************/

I figured out that pandas is adding a numbering to the columns that are duplicated. So I changed my code like this.

df1 = pd.read_csv('untitled.csv',usecols=[2,3])
df2 = pd.read_csv('untitled.csv',usecols=[4,5])
df1.columns = df1.columns.str.strip('.123 ')
df2.columns = df2.columns.str.strip('.123 ')
merged =pd.merge(df1,df2,on=r'Time [s]',how='outer')
merged.set_index(r'Time [s]')

But now I have the issue that the index is just sorted for elements that have no NaN. Means first all rows where both columns have Numbers, then where just the first column has no NaN and then where just the second column has no NaN.

           Reset-Digital  Channel 1-Digital
   Time [s]                                   
   -0.002204            1.0                0.0
   -0.001124            0.0                NaN
   -0.000000            1.0                NaN
    0.000551            0.0                NaN
              ...                  ...
   -0.001505            NaN                1.0
    0.000302            NaN                0.0
    0.000932            NaN                1.0
    0.002538            NaN                0.0
              ...                    ...

I need it in this format

           Reset-Digital  Channel 1-Digital
   Time [s]                                   
    0.000302            NaN                0.0
    0.000551            0.0                NaN
    0.000932            NaN                1.0
    0.002538            NaN                0.0
   -0.000000            1.0                NaN
   -0.001124            0.0                NaN
   -0.001505            NaN                1.0
   -0.002204            1.0                0.0
              ...                    ...

I would recommend you ask a new question, this is a different problem now, and it's not clear to me what you want. — IanS, Jul 26 '17 at 07:07
@Peter - Do you need 2 dataframes - one for `Analog` columns with time columns and second fro digital with `time` columns? — jezrael, Jul 26 '17 at 09:15
@jezrael No one frame is ok, im just interested into the digital signals — Peter, Jul 26 '17 at 10:54

score 0 · Answer 1 · edited Aug 30 '17 at 06:15

I have a simpler suggestion using pd.melt:

read the csv into a single data frame with your columns of interest;
select columns name containing Time as keys, and columns name containing Channel as values;
if you want, you can use df.drop("variable", axis=1) to get rid of the extra column created by melt.

code sample

df = pd.read_csv('untitled.csv')
keys = [col for col in df.columns if col.startswith('Time')]
values = [col for col in df.columns if col.startswith('Channel')]
pd.melt(df, id_vars=values, value_vars=keys, value_name='Time')

Note: my answer was inspired by this one :-)

jezrael · Accepted Answer · 2017-07-27T05:23:38.187

Solution works if all columns names are unique and Time columns are previous cols of signal columns:

#get all columns with Digital text
d = df.columns[df.columns.str.contains('Digital')]
print (d)
Index(['Reset-Digital', 'Channel 1-Digital', 'Channel 2-Digital',
       'Channel 3-Digital'],
      dtype='object')

#get all previous columns (Time columns)
#for new versions of pandas for Time columns are added 1,2..for no duplicates
td = df.columns[df.columns.get_indexer(d) - 1]
print(td)
Index(['Time [s].1', 'Time [s].2', 'Time [s].3', 'Time [s].4'], dtype='object')

#zip time and signal column and concat data
df = pd.concat([df.set_index(x[0])[x[1]] for x in zip(td, d)], axis=1)
print (df)
          Reset-Digital  Channel 1-Digital  Channel 2-Digital  \
-0.002204            1.0                0.0                1.0   
-0.001505            NaN                1.0                NaN   
-0.001448            NaN                NaN                0.0   
-0.001124            0.0                NaN                NaN   
-0.000200            NaN                NaN                NaN   
-0.000000            1.0                NaN                NaN   
 0.000056            NaN                NaN                NaN   
 0.000302            NaN                0.0                NaN   
 0.000551            0.0                NaN                NaN   
 0.000842            NaN                NaN                1.0   
 0.000932            NaN                1.0                NaN   
 0.002361            NaN                NaN                NaN   
 0.002538            NaN                0.0                NaN   
 0.003179            NaN                NaN                0.0   
 0.003259            1.0                NaN                NaN   
 0.004062            NaN                NaN                NaN   
 0.005142            NaN                NaN                1.0   

           Channel 3-Digital  
-0.002204                1.0  
-0.001505                NaN  
-0.001448                NaN  
-0.001124                NaN  
-0.000200                0.0  
-0.000000                NaN  
 0.000056                1.0  
 0.000302                NaN  
 0.000551                NaN  
 0.000842                NaN  
 0.000932                NaN  
 0.002361                0.0  
 0.002538                NaN  
 0.003179                NaN  
 0.003259                NaN  
 0.004062                1.0  
 0.005142                NaN

Maybe you can have a look on the lower part of my question. There I entered my expectation. — Peter, Jul 26 '17 at 19:00
My solution is similar as your, but I cannot simulate your second output. Because `concat` sort values by indes (here by times), and if value not exist add NaNs. Can you explain more wos is necessary sort times? And why `0.000302` is first? Also if concat more columns, what is desired output? Thanks. — jezrael, Jul 27 '17 at 05:26

How to get two dataframes from one csv with multiple index columns

2 Answers2