5

I am attempting to combine two sets of data, but I can't figure out which method is most suitable (join, merge, concat, etc.) for this application, and the documentation doesn't have any examples that do what I need to do.

I have two sets of data, structured like so:

>>> A
Time     Voltage
1.0      5.1
2.0      5.5
3.0      5.3
4.0      5.4
5.0      5.0

>>> B
Time     Current
-1.0     0.5
0.0      0.6
1.0      0.3
2.0      0.4
3.0      0.7

I would like to combine the data columns and merge the 'Time' column together so that I get the following:

>>> AB
Time     Voltage     Current
-1.0                 0.5
0.0                  0.6
1.0      5.1         0.3
2.0      5.5         0.4
3.0      5.3         0.7
4.0      5.4            
5.0      5.0            

I've tried AB = merge_ordered(A, B, on='Time', how='outer'), and while it successfully combined the data, it output something akin to:

>>> AB
Time     Voltage     Current
-1.0                 0.5
0.0                  0.6
1.0      5.1            
1.0                  0.3
2.0      5.5            
2.0                  0.4
3.0      5.3            
3.0                  0.7
4.0      5.4            
5.0      5.0            

You'll note that it did not combine rows with shared 'Time' values.

I have also tried merging a la AB = A.merge(B, on='Time', how='outer'), but that outputs something combined, but not sorted, like so:

>>> AB
Time     Voltage     Current
-1.0                 0.5
0.0                  0.6
1.0      5.1            
2.0      5.5            
3.0      5.3         0.7
4.0      5.4            
5.0      5.0            
1.0                  0.3
2.0                  0.4

...it essentially skips some of the data in 'Current' and appends it to the bottom, but it does so inconsistently. And again, it does not merge the rows together.

I have also tried AB = pandas.concat(A, B, axis=1), but the result does not get merged. I simply get, well, the concatenation of the two DataFrames, like so:

>>> AB
Time     Voltage     Time     Current
1.0      5.1         -1.0     0.5
2.0      5.5         0.0      0.6
3.0      5.3         1.0      0.3
4.0      5.4         2.0      0.4
5.0      5.0         3.0      0.7

I've been scouring the documentation and here to try to figure out the exact differences between merge and join, but from what I gather they're pretty similar. Still, I haven't found anything that specifically answers the question of "how to merge rows that share an identical key/index". Can anyone enlighten me on how to do this? I only have a few days-worth of experience with Pandas!

J. Day
  • 73
  • 2
  • 7
  • make sure that `Time` column is of the same dtype in both DFs – MaxU - stand with Ukraine May 31 '17 at 21:29
  • Would you move your solution to an answer box, please? We do not put solutions in questions here. If you wish to ask whether it is a good solution you can ask people who have made other answers. – halfer Jun 01 '17 at 09:17
  • @halfer Looks like you edited my post for me. Is there anything left I need to do? I'm afraid I don't know what 'answer box' you're referring to... – J. Day Jun 02 '17 at 16:16
  • If you would like to do as I've suggested, click on the 'Answer Your Question' button below, post your "Solution found" material in the box that appears, and then remove that material from your question. We like to have some separation between question material and answer material here, it makes things clearer. Thanks! – halfer Jun 02 '17 at 16:22

3 Answers3

8

merge
merge combines on columns. By default it takes all commonly named columns. Otherwise, you can specify which columns to combine on. In this example, I chose, Time.

A.merge(B, 'outer', 'Time')

   Time  Voltage  Current
0   1.0      5.1      0.3
1   2.0      5.5      0.4
2   3.0      5.3      0.7
3   4.0      5.4      NaN
4   5.0      5.0      NaN
5  -1.0      NaN      0.5
6   0.0      NaN      0.6

join
join combines on index values unless you specify the left hand side's column instead. That is why I set the index for the right hand side and Specify a column for the left hand side Time.

A.join(B.set_index('Time'), 'Time', 'outer')

   Time  Voltage  Current
0   1.0      5.1      0.3
1   2.0      5.5      0.4
2   3.0      5.3      0.7
3   4.0      5.4      NaN
4   5.0      5.0      NaN
4  -1.0      NaN      0.5
4   0.0      NaN      0.6    ​

pd.concat
concat combines on index values... so I create a list comprehension in which I iterate over each dataframe I want to combine [A, B]. In the comprehension, each dataframe assumes the name d, hence the for d in [A, B]. axis=1 says to combine them side by side thus using the index as the joining feature.

pd.concat([d.set_index('Time') for d in [A, B]], axis=1).reset_index()

   Time  Voltage  Current
0  -1.0      NaN      0.5
1   0.0      NaN      0.6
2   1.0      5.1      0.3
3   2.0      5.5      0.4
4   3.0      5.3      0.7
5   4.0      5.4      NaN
6   5.0      5.0      NaN

combine_first

A.set_index('Time').combine_first(B.set_index('Time')).reset_index()

   Time  Current  Voltage
0  -1.0      0.5      NaN
1   0.0      0.6      NaN
2   1.0      0.3      5.1
3   2.0      0.4      5.5
4   3.0      0.7      5.3
5   4.0      NaN      5.4
6   5.0      NaN      5.0
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • So then there indeed is little-to-no difference in `join` vs. `merge`? Interesting. And by your example, it looks like `concat` results in exactly what I desired, albeit with some arguments that I don't understand the function of. Can you explain what's going on there? Where is `d` coming from? And what's the purpose of `reset_index`? (I thought `Time` _was_ my index?) – J. Day May 31 '17 at 21:59
  • To expand on the use of `concat` in this case: I attempted to implement it as you have written, but resulted in a similar output as with `merge`. That is: the data combined _mostly_ as I had hoped for, but every Nth row does not merge, as shown in my original post. Considering that multiple methods are producing identical results, it would seem to me that the problem isn't necessarily the method, but something in my data. Could it be an issue with size? Speed? What would cause it to regularly skip combining rows? – J. Day May 31 '17 at 22:05
  • 1
    I think @MaxU is identifying you issue. Try `A = A.astype.float()` and `B = B.astype(float)` prior to joining. – piRSquared May 31 '17 at 22:09
  • Just tried this, but there was no change in the output. Every 8th or 9th row doesn't combine. :( – J. Day May 31 '17 at 22:39
  • @J.Day Can you output `A.dtypes` and `B.dtypes` and include it in your post? – piRSquared May 31 '17 at 22:41
  • I verified that the dtypes of 'Time' in both DF's is 'float64'. However, I found a workaround based on @MaxU's respond above, and I have updated my initial post to reflect it. Is there anything wrong with the method I posted in my edit? – J. Day May 31 '17 at 23:19
  • 1
    @J.Day That looks good to me. I'd definitely choose MaxU's answer. – piRSquared May 31 '17 at 23:21
2

It should work properly if the Time column is of the same dtype in both DFs:

In [192]: A.merge(B, how='outer').sort_values('Time')
Out[192]:
   Time  Voltage  Current
5  -1.0      NaN      0.5
6   0.0      NaN      0.6
0   1.0      5.1      0.3
1   2.0      5.5      0.4
2   3.0      5.3      0.7
3   4.0      5.4      NaN
4   5.0      5.0      NaN

In [193]: A.dtypes
Out[193]:
Time       float64
Voltage    float64
dtype: object

In [194]: B.dtypes
Out[194]:
Time       float64
Current    float64
dtype: object

Reproducing your problem:

In [198]: A.merge(B.assign(Time=B.Time.astype(str)), how='outer').sort_values('Time')
Out[198]:
   Time  Voltage  Current
5  -1.0      NaN      0.5
6   0.0      NaN      0.6
0   1.0      5.1      NaN
7   1.0      NaN      0.3
1   2.0      5.5      NaN
8   2.0      NaN      0.4
2   3.0      5.3      NaN
9   3.0      NaN      0.7
3   4.0      5.4      NaN
4   5.0      5.0      NaN

In [199]: B.assign(Time=B.Time.astype(str)).dtypes
Out[199]:
Time        object   # <------ NOTE
Current    float64
dtype: object

Visually it's hard to distinguish:

In [200]: B.assign(Time=B.Time.astype(str))
Out[200]:
   Time  Current
0  -1.0      0.5
1   0.0      0.6
2   1.0      0.3
3   2.0      0.4
4   3.0      0.7

In [201]: B
Out[201]:
   Time  Current
0  -1.0      0.5
1   0.0      0.6
2   1.0      0.3
3   2.0      0.4
4   3.0      0.7
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • I've verified that the datatypes are the same (float64) for the 'Time' column in both DF's. Using the `merge` method as you've suggested results in similar inconsistent merging, similar to my example of `A.merge` in my initial post. In this instance, the data DOES get ordered, but every Nth row doesn't combine. I wish I could include an example but the char count is too low... – J. Day May 31 '17 at 21:56
  • @J.Day, try this: `pd.merge_ordered(A.assign(Time=A.Time.round(4)), B.assign(Time=B.Time.round(4)))` – MaxU - stand with Ukraine May 31 '17 at 22:05
  • I think I understand where this is going. The only caveat is that my actual data column name is 'Time, (sec)', not just 'Time' (I simplified my initial example for the sake of brevity). How can I write this line to accommodate the extra punctuation in the name? – J. Day May 31 '17 at 22:33
0

Solution found As per the suggestions below, I had to round the numbers in the 'Time' column prior to merging them, despite the fact that they were both of the same dtype (float64). The suggestion was to round like so:

A = A.assign(A.Time = A.Time.round(4))

But in my actual situation, the column was labeled 'Time, (sec)' (there was punctuation that screwed with the assignment. So instead I used the following line to round it:

A['Time, (sec)'] = A['Time, (sec)'].round(4)

And it worked like a charm. Are there any issues with doing it like that?

J. Day
  • 73
  • 2
  • 7