
I am currently working on code that reads two data frames, combines them, and writes the result to a separate CSV. However, the result contains duplicates. Both data frames are recorded at 10 Hz, but one has its timestamps truncated to whole seconds (there will, for example, be 10 rows with the time-column value 14:51:21), while the other has millisecond precision (14:51:21.352, 14:51:21.452, ...). When I merge the data frames, I get an output that is 10x longer than the original data. My guess is that each row with a 1-second timestamp merges against 10 different rows (see the sanity-check sketch after the sample frames below).

Here's an example of what my data frames look like:

Data frame 1:

TIME                    FUEL    OTHER    OTHER2
...
2020-12-03 22:06:16.944 124     49       157
2020-12-03 22:06:17.044 113     42       132
2020-12-03 22:06:17.144 144     11       111
2020-12-03 22:06:17.244 224     29       557
2020-12-03 22:06:17.344 1224    422      151
2020-12-03 22:06:17.444 324     56       257
2020-12-03 22:06:17.544 18      29       157
2020-12-03 22:06:17.644 144     99       182
2020-12-03 22:06:17.744 672     14       977
2020-12-03 22:06:17.844 655     13       87
2020-12-03 22:06:17.944 194     15       14
...

Data frame 2:

TIME                    SPEED   OTHER    OTHER2
...
2020-12-03 22:06:16     50      100      52
2020-12-03 22:06:17     60      130      22
2020-12-03 22:06:17     100     105      13
2020-12-03 22:06:17     24      99       104
2020-12-03 22:06:17     44      24       22
2020-12-03 22:06:17     66      20       257
2020-12-03 22:06:17     110     95       102
2020-12-03 22:06:17     90      22       24
2020-12-03 22:06:17     10      64       51
2020-12-03 22:06:17     11      66       61
2020-12-03 22:06:17     100     100      103
...
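
To make this reproducible, here is how the two sample frames above can be constructed (df1/df2 are placeholder names, values transcribed from the tables), together with a quick sanity check of my theory: a plain pd.merge on the shared second is many-to-many and multiplies the rows roughly 10x, whereas pd.merge_asof always returns exactly one output row per left-hand row.

import pandas as pd

# Sample frames transcribed from the tables above (df1/df2 are placeholder names)
df1 = pd.DataFrame({
    'TIME': pd.date_range('2020-12-03 22:06:16.944', periods=11, freq='100ms'),
    'FUEL': [124, 113, 144, 224, 1224, 324, 18, 144, 672, 655, 194],
    'OTHER': [49, 42, 11, 29, 422, 56, 29, 99, 14, 13, 15],
    'OTHER2': [157, 132, 111, 557, 151, 257, 157, 182, 977, 87, 14],
})
df2 = pd.DataFrame({
    'TIME': pd.to_datetime(['2020-12-03 22:06:16'] + ['2020-12-03 22:06:17'] * 10),
    'SPEED': [50, 60, 100, 24, 44, 66, 110, 90, 10, 11, 100],
    'OTHER': [100, 130, 105, 99, 24, 20, 95, 22, 64, 66, 100],
    'OTHER2': [52, 22, 13, 104, 22, 257, 102, 24, 51, 61, 103],
})

# Theory check: a plain merge on the truncated second is many-to-many,
# so every whole-second row pairs with every sub-second row in that second
blown_up = pd.merge(df1.assign(SEC=df1['TIME'].dt.floor('s')), df2,
                    left_on='SEC', right_on='TIME', suffixes=('_1', '_2'))
print(len(df1), len(blown_up))  # 11 rows in, 101 rows out (~10x)

# merge_asof, by contrast, returns exactly one row per left-hand row
print(len(pd.merge_asof(df1, df2, on='TIME', direction='nearest')))  # 11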

I am using the merge_asof method.

CODE THAT IS MERGING MY DATA FRAMES

import pandas as pd

# Tag each right-hand row with a copy of its own timestamp (TIME2) so the match distance can be measured
merged_df = pd.merge_asof(ewholedf, wholedf.sort_values('TIME').assign(TIME2=wholedf['TIME']), on='TIME', direction='nearest')
merged_df['TIME_Diff'] = merged_df['TIME'].sub(merged_df['TIME2']).abs()
# Keep only the closest match per right-hand timestamp; the result has to be assigned back, otherwise it is discarded
merged_df = merged_df.sort_values(by=['TIME2', 'TIME_Diff']).drop_duplicates(subset=['TIME2']).sort_index().drop(['TIME2', 'TIME_Diff'], axis=1)
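
One direction I am considering as a fix (a sketch only, not tested on my real data: it assumes wholedf is the 1-second-resolution frame, that both frames really are 10 Hz, that the rows within each second are in chronological order, and the 100 ms step and the tolerance value are my own guesses):

# Sketch: spread the duplicated whole seconds in wholedf into synthetic 100 ms steps
# so every timestamp becomes unique before the asof merge
wholedf = wholedf.sort_values('TIME', kind='mergesort')  # stable sort keeps within-second order
wholedf['TIME'] = wholedf['TIME'] + pd.to_timedelta(wholedf.groupby('TIME').cumcount() * 100, unit='ms')

# With unique sub-second timestamps on both sides, each fine-grained row gets at most one partner
merged_df = pd.merge_asof(ewholedf.sort_values('TIME'), wholedf, on='TIME', direction='nearest', tolerance=pd.Timedelta('100ms'))

If that works, the TIME2/TIME_Diff deduplication above should no longer be necessary.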

OUTPUT

I can't attach the output as text here because of its length, so I will post a screenshot of the output to give you a better understanding of the situation.

The data in the data frame examples above is sample data, which is why the data in the image looks different (this should not affect the output).

DUPLICATIONS (OUTPUT)

[Screenshot of the merged CSV output; separator: ","]

In the image above, the red arrows mark how frequently the duplicates recur, and the yellow marker shows that the merge is not keeping the time column at 10 Hz (there are more than 10 rows with the same second).

If my theory is right, please let me know how to solve it. If not, please point out the actual problem and its solution.

Hopefully this was enough information. Thank you in advance!

  • It might be due to the indices. Have you tried reset_index(drop=True) for both dataframes before merging? – Yashar Ahmadov Nov 01 '21 at 17:07
  • Hey! Thank you for your response. I have tried doing it to the merged data frame but not the separate. I will try that! – Leopold Wahlbeck Nov 02 '21 at 07:14
  • @YasharAhmadov I have now tried this method, leaving me with this error: `raise MergeError( pandas.errors.MergeError: Incompatible merge dtype, dtype('O') and dtype('O'), both sides must have numeric dtype` – Leopold Wahlbeck Nov 02 '21 at 08:03
  • Please have a look at this answer and correct accordingly: https://stackoverflow.com/questions/16628819/convert-pandas-timezone-aware-datetimeindex-to-naive-timestamp-but-in-certain-t/34687479#34687479 – Yashar Ahmadov Nov 02 '21 at 12:04
  • @YasharAhmadov I have looked through the thread and I am not sure how it is connected to my problem; yes, it is the same type of error, but not the same cause? If I have understood the thread correctly, the problem there has to do with time zones and formatting? I'm not sure how to apply the example `tz_localize('US/Eastern').sort_index()` to my code. – Leopold Wahlbeck Nov 02 '21 at 13:41
  • Share a piece from your data and all the relevant code, please. Otherwise it is difficult to guess. – Yashar Ahmadov Nov 05 '21 at 14:40
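
A note on the MergeError quoted above: dtype('O') means the TIME columns are object dtype (typically still plain strings), and merge_asof requires datetime or numeric keys on both sides. A minimal sketch of the conversion, using the frame names from my code above:

# dtype('O') = object dtype, i.e. the TIME columns are probably still strings;
# merge_asof needs datetime (or numeric) keys on both sides
ewholedf['TIME'] = pd.to_datetime(ewholedf['TIME'])
wholedf['TIME'] = pd.to_datetime(wholedf['TIME'])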
