My objective is to use the "mock" file to normalise the "in" file. The rule is: whenever an entry in the mock file belongs to the same group and its position falls in the half-open interval [position_start, position_end), I have to subtract the mock score from data_values.
Below I present a simplified case; the actual tables are much larger and my solution is not fast enough. I have been searching for alternatives, but nothing so far solves my problem. I am sure there is a faster way and hope someone can help me.
I have written this code that does exactly what I want:
import pandas as pd

test_in_dict = {'group': [1, 1, 1, 2, 2, 2],
                'position_start': [10, 20, 30, 40, 50, 60],
                'position_end': [15, 25, 35, 45, 55, 65],
                'data_values': [11, 12, 13, 14, 15, 16]}
test_in = pd.DataFrame(data=test_in_dict)

test_mock_dict = {'group_m': [1, 1, 1, 1, 2, 2, 2, 2],
                  'position_m': [11, 16, 20, 52, 42, 47, 12, 65],
                  'score_m': [1, 1, 2, 1, 3, 1, 2, 1]}
test_mock = pd.DataFrame(data=test_mock_dict)
for index_in, row_in in test_in.iterrows():
    for index_m, row_m in test_mock.iterrows():
        if (row_in['group'] == row_m['group_m']) & \
           (row_m['position_m'] >= row_in['position_start']) & \
           (row_m['position_m'] < row_in['position_end']):
            # write back with .at: assigning to the row copy returned by
            # iterrows() does not modify test_in itself
            test_in.at[index_in, 'data_values'] -= row_m['score_m']
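For the toy data above this leaves test_in['data_values'] as [10, 10, 13, 11, 15, 16] (e.g. the first row loses the score 1 of the mock entry at position 11).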
How can I write something that does the same as the code above but avoids the double loop, which leaves me with O(N×M) complexity, with N and M both large (the mock file has many more entries than the in file)?
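One direction I have been experimenting with (only a sketch, not yet tested on the real data) is replacing the double loop with a single merge on the group columns, filtering the merged rows to the interval, and subtracting a per-row sum of the matching scores. I am not sure it really helps, because the merge still materialises every within-group pair, which could be very large in memory:

merged = test_in.reset_index().merge(test_mock, left_on='group',
                                     right_on='group_m', how='left')
# keep only mock positions falling inside [position_start, position_end)
in_interval = (merged['position_m'] >= merged['position_start']) & \
              (merged['position_m'] < merged['position_end'])
# total mock score per original "in" row (the 'index' column comes from reset_index)
totals = merged.loc[in_interval].groupby('index')['score_m'].sum()
# subtract, treating rows without any match as 0; cast back to int afterwards
test_in['data_values'] = test_in['data_values'].sub(totals, fill_value=0).astype(int)

If the merge is too memory-hungry on the real tables, I suppose one could instead sort the mock positions within each group and use something like numpy.searchsorted to locate each interval, but I have not worked that out. Is there a standard, faster pattern for this kind of interval lookup?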