
My objective is to use the "mock" file to normalise the "in" file. The rule is: if an entry in the mock file is in the same group and its position lies in the interval between position_start (inclusive) and position_end (exclusive), I have to subtract the mock score from data_values.

Below I present a simplified case; the actual tables are much larger and my solution is not fast enough. I have been searching for alternatives, but nothing so far seems to solve my problem. I am sure there is a faster way to do this and hope someone can help me.

I have written this code that does exactly what I want:

import pandas as pd

test_in_dict = {'group': [1, 1, 1, 2, 2, 2], 
                'position_start' :[10,20,30, 40, 50, 60], 
                'position_end' : [15, 25, 35, 45, 55, 65], 
                'data_values' : [11, 12, 13, 14, 15, 16]}
test_in = pd.DataFrame(data=test_in_dict)

test_mock_dict = {'group_m': [1, 1, 1, 1, 2, 2, 2, 2], 
                  'position_m' : [11, 16, 20, 52, 42, 47, 12, 65], 
                  'score_m': [1, 1, 2, 1, 3, 1, 2, 1]}
test_mock = pd.DataFrame(data=test_mock_dict)

for index_in, row_in in test_in.iterrows():
    for index_m, row_m in test_mock.iterrows():
        if (row_in['group'] == row_m['group_m']) and \
           (row_m['position_m'] >= row_in['position_start']) and \
           (row_m['position_m'] < row_in['position_end']):
            # write back into the frame; assigning to row_in alone would not persist
            test_in.at[index_in, 'data_values'] -= row_m['score_m']

How can I write something that does the same as the code above but avoids the double loop, which leaves me at O(N×M) complexity with N and M both large (the mock file has many more entries than the in file)?

  • Can't you use a dictionary to group rows by their `group` or `group_m`? – tobias_k Feb 22 '19 at 10:52
  • I might try that if there isn't a more elegant and robust solution using just the two data frames. It would help reduce complexity considerably, I guess (a small sketch of this idea follows below). – codeprimate123 Feb 22 '19 at 11:00
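
For illustration, a minimal sketch of the dictionary-grouping idea from the comment above (hedged: it still loops over test_in in Python, but the inner search is restricted to mock rows of the same group instead of scanning the whole mock table every time; it reuses test_in and test_mock from the question):

from collections import defaultdict

# build a lookup: group -> list of (position, score) taken from the mock frame
mock_by_group = defaultdict(list)
for g, pos, score in zip(test_mock['group_m'], test_mock['position_m'], test_mock['score_m']):
    mock_by_group[g].append((pos, score))

# single pass over test_in; only same-group mock entries are checked
for i, row in test_in.iterrows():
    entries = mock_by_group.get(row['group'], [])
    correction = sum(score for pos, score in entries
                     if row['position_start'] <= pos < row['position_end'])
    test_in.at[i, 'data_values'] -= correction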

2 Answers


What you want is a typical join problem. In pandas we use the merge method for this. You can rewrite your iterrows loops to this piece of code and it will be faster, since we use vectorized methods:

# first merge your two dataframes on the key column 'group' and 'group_m'
common = pd.merge(test_in, 
                    test_mock, 
                    left_on='group', 
                    right_on='group_m')

# after that, filter the rows you need with a boolean condition
df_filter = common[(common.position_m >= common.position_start) & (common.position_m < common.position_end)].copy()

# apply the calculation that is needed on column 'data_values'
df_filter['data_values'] = df_filter['data_values'] - df_filter['score_m']

# keep only the columns we need
df_filter = df_filter[['group', 'position_start', 'position_end', 'data_values']].reset_index(drop=True)

# now we need to get the rows from the original dataframe 'test_in' which did not get filtered
unmatch = test_in[(test_in.group.isin(df_filter.group)) & (~test_in.position_start.isin(df_filter.position_start)) & (~test_in.position_end.isin(df_filter.position_end))]

# finally we can concat these two together
df_final = pd.concat([df_filter, unmatch], ignore_index=True)

Output

    group   position_start  position_end    data_values
0   1       10              15              10
1   1       20              25              10
2   2       40              45              11
3   1       30              35              13
4   2       50              55              15
5   2       60              65              16
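
The matched (modified) rows come first and the untouched rows are appended at the end; as the OP notes later in the comments, a sort restores the original ordering. A hedged one-liner, assuming the df_final from above:

df_final = df_final.sort_values(['group', 'position_start']).reset_index(drop=True)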
Erfan
  • Please provide the correct example output by editing your question. – Erfan Feb 22 '19 at 11:17
  • I'll check, give me a minute – Erfan Feb 22 '19 at 11:20
  • Modified data values look ok, only the non-modified rows from test_in are missing in this output. – codeprimate123 Feb 22 '19 at 11:30
  • @codeprimate123 sorry for the mistake, this should work, please check. – Erfan Feb 22 '19 at 11:45
  • I am sorry for not including the output in the question, will do next time. This looks good; after an easy sort I get exactly what I need. Now I'll modify the code for my real use case and see how fast it runs, my double loop hasn't finished in 3 hours... – codeprimate123 Feb 22 '19 at 11:59
  • Sounds good, please update us on what the difference was in run time, I'm curious – Erfan Feb 22 '19 at 12:02
  • I am running the code, however I have the following problem: when I run `common = pd.merge(test_in, test_mock, left_on='group', right_on='group_m')` my memory usage skyrockets to 50 GB and more, which is a problem considering I have 4 GB of RAM in my little MacBook Air – codeprimate123 Feb 22 '19 at 12:18
  • I see, that's another problem though, but there are a couple of things you can do: 1. split your data into chunks, process them and export them (see the sketch after these comments). 2. Look at `dask`, see this: https://stackoverflow.com/questions/37756991/best-way-to-join-two-large-datasets-in-pandas. – Erfan Feb 22 '19 at 12:26
  • Problem was that the way merge works, identical keys cause inflation of the number of rows, which can lead to memory overflow and the kernel dying. So to help with this I used a simple trick: I made a list of all positions using `def getAllPositions(df): positions = []; for i, row in df.iterrows(): positions.extend(range(row['position_start'], row['position_end'])); return positions` and then filtered with `mock_in_position = mock[mock['position_start_m'].isin(positions)]`. This reduced the number of rows enough for my poor lil mac not to crash. – codeprimate123 Feb 22 '19 at 13:23
  • Even after reducing the rows from 4,518,991 to 81,282, the resulting common dataframe has 39,872,059 rows, so imagine before that... – codeprimate123 Feb 22 '19 at 13:26
  • I see your problem. This arises with your kind of cardinality: you have duplicates on both sides, so you get this many-to-many duplication. – Erfan Feb 22 '19 at 13:54
  • @codeprimate123 I proposed a new solution, which might solve your problem. Could you please check if this works? – Erfan Feb 22 '19 at 14:43
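
A minimal, hedged sketch of the chunking idea from the comments: process one group at a time so the many-to-many merge only blows up within a single group, and collapse the matches with a groupby sum before subtracting. The variable names here are illustrative, not from the original answer, and it reuses test_in and test_mock from the question:

results = []
for g, in_part in test_in.groupby('group'):
    mock_part = test_mock[test_mock['group_m'] == g]
    merged = in_part.merge(mock_part, left_on='group', right_on='group_m')
    hits = merged[(merged.position_m >= merged.position_start) &
                  (merged.position_m < merged.position_end)]
    # total score per interval, so intervals with several matches are handled in one go
    corr = hits.groupby(['group', 'position_start', 'position_end'])['score_m'].sum()
    part = in_part.set_index(['group', 'position_start', 'position_end'])
    part['data_values'] = part['data_values'].sub(corr, fill_value=0)
    results.append(part.reset_index())

df_final = pd.concat(results, ignore_index=True)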

The accepted answer is already in place and should work, but because the OP's data is huge, they cannot make the solution work. So I want to try an experimental answer; that is why I'm adding this as another answer and not editing my already accepted answer:

Extra step to the solution: as we see, the cardinality becomes many-to-many because there are duplicates in both key columns, group and group_m.

So I looked at the data and I see that each position_start value is rounded to base 10. So we can reduce the cardinality by making an artificial key column in our second dataframe test_mock, called position_m_round, like the following:

# make a function which rounds integers to the nearest base 10
def myround(x, base=10):
    return int(base * round(float(x)/base))

# apply this function to our 'position_m' column and create a new key column to join
test_mock['position_m_round'] = test_mock.position_m.apply(lambda x: myround(x))

    group_m position_m  score_m position_m_round
0   1       11          1       10
1   1       16          1       20
2   1       20          2       20
3   1       52          1       50
4   2       42          3       40

# do the merge again, but now we reduce cardinality because we have two keys to join
common = pd.merge(test_in, 
                    test_mock, 
                    left_on=['group', 'position_start'],
                    right_on=['group_m', 'position_m_round'])

'''
this part becomes the same as the original answer
'''

# after that, filter the rows you need with a boolean condition
df_filter = common[(common.position_m >= common.position_start) & (common.position_m < common.position_end)].copy()

# apply the calculation that is needed on column 'data_values'
df_filter['data_values'] = df_filter['data_values'] - df_filter['score_m']

# keep only the columns we need
df_filter = df_filter[['group', 'position_start', 'position_end', 'data_values']].reset_index(drop=True)

# now we need to get the rows from the original dataframe 'test_in' which did not get filtered
unmatch = test_in[(test_in.group.isin(df_filter.group)) & (~test_in.position_start.isin(df_filter.position_start)) & (~test_in.position_end.isin(df_filter.position_end))]

# finally we can concat these two together
df_final = pd.concat([df_filter, unmatch], ignore_index=True)

Output

    group   position_start  position_end    data_values
0   1       10              15              10
1   1       20              25              10
2   2       40              45              11
3   1       30              35              13
4   2       50              55              15
5   2       60              65              16
Erfan
  • I made the test data rounded so it would be easier to calculate in my head; in practice the positions, position_start and position_end values are huge integers, not rounded as in my little case. I found that I can make your solution work by reducing the mock data size with pre-filtering using a single iterrows() loop. However, since I can still see cases where the RAM usage could explode, and I would like a robust solution, I will use a tool designed for calculating intersections (it uses more optimal data structures, not tables, but inputs and outputs can be tables). Thank you, I learned a lot! – codeprimate123 Feb 22 '19 at 15:03
  • I understand that, in this case that would not work indeed. Good luck and no thanks! – Erfan Feb 22 '19 at 15:15
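
For completeness, a hedged sketch of the interval-lookup direction the OP mentions (this is not the external tool referred to in the comment; it only illustrates how sorted positions plus numpy.searchsorted and a prefix sum avoid the many-to-many merge entirely, using the test_in and test_mock frames from the question):

import numpy as np

def normalise(test_in, test_mock):
    """Subtract, for each test_in row, the summed mock scores whose position
    falls in [position_start, position_end) within the same group."""
    out = test_in.copy()
    for g, mock_g in test_mock.groupby('group_m'):
        mock_g = mock_g.sort_values('position_m')
        pos = mock_g['position_m'].to_numpy()
        # prefix sums of the scores give any range total in O(1)
        csum = np.concatenate(([0], mock_g['score_m'].cumsum().to_numpy()))
        mask = out['group'] == g
        lo = np.searchsorted(pos, out.loc[mask, 'position_start'].to_numpy(), side='left')
        hi = np.searchsorted(pos, out.loc[mask, 'position_end'].to_numpy(), side='left')
        out.loc[mask, 'data_values'] -= csum[hi] - csum[lo]
    return out

print(normalise(test_in, test_mock))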