0

I'm looking to optimise the following problem [simplified version here]:

I have two data frames, the first contains the information.

user_id game_id score ON
1 1 450 1
1 2 200 1
1 3 400 1
2 1 225 1
2 2 150 1
2 3 200 1

The second contains the conditions.

game_id game_id_ref req_score type
2 1 150 1
3 1 200 1
1 1 400 2
3 2 175 1

The conditions should be evaluated on the information data frame in the following way.

The conditions with type == 1 describe TURN ON conditions, and enforce that a game can only TURN ON if the score on the game from the game_id_ref >= req_score, so the first row from the conditions should be read as; the game with game_id == 2 can only TURN ON for user X when they have a score of 150 or higher on the game with game_id == 1.

The conditions with type == 2 describe TURN OFF conditions, and enforce that a game must be TURNED OFF if the score on the game from the game_id_ref >= req_score, so the third row from the conditions should be read as; for user X the game with game_id == 1 must be TURNED OFF when they have a score of 400 or higher on the game with game_id == 1.

In the information data frame I have a column ON which indicates if a game is ON for a particular user. The default is 1 [the game is ON] but this is before evaluating all the conditions. I am looking for the fastest way to evaluate the conditions for each user separately, and return the same information data frame, however now with ON = 0 if for a user the game fails to meet criteria type 1 or met criteria type 2.

So for this mock example, the required output would be:

user_id game_id score ON
1 1 450 0
1 2 200 1
1 3 400 1
2 1 225 1
2 2 150 1
2 3 200 0

My current solution has been to create a separate function in which I check this by applying a for_loop over all the rows of the conditions table [approx 100 conditions], and using this function in a group_map function, on the information data frame grouped by the user_ids [approx 350000 unique users]. While this works relatively ok [approx 10 min], I would like to know if someone has a much faster solution for this.

Thanks!

Joost
  • 11
  • 2

1 Answers1

0

Probably you can fine-tune your solution to be a bit faster in R but without seeing your code it is hard to say. Your solution sounds quite reasonable to me already.

However, if you have so much data, this kind of problem can be solved faster with SQL. I assume you already use some data management system. SQL uses indexing to make JOIN very fast, which you can never achieve in R (unless you write a database management system in R, not recommended). After you join your information and condition data frame on the game_id column, you can check all the conditions which should be fast. That can also be done in SQL by the way.

Sorry if it is not the expected answer. If you are not familiar with SQL, and you feel like there is no way you want to learn a new technology for a simple question like this, please provide your code so far so we can see what could be improved

zerz
  • 160
  • 1
  • 10