
Suppose we have a list of arrays containing indices. Each array (i.e. row) is associated with a specific user id. The algorithm only stores indices if the user appears more than once in the data, hence I filter out the single-element arrays with `user_split_indices = list(filter(lambda x: len(x) > 1, user_split_indices))`.

(contents of `user_ind` not reproduced here)
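For illustration, the filtered structure might look like this (a hypothetical sketch; the row numbers are borrowed from the comments below):

import numpy as np

# Hypothetical sketch: after the len(x) > 1 filter, each inner array holds
# the row positions of one user that appears more than once.
user_ind = [np.array([10, 11]),       # a user appearing in rows 10 and 11
            np.array([12, 13]),       # a user appearing in rows 12 and 13
            np.array([21, 22, 23])]   # a user appearing in rows 21, 22, 23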

I then generate the permutations of each list. The last column must not contain duplicates after the permutations are generated, hence I apply `drop_duplicates(subset=u_data_len-1, keep="first")` to the `perm_` dataframe.
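As a minimal worked example of this step, using the three rows 21, 22, 23 that also appear as test data in the comments below:

import itertools
import pandas as pd

u_data = [21, 22, 23]
u_data_len = len(u_data)

# All 6 permutations, one per row, columns 0..2.
perm_ = pd.DataFrame(itertools.permutations(u_data))
#     0   1   2
# 0  21  22  23
# 1  21  23  22
# 2  22  21  23
# 3  22  23  21
# 4  23  21  22
# 5  23  22  21

# Keep only the first row for each distinct value in the last column.
perm_ = perm_.drop_duplicates(subset=u_data_len - 1, keep="first")
#     0   1   2
# 0  21  22  23
# 1  21  23  22
# 3  22  23  21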

import itertools
import timeit
import numpy as np
import pandas as pd

def _groupby_user(user_indices, order):
    # A stable mergesort preserves the original row order within each group.
    sort_kind = "mergesort" if order else "quicksort"
    users, user_position, user_counts = np.unique(user_indices,
                                                  return_inverse=True,
                                                  return_counts=True)
    # Split the row positions (sorted by user) into one array per user.
    user_split_indices = np.split(np.argsort(user_position, kind=sort_kind),
                                  np.cumsum(user_counts)[:-1])
    return user_split_indices
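For instance, with a small made-up `user_indices` array (hypothetical values), the helper returns one array of row positions per user:

user_indices = np.array([5, 7, 5, 7, 9])   # hypothetical user ids, one per row
print(_groupby_user(user_indices, True))
# [array([0, 2]), array([1, 3]), array([4])]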

def split_by_num_new(data, k):
    temp_indices = pd.DataFrame()
    user_indices = data.user.to_numpy()
    user_split_indices = _groupby_user(user_indices, True)
    # Keep only users that appear more than once.
    user_split_indices = list(filter(lambda x: len(x) > 1, user_split_indices))

    for u_data in user_split_indices:
        u_data_len = len(u_data)
        perm_ = (pd.DataFrame(itertools.permutations(u_data))
                 .drop_duplicates(subset=u_data_len - 1, keep="first")
                 .set_index(u_data_len - 1).stack().reset_index()
                 .rename(columns={'level_1': 'user_', u_data_len - 1: 'ind',
                                  k - 1: 'label_ind'}))
        temp_indices = pd.concat([temp_indices, perm_], axis=0)
    return temp_indices, user_split_indices

The function is called using the code below:

data = data.reset_index()
temp_indices, user_ind = split_by_num_new(data, k=1)

The input data is shown below:

(contents of the `data` table not reproduced here)

Note that the index must be reset so that the row labels in the dataset match the positional indices returned after grouping on the user column.
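For example, if `data` came out of an earlier filtering step its index may have gaps; a minimal sketch of the realignment (hypothetical values):

import pandas as pd

# Hypothetical: the index has gaps after an earlier filtering step.
data = pd.DataFrame({"user": [5, 7, 5]}, index=[0, 3, 7])

# _groupby_user returns positions 0..len(data)-1, so reset the index
# to make the row labels line up with those positions.
data = data.reset_index()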

An example of the output table temp_indices:

(contents of `temp_indices` not reproduced here)

The part of the code that I am trying to speed up is the loop in the `split_by_num_new(data, k)` function when the data grows to over 2 million rows:

for u_data in user_split_indices:
    u_data_len = len(u_data)
    perm_ = (pd.DataFrame(itertools.permutations(u_data))
             .drop_duplicates(subset=u_data_len - 1, keep="first")
             .set_index(u_data_len - 1).stack().reset_index()
             .rename(columns={'level_1': 'user_', u_data_len - 1: 'ind',
                              k - 1: 'label_ind'}))
    temp_indices = pd.concat([temp_indices, perm_], axis=0)
return temp_indices, user_split_indices

Below is the output in detail as well as the time breakdown:

def split_by_num_new(data, k):
    temp_indices = pd.DataFrame()
    user_indices = data.user.to_numpy()
    user_split_indices = _groupby_user(user_indices, True)
    user_split_indices = list(filter(lambda x: len(x) > 1, user_split_indices))
    loop_start = timeit.default_timer()
    for u_data in user_split_indices:
        u_data_len = len(u_data)             
        perm_ = pd.DataFrame(itertools.permutations(u_data))
        print(perm_)
        perm_ = perm_.drop_duplicates(subset=u_data_len - 1, keep="first")
        print(perm_)
        perm_ = perm_.set_index(u_data_len - 1)
        print(perm_)
        perm_ = perm_.stack().reset_index()
        print(perm_)
        perm_ = perm_.rename(columns={'level_1': 'user_', u_data_len - 1: 'ind',
                                      k - 1: 'label_ind'})
        print(perm_)
        concat_start = timeit.default_timer()
        temp_indices = pd.concat([temp_indices, perm_], axis=0)
        concat_stop = timeit.default_timer()
        print('concat Time Completed at : ', concat_stop - concat_start)
    loop_stop = timeit.default_timer()
    print('Loop Time Completed at : ', loop_stop - loop_start)
    return temp_indices,user_split_indices

(detailed output not reproduced here)

Comments:

  • there are a number of answers on SO on `concat()` performance - one for example https://stackoverflow.com/questions/57000903/what-is-the-fastest-and-most-efficient-way-to-append-rows-to-a-dataframe – Rob Raymond May 01 '21 at 17:31
  • I will definitely test this out and provide an update – Sade May 02 '21 at 11:30
  • 1
    I think you'd probably have a better time dropping out of Pandas land and working with e.g. a dict of `{user_id: [index, index, ...]}`... – AKX May 03 '21 at 07:29
  • `{10: 11, 11: 10}` append Time Completed at : 5.000001692678779e-07 `{12: 13, 13: 12}` append Time Completed at : 3.000000106112566e-07 Loop Time Completed at : 0.0003059999999095453 Time Completed at : 0.0007989999999153952 – Sade May 03 '21 at 07:54
  • The loop, reformatted for readability:

    loop_start = timeit.default_timer()
    for u_data in user_split_indices:
        u_data_len = len(u_data)
        perm_ = dict(itertools.permutations(u_data))
        print(perm_)
        append_start = timeit.default_timer()
        temp_indices.append(perm_)
        append_stop = timeit.default_timer()
        print('append Time Completed at : ', append_stop - append_start)
    loop_stop = timeit.default_timer()
    print('Loop Time Completed at : ', loop_stop - loop_start)
    return temp_indices, user_split_indices

    – Sade May 03 '21 at 07:55
  • I am just checking for any issues that come up for data cases when there are multiples. Can I remove the loop? – Sade May 03 '21 at 07:56
  • this is the error I get if there are multiple indices: `perm_ = dict(itertools.permutations(u_data))` raises `ValueError: dictionary update sequence element #0 has length 3; 2 is required` – Sade May 03 '21 at 08:17
  • `perm_ = pd.DataFrame(itertools.permutations(u_data))`; `perm_ = perm_.set_index(perm_.shape[1]-1).to_dict()` – Sade May 03 '21 at 09:38
  • It is the sub-tuples that I need to flatten now – Sade May 03 '21 at 09:40
  • test data: `u_data = pd.DataFrame({0: [21,21,22,22,23,23], 1: [22,23,21,23,21,22], 2: [23,22,23,21,22,21]})` – Sade May 03 '21 at 09:41
  • output {0: {23: 22, 22: 23, 21: 23}, 1: {23: 21, 22: 21, 21: 22}} – Sade May 03 '21 at 09:42
  • How does one get it to this order {23: 22, 22: 23, 21: 23,23: 21, 22: 21, 21: 22} from the output above? – Sade May 03 '21 at 10:02
  • I have posted the updated function below. – Sade May 03 '21 at 11:06

1 Answer


Function using dict:

def split_by_num_new(data, k):
    temp_indices = []
    user_indices = data.user.to_numpy()
    user_split_indices = _groupby_user(user_indices, True)
    user_split_indices = list(filter(lambda x: len(x) > 1, user_split_indices))

    loop_start = timeit.default_timer()
    for u_data in user_split_indices:
        u_data_len = len(u_data)
        perm_ = pd.DataFrame(itertools.permutations(u_data))         
        p_ = perm_.set_index(perm_.shape[1]-1).to_dict()
        append_start = timeit.default_timer()
        temp_indices.append(p_)
        append_stop = timeit.default_timer()
        print('append Time Completed at : ', append_stop - append_start)
    loop_stop = timeit.default_timer()
    print('Loop Time Completed at : ', loop_stop - loop_start)
    return temp_indices
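Using the test permutation frame from the comments above, the nested dict produced inside this loop looks as follows (note that duplicate keys in the last column are overwritten, so the last occurrence wins):

import pandas as pd

# Test permutation frame from the comments (user with rows 21, 22, 23):
perm_ = pd.DataFrame({0: [21, 21, 22, 22, 23, 23],
                      1: [22, 23, 21, 23, 21, 22],
                      2: [23, 22, 23, 21, 22, 21]})

# Index by the last column; to_dict() returns {column: {last_value: value}}.
p_ = perm_.set_index(perm_.shape[1] - 1).to_dict()
print(p_)
# {0: {23: 22, 22: 23, 21: 23}, 1: {23: 21, 22: 21, 21: 22}}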

Call the function that uses dict:

temp_indices = split_by_num_new(data, k=1)
c = pd.DataFrame()
for ind in range(len(temp_indices)):
    print(ind)
    # temp_indices[ind][0] is the column-0 mapping {last value: first value}.
    c = pd.concat([c, pd.DataFrame(temp_indices[ind][0].items())], axis=0)

append Time Completed at :  5.999991117278114e-07
append Time Completed at :  5.999991117278114e-07
Loop Time Completed at :  0.00579959999959101
Total time on the entire dataset of (1311612, 60): 3089.5768801999984 secs
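Note that the question on `concat()` performance linked in the comments suggests avoiding repeated `pd.concat` inside a loop; a minimal sketch of the collect-then-concatenate-once pattern (my own assumption, not the code benchmarked above):

# Assumes temp_indices is the list of nested dicts returned above.
# Build all the small frames first, then concatenate a single time.
pieces = [pd.DataFrame(temp_indices[ind][0].items())
          for ind in range(len(temp_indices))]
c = pd.concat(pieces, axis=0)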

Function using dict, map and lambda:

def func(u_data):
    perm_ = pd.DataFrame(itertools.permutations(u_data))         
    p_ = perm_.set_index(perm_.shape[1]-1).to_dict()
    return p_

def split_by_num_new(data, k):
    temp_indices = []
    user_indices = data.user.to_numpy()
    user_split_indices = _groupby_user(user_indices, True)
    user_split_indices = list(filter(lambda x: len(x) > 1, user_split_indices))

    temp_indices = list(map(func, user_split_indices))
    return temp_indices

Calling function using dict, map and lambda:

f_start = timeit.default_timer()
temp_indices_ = split_by_num_new(data, k=1)
function_time = timeit.default_timer()
print('funct Time Completed at : ', function_time - f_start)

temp_indices = pd.DataFrame()
for ind in range(len(temp_indices_)):
    # print(ind)
    temp_indices = pd.concat([temp_indices, pd.DataFrame(temp_indices_[ind][0].items())], axis=0)
temp_indices = temp_indices.rename(columns={0: 'ind', 1: 'label_ind'})

Total time on the entire dataset of (1311612, 60): 2083.1114619 secs

Older function that uses pandas:

def split_by_num_pandas(data, k):
    temp_indices = pd.DataFrame()
    user_indices = data.user.to_numpy()
    user_split_indices = _groupby_user(user_indices, True)
    user_split_indices = list(filter(lambda x: len(x) > 1, user_split_indices))
    loop_start = timeit.default_timer()
    for u_data in user_split_indices:
        u_data_len = len(u_data)
        perm_ = (pd.DataFrame(itertools.permutations(u_data))
                 .drop_duplicates(subset=u_data_len - 1, keep="first")
                 .set_index(u_data_len - 1).stack().reset_index()
                 .rename(columns={'level_1': 'user_', u_data_len - 1: 'ind',
                                  k - 1: 'label_ind'}))
        concat_start = timeit.default_timer()
        temp_indices = pd.concat([temp_indices,perm_],axis=0)
        concat_stop = timeit.default_timer()
        print('concat Time Completed at : ', concat_stop - concat_start)
    loop_stop = timeit.default_timer()
    print('Loop Time Completed at : ', loop_stop - loop_start)
    return temp_indices,user_split_indices

Call function that uses pandas:

temp_indices_pd = split_by_num_pandas(data,k=1)
concat Time Completed at :  0.00038189999941096175
concat Time Completed at :  0.0004867000006925082
Loop Time Completed at :  0.011297000000922708
Comments:
  • Running this on the entire dataset of (1311612, 60) takes 3089.5768801999984 secs, which is 51 min 29 s. Therefore, it is not fully optimized. – Sade May 03 '21 at 13:20
  • Even though using map reduces the time to 2083.1114619 secs on the entire dataset, I still believe that this function's speed can be optimized further. Therefore, I am still open to further suggestions. – Sade May 04 '21 at 10:05