
Demo CSV file:


      label1                              label2            m1
0   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0000_1  0.000000
1   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0001_1  1.000000
2   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0002_1  1.000000
3   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0003_1  1.414214
4   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0004_1  2.000000
5   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0005_1  2.000000
6   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0006_1  3.000000
7   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0007_1  3.162278
8   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0008_1  4.000000
9   KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0009_1  5.000000
10  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0010_1  5.000000
11  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0011_1  6.000000
12  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0012_1  6.000000
13  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0013_1  6.000000
14  KeyT1_L1_1_animebook0000_1  KeyT1_L1_1_animebook0014_1  6.000000

From this CSV file I run a comparison operation: a function compares pairs of keys and returns the minimum of two combinations.
There are 160,000 rows, and doing this with pandas and a plain for loop takes a lot of time. Can I make it faster using dask? I tried creating a dask dataframe from the pandas one, but when I use to_list, which works on a pandas column, it gives me an error. I have a Core i7 machine and 128 GB of RAM. Below is my code:

"""
#the purpose of this function is to calculate different rows...
#values for the m1 column of data frame. there could be two 
#combinations and inside combination it needs to get m1 value for the row
#suppose first comb1 will calucalte sum of m1 value of #row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_1) and 
#row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_2)
a more details of this function could be found here: 
(https://stackoverflow.com/questions/72663618/writing-a-python-function-to-get-desired-value-from-csv/72677299#72677299)

def compute(img1,img2):
   comb1=(img1_1,img2_1)+(img1_1,img2_2)
   comb2=(img1_2,img2_1)+(img1_2,img2_2)
   return minimum(comb1,comb2)
"""
def min_4line(key1, key2, list1, list2, df):
    # Build the four line labels for each key, e.g. <key>_1 ... <key>_4
    k = ['1', '2', '3', '4']
    key1_line1 = key1 + '_' + k[0]
    key1_line2 = key1 + '_' + k[1]
    key1_line3 = key1 + '_' + k[2]
    key1_line4 = key1 + '_' + k[3]
    key2_line1 = key2 + '_' + k[0]
    key2_line2 = key2 + '_' + k[1]
    key2_line3 = key2 + '_' + k[2]
    key2_line4 = key2 + '_' + k[3]

    # comb1: pair line i of key1 with line i of key2
    ind1 = df.index[(df['label1'] == key1_line1) & (df['label2'] == key2_line1)].tolist()
    ind2 = df.index[(df['label1'] == key1_line2) & (df['label2'] == key2_line2)].tolist()
    ind3 = df.index[(df['label1'] == key1_line3) & (df['label2'] == key2_line3)].tolist()
    ind4 = df.index[(df['label1'] == key1_line4) & (df['label2'] == key2_line4)].tolist()
    comb1 = (int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1'])
             + int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1']))

    # comb2: pair line i+1 of key1 (wrapping around) with line i of key2
    ind1 = df.index[(df['label1'] == key1_line2) & (df['label2'] == key2_line1)].tolist()
    ind2 = df.index[(df['label1'] == key1_line3) & (df['label2'] == key2_line2)].tolist()
    ind3 = df.index[(df['label1'] == key1_line4) & (df['label2'] == key2_line3)].tolist()
    ind4 = df.index[(df['label1'] == key1_line1) & (df['label2'] == key2_line4)].tolist()
    comb2 = (int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1'])
             + int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1']))

    return min(comb1, comb2)
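
For comparison, here is a minimal sketch (not the code I am running; the names m1_lookup and min_4line_fast are only illustrative) of how the same four-row lookups could be done against a dictionary built once from the dataframe instead of filtering df.index on every call. It assumes every (label1, label2) pair the function asks for actually exists in the CSV:

# Sketch only: build a (label1, label2) -> m1 lookup once, then do O(1) lookups.
m1_lookup = dict(zip(zip(df3['label1'], df3['label2']), df3['m1']))

def min_4line_fast(key1, key2, lookup):
    lines1 = [key1 + '_' + str(i) for i in range(1, 5)]
    lines2 = [key2 + '_' + str(i) for i in range(1, 5)]
    # comb1 pairs line i of key1 with line i of key2; comb2 shifts key1 by one
    # line (wrapping around), mirroring the index pairs used in min_4line above.
    comb1 = sum(int(lookup[(lines1[i], lines2[i])]) for i in range(4))
    comb2 = sum(int(lookup[(lines1[(i + 1) % 4], lines2[i])]) for i in range(4))
    return min(comb1, comb2)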


Now I have to create a unique list of label keys to do the comparison:

list_line = list(df3['label1'].unique())
string_test = [a[:-2] for a in list_line]
# the slice above strips the trailing "_<line>" suffix, since the unique labels
# look like KeyT1_L1_1_animebook0000_1, KeyT1_L1_1_animebook0001_1, ...
list_key = sorted(list(set(string_test)))
print(len(list_key))
# making lists of those two columns
label1_list = df3['label1'].to_list()
label2_list = df3['label2'].to_list()
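
For reference, the same key list could probably be built with vectorized string operations instead of the list comprehension above (a sketch only, assuming the suffix to strip is always the last two characters, e.g. "_1"):

# Sketch only: vectorized equivalent of the slicing/set/sort steps above.
list_key = sorted(df3['label1'].str[:-2].unique())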

Next, I will write the output of the comparison function to a CSV file:

%%time
file = open("content\\dummy_metric.csv", "a")
file.write("label1,label2,m1\n")
c = 0
for i in range(len(list_key)):
    for j in range(i + 1, len(list_key)):
        a = min_4line(list_key[i], list_key[j], label1_list, label2_list, df3)
        file.write(str(list_key[i]) + "," + str(list_key[j]) + "," + str(a) + "\n")
        c += 1
        if c % 20000 == 0:
            print(f'{c} comparisons done')
file.close()
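
As a rough sketch of how the same pairwise loop could look with itertools.combinations and a single write at the end (it reuses the illustrative m1_lookup and min_4line_fast from the sketch above, so it is an assumption rather than my current code):

# Sketch only: iterate over all unordered key pairs and write the CSV in one go.
from itertools import combinations
import pandas as pd

rows = [(k1, k2, min_4line_fast(k1, k2, m1_lookup))
        for k1, k2 in combinations(list_key, 2)]
pd.DataFrame(rows, columns=['label1', 'label2', 'm1']).to_csv(
    "content\\dummy_metric.csv", index=False)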

My expected output:


     label1                          label2                m1
0   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0001    2
1   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0002    2
2   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0003    2
3   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0004    4
4   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0005    5
5   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0006    7
6   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0007    9
7   KeyT1_L1_1_animebook0000    KeyT1_L1_1_animebook0008    13

For dask I was proceeding like this:

import pandas as pd
import dask.dataframe as dd

csv_gb = pd.read_csv("content\\four_metric.csv")
dda = dd.from_pandas(csv_gb, npartitions=10)

Up to that line it is fine, but when I try to build the list of labels in the same way:

label1_list = dda['label1'].to_list()

it shows me this error:

    2022-07-05 16:31:17,530 - distributed.worker - WARNING - Compute Failed
Key:       ('unique-combine-5ce843b510d3da88b71287e6839d3aa3', 0, 1)
Function:  execute_task
args:      ((<function pipe at 0x0000022E39F18160>, [0     KeyT1_L1_1_animebook0000_1
.....
25    KeyT1_L1_1_animebook_002
kwargs:    {}
Exception: 'TypeError("\'Serialize\' object is not callable")'
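
For reference, a minimal sketch of how the labels could be materialized from the dask dataframe: dask columns are lazy, so they have to be computed back into pandas objects before converting to a list (this illustrates the general pattern and is not something I have benchmarked):

# Sketch only: compute the lazy dask series into pandas, then convert to lists.
label1_list = dda['label1'].compute().to_list()
unique_labels = dda['label1'].unique().compute().tolist()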

Is there a better way to perform the above code with dask? I am also curious about using dask.distributed for my task, like this:

from dask.distributed import Client
client = Client(n_workers=3, threads_per_worker=1, processes=False, memory_limit='40GB')
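
For context, a rough sketch of the pattern I understand is typical with a distributed Client: let dask read the CSV itself and call .compute() only when a concrete result is needed (the file name matches the one above; the rest is illustrative):

# Sketch only: read with dask directly and compute results on the local cluster.
from dask.distributed import Client
import dask.dataframe as dd

client = Client(n_workers=3, threads_per_worker=1, memory_limit='40GB')
ddf = dd.read_csv("content\\four_metric.csv")
unique_labels = ddf['label1'].unique().compute()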

Comments:

  • You're basically not using pandas at all here with all of these custom loops and list comprehensions... for loops are almost always the worst option when it comes to pandas. – BeRT2me Jul 07 '22 at 00:40
  • Given your example input, what would your example output look like? What is it you're actually trying to do? – BeRT2me Jul 07 '22 at 00:46
  • @BeRT2me, thanks for pointing that out, I have edited the question and shown the output that the code generates. I am querying the pandas dataframe inside a function that is called in the loop; can you propose a better way to do it? – neo Jul 07 '22 at 02:03
  • That doesn't really explain what you're doing. In words, what is `min_4line` supposed to do? – BeRT2me Jul 07 '22 at 02:11
  • @BeRT2me, hi, I have edited in the purpose of the function: it looks up the values of particular rows by index and does some operations on them. Let me know whether this explanation is clear enough. – neo Jul 07 '22 at 03:42
  • 2
    Neo 160k rows is not large enough to warrant using dask. The problem with your code is the for loops and manual indexing. Rather than using dask, you should focus on improving your code with vectorized operations. I’d recommend finding a good online class on pandas and really learning how to use the library. – Michael Delgado Jul 07 '22 at 03:56
  • @MichaelDelgado, hi, the thing is it could also be 160,000k rows. What can I do then, with that kind of huge CSV? Wouldn't dask still be a better choice? – neo Jul 07 '22 at 13:59
  • No probably not. Dask recommends partition sizes of a couple hundred MBs. So when you get to having hundreds of millions or billions of cells, then using dask starts to make sense. But even then, step one would be to optimize your pandas code first. You could not loop over elements of your dask dataframe like this - that would be even slower than looping over elements of a pandas df. Dask ops need to be vectorized too. See the [dask best practices guide](https://docs.dask.org/en/stable/best-practices.html) - their first recommendation is to not use dask and just write better pandas code. – Michael Delgado Jul 07 '22 at 14:25
  • @MichaelDelgado thanks for the answer, do you have any other suggestions for where I can find help with this? I understand your point, but I am kind of stuck here and couldn't make any progress on optimizing it. – neo Jul 07 '22 at 15:53
  • there are so many guides to pandas vectorization out there :) here's one https://towardsdatascience.com/data-science-coding-mistakes-and-best-practices-part-1-f7511cf573f7 – Michael Delgado Jul 07 '22 at 16:02
  • this is going to take time and you're going to have to invest in this learning. there's no quick single lesson. you need to learn the pandas API and best practices, then rethink how you're approaching the problem from scratch. it probably won't be worth it for this task/assignment/project but it'll pay off big time in the long run. – Michael Delgado Jul 07 '22 at 16:04
  • for this specific question, everyone is having a hard time understanding what you're trying to do. there's too much going on in there, and your descriptions are too general. you say "the purpose of this function is to calculate different rows" - but that's the purpose of literally every function in pandas. boil down your question to a single operation you want to speed up and ask about that. 1 at a time, and try to create a problem we can work with too. see the guide to creating a [mre] and [how to make good reproducible pandas examples](/q/20109391) – Michael Delgado Jul 07 '22 at 16:07
