Demo CSV file:
label1 label2 m1
0 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0000_1 0.000000
1 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0001_1 1.000000
2 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0002_1 1.000000
3 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0003_1 1.414214
4 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0004_1 2.000000
5 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0005_1 2.000000
6 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0006_1 3.000000
7 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0007_1 3.162278
8 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0008_1 4.000000
9 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0009_1 5.000000
10 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0010_1 5.000000
11 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0011_1 6.000000
12 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0012_1 6.000000
13 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0013_1 6.000000
14 KeyT1_L1_1_animebook0000_1 KeyT1_L1_1_animebook0014_1 6.000000
From this CSV file I will do a comparison operation: a function that compares combinations of rows and returns the minimum of the combinations.
There are 160000 rows, and using pandas with a for loop takes a lot of time. Can I make it faster using dask? I tried creating a dask dataframe from pandas, but when I call to_list, which works on a pandas column, it gives me an error. I have a Core i7 machine with 128 GB of RAM. Below is my code:
"""
#the purpose of this function is to calculate different rows...
#values for the m1 column of data frame. there could be two
#combinations and inside combination it needs to get m1 value for the row
#suppose first comb1 will calucalte sum of m1 value of #row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_1) and
#row(KeyT1_L1_1_animebook0000_1,KeyT1_L1_1_animebook0001_2)
a more details of this function could be found here:
(https://stackoverflow.com/questions/72663618/writing-a-python-function-to-get-desired-value-from-csv/72677299#72677299)
def compute(img1,img2):
comb1=(img1_1,img2_1)+(img1_1,img2_2)
comb2=(img1_2,img2_1)+(img1_2,img2_2)
return minimum(comb1,comb2)
"""
def min_4line(key1, key2, list1, list2, df):
    # list1 and list2 are currently unused; all lookups go through df.
    k = ['1', '2', '3', '4']
    key1_line1 = key1 + '_' + k[0]
    key1_line2 = key1 + '_' + k[1]
    key1_line3 = key1 + '_' + k[2]
    key1_line4 = key1 + '_' + k[3]
    key2_line1 = key2 + '_' + k[0]
    key2_line2 = key2 + '_' + k[1]
    key2_line3 = key2 + '_' + k[2]
    key2_line4 = key2 + '_' + k[3]
    # comb1: pair line i of key1 with line i of key2.
    ind1 = df.index[(df['label1'] == key1_line1) & (df['label2'] == key2_line1)].tolist()
    ind2 = df.index[(df['label1'] == key1_line2) & (df['label2'] == key2_line2)].tolist()
    ind3 = df.index[(df['label1'] == key1_line3) & (df['label2'] == key2_line3)].tolist()
    ind4 = df.index[(df['label1'] == key1_line4) & (df['label2'] == key2_line4)].tolist()
    # int() truncates the float m1 values.
    comb1 = int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1']) + int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1'])
    # comb2: pair line i+1 of key1 with line i of key2, wrapping around.
    ind1 = df.index[(df['label1'] == key1_line2) & (df['label2'] == key2_line1)].tolist()
    ind2 = df.index[(df['label1'] == key1_line3) & (df['label2'] == key2_line2)].tolist()
    ind3 = df.index[(df['label1'] == key1_line4) & (df['label2'] == key2_line3)].tolist()
    ind4 = df.index[(df['label1'] == key1_line1) & (df['label2'] == key2_line4)].tolist()
    comb2 = int(df.loc[ind1, 'm1']) + int(df.loc[ind2, 'm1']) + int(df.loc[ind3, 'm1']) + int(df.loc[ind4, 'm1'])
    return min(comb1, comb2)
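For reference, the eight boolean-mask lookups above each scan all 160000 rows. The same pairing can be done through a plain dictionary built once from the dataframe, which makes every lookup O(1). A minimal sketch (pair_m1 and min_4line_dict are my own names; it assumes each (label1, label2) pair occurs exactly once in the CSV):

# Build a (label1, label2) -> m1 lookup once from the dataframe.
pair_m1 = {(l1, l2): m for l1, l2, m in zip(df3['label1'], df3['label2'], df3['m1'])}

def min_4line_dict(key1, key2, pair_m1):
    # Same pairing scheme as min_4line, but with O(1) dictionary lookups.
    lines1 = [key1 + '_' + s for s in ('1', '2', '3', '4')]
    lines2 = [key2 + '_' + s for s in ('1', '2', '3', '4')]
    # comb1 pairs line i of key1 with line i of key2; int() truncates like the original.
    comb1 = sum(int(pair_m1[(lines1[i], lines2[i])]) for i in range(4))
    # comb2 pairs line i+1 (mod 4) of key1 with line i of key2.
    comb2 = sum(int(pair_m1[(lines1[(i + 1) % 4], lines2[i])]) for i in range(4))
    return min(comb1, comb2)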
Now I have to create a unique list of labels to do the comparison:
list_line = list(df3['label1'].unique())
string_test = [a[:-2] for a in list_line]
# strip the trailing '_1' so unique labels like animebook0000_1, animebook0001_1
# collapse into one key per book
list_key = sorted(list(set(string_test)))
print(len(list_key))
# make plain lists of the two label columns
label1_list = df3['label1'].to_list()
label2_list = df3['label2'].to_list()
Next, I will write the output of the comparison function to a CSV file:
%%time
file = open("content\\dummy_metric.csv", "a")
file.write("label1,label2,m1\n")
c = 0
for i in range(len(list_key)):
    for j in range(i + 1, len(list_key)):
        a = min_4line(list_key[i], list_key[j], label1_list, label2_list, df3)
        #print(a)
        file.write(str(list_key[i]) + "," + str(list_key[j]) + "," + str(a) + "\n")
        c += 1
        if c == 20000:
            print('20k done')
file.close()
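The same loop can also be written with itertools.combinations and a with block, so the file is always closed even if an error occurs; I believe this is equivalent (minus the progress message):

import csv
from itertools import combinations

# Enumerate every unordered key pair and write one result row per pair.
with open("content\\dummy_metric.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["label1", "label2", "m1"])
    for key_i, key_j in combinations(list_key, 2):
        writer.writerow([key_i, key_j, min_4line(key_i, key_j, label1_list, label2_list, df3)])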
My expected output:
label1 label2 m1
0 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0001 2
1 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0002 2
2 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0003 2
3 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0004 4
4 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0005 5
5 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0006 7
6 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0007 9
7 KeyT1_L1_1_animebook0000 KeyT1_L1_1_animebook0008 13
For dask I was proceeding like this:
import pandas as pd
import dask.dataframe as dd
csv_gb=pd.read_csv("content\\four_metric.csv")
dda = dd.from_pandas(csv_gb, npartitions=10)
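As an aside, I believe dask can also read the CSV directly and partition it itself, skipping the pandas round-trip. A sketch (the blocksize value is an arbitrary choice of mine):

# Let dask read and partition the CSV itself instead of going through pandas.
dda = dd.read_csv("content\\four_metric.csv", blocksize="64MB")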
Up to that line everything is fine, but when I want to build the label list like this:
label1_list = dda['label1'].to_list()
it shows me this error:
2022-07-05 16:31:17,530 - distributed.worker - WARNING - Compute Failed
Key: ('unique-combine-5ce843b510d3da88b71287e6839d3aa3', 0, 1)
Function: execute_task
args: ((<function pipe at 0x0000022E39F18160>, [0 KeyT1_L1_1_animebook0000_1
.....
25 KeyT1_L1_1_animebook_002
kwargs: {}
Exception: 'TypeError("\'Serialize\' object is not callable")'
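From the dask docs my understanding is that a dask column is lazy, so it has to be materialized with compute() before it can become a plain Python list; I assume the equivalent of the pandas line would be something like:

# Materialize the lazy dask column into a pandas Series, then into a list.
label1_list = dda['label1'].compute().tolist()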
Is there a better way to perform the above code with dask? I am also curious about using the dask distributed Client for my task, like this:
from dask.distributed import Client
client = Client(n_workers=3, threads_per_worker=1, processes=False, memory_limit='40GB')
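If that client works, my understanding is that the pair loop itself could then be split across workers with dask.delayed over batches of key pairs. A rough sketch under my own assumptions (it reuses the pair_m1 dict and min_4line_dict from above, and the batch size of 10000 is arbitrary):

import dask
from dask import delayed

@delayed
def batch_min(pairs, pair_m1):
    # Compute the comparison for one batch of key pairs.
    return [(k1, k2, min_4line_dict(k1, k2, pair_m1)) for k1, k2 in pairs]

# All unordered key pairs, split into batches of 10000.
all_pairs = [(list_key[i], list_key[j])
             for i in range(len(list_key))
             for j in range(i + 1, len(list_key))]
batches = [all_pairs[i:i + 10000] for i in range(0, len(all_pairs), 10000)]

# Run every batch in parallel on the client; results is a tuple of lists.
results = dask.compute(*[batch_min(b, pair_m1) for b in batches])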