I am new to Python. I have a program that loads one large CSV file with over 100k lines, each with 4 columns. In a for loop I check every row against the same list (dlist) to find duplicates. dlist is a list of objects of the DsRef class, which I load with another function.
DsRef class:
from tqdm import tqdm
from multiprocessing import Pool, cpu_count, freeze_support

class DsRef:
    def __init__(self, pn, comp, comp_name, type, diff):
        self.pn = pn
        self.comp = comp
        self.comp_name = comp_name
        self.type = type
        self.diff = diff

    def __str__(self):
        return f'{self.pn} {get_red("|")} {self.comp} {get_red("|")} {self.comp_name} {get_red("|")} {self.type} {get_red("|")} {self.diff}\n'

    def __repr__(self):
        return str(self)

    def __iter__(self):
        return iter(self.__dict__.items())
Duplication class:
class Duplication:
    def __init__(self, pn, comp, cnt):
        self.pn = pn
        self.comp = comp
        self.cnt = cnt

    def __str__(self):
        return f'{self.pn};{self.comp};{self.cnt}\n'

    def __repr__(self):
        return str(self)

    def __hash__(self):
        return hash(('pn', self.pn, 'comp', self.comp))

    def __eq__(self, other):
        return self.pn == other.pn and self.comp == other.comp
Sample data for testing (in place of the loaded file):
dlist = []
dlist.append(DsRef(
    "TTT_XXX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "TTT_XCX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "TTT_XXX", "CCC_VCV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "TTT_XXX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "TTT_XYX", "CCC_YYY", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "TAT_XQX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "ATT_XXX", "CCC_VQV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "TTT_EEE", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "TTT_XWX", "CCC_VVV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "TTT_XXX", "CCC_VWV", "CTYPE", "CTYPE", "text"))
dlist.append(DsRef(
    "TTT_EEE", "CCC_VVV", "CTYPE", "CTYPE", "text"))
Function that should find and return the rows with duplicated values:
def FindDuplications(dlist):
    duplicates = []
    for pn, comp in enumerate(dlist):
        matches = [xpn for xpn, xcomp in enumerate(dlist) if pn == xpn and comp == xcomp]
        duplicates.append(Duplication(pn, comp, len(matches)))
    return duplicates
The condition for a duplicate is:
row.pn == x.pn and row.comp == x.comp
If it is true, I have found a duplicate. In other words, I compare the first two attributes (pn and comp) of each object with those of every object in the list.
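The pairwise comparison described above amounts to counting how often each (pn, comp) pair occurs. The following is a minimal sketch of that idea, not the code from the original program: it uses plain tuples instead of DsRef objects, the helper name count_pairs is hypothetical, and it relies only on collections.Counter from the standard library. Each row is visited once, so the work grows linearly with the number of rows instead of quadratically.

```python
from collections import Counter


def count_pairs(rows):
    """Return {(pn, comp): count} for every pair that occurs more than once.

    `rows` is assumed to be an iterable of (pn, comp) tuples; with DsRef
    objects the key would be (row.pn, row.comp) instead.
    """
    counts = Counter(rows)  # a single pass over the data
    return {key: cnt for key, cnt in counts.items() if cnt > 1}


# Hypothetical sample mirroring the first two fields of some test rows.
sample = [
    ("TTT_XXX", "CCC_VVV"),
    ("TTT_XCX", "CCC_VVV"),
    ("TTT_XXX", "CCC_VCV"),
    ("TTT_XXX", "CCC_VVV"),
    ("TTT_EEE", "CCC_VVV"),
    ("TTT_EEE", "CCC_VVV"),
]
duplicated = count_pairs(sample)
```

With 100k rows, the nested comparison performs on the order of 100k x 100k checks, which may well account for most of the 15 minutes.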
Now I am trying to use something like the following to put all processor cores to work and get a faster result. It currently takes over 15 minutes:
if __name__ == '__main__':
    freeze_support()
    p = Pool(cpu_count())
    duplicates = p.map(FindDuplications, dlist)
    p.close()
    p.join()
At first I got an error saying that the class is not iterable, so I added an __iter__ method to the first class. After that I got an error saying that a tuple object has no pn or comp attribute, so I switched the for loop to enumerate(dlist), but it still does not work.
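These errors are consistent with how map works: Pool.map(func, iterable), like the built-in map, calls func once for each element of the iterable, so FindDuplications receives a single DsRef object rather than the whole list. The sketch below illustrates this with the built-in map and a hypothetical stand-in class Row, so it runs without starting any processes. The names Row and describe are assumptions made only for this example.

```python
class Row:
    """Hypothetical stand-in for DsRef with only two fields."""

    def __init__(self, pn, comp):
        self.pn = pn
        self.comp = comp

    def __iter__(self):
        # Same idea as DsRef.__iter__: yields (attribute name, value) tuples.
        return iter(self.__dict__.items())


def describe(arg):
    """Report what the mapped function actually receives."""
    return type(arg).__name__, list(enumerate(arg))


rows = [Row("TTT_XXX", "CCC_VVV"), Row("TTT_XCX", "CCC_VVV")]

# map passes ONE element per call, not the whole list.
received = list(map(describe, rows))
first_type, first_items = received[0]
```

Here first_type is "Row", and iterating over that single object yields (name, value) tuples, which would explain why a tuple without pn or comp attributes shows up inside the loop.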
Could you please help me?
I would also like to use tqdm to show the progress of the function that finds the duplicates.
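One possible way to combine a process pool with a progress bar is to split the keys into chunks, count each chunk in a worker, and wrap the result iterator in tqdm. The following is only a sketch under assumptions: the helper names count_chunk, split, merge and parallel_counts are hypothetical, the input is assumed to be a list of (pn, comp) tuples such as [(r.pn, r.comp) for r in dlist], and a no-op fallback replaces tqdm if it is not installed.

```python
from collections import Counter
from multiprocessing import Pool, cpu_count

try:
    from tqdm import tqdm
except ImportError:  # fallback so the sketch still runs without tqdm
    def tqdm(iterable, **kwargs):
        return iterable


def count_chunk(chunk):
    """Count (pn, comp) keys in one chunk; runs in a worker process."""
    return Counter(chunk)


def split(keys, size):
    """Cut the key list into consecutive chunks of at most `size` items."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]


def merge(partials):
    """Add up the per-chunk counters into one total."""
    total = Counter()
    for part in partials:
        total.update(part)  # Counter.update adds counts together
    return total


def parallel_counts(keys, chunk_size=10_000):
    """Count keys using all CPU cores, with one progress tick per chunk."""
    chunks = split(keys, chunk_size)
    with Pool(cpu_count()) as pool:
        results = pool.imap_unordered(count_chunk, chunks)
        # Consume the results inside the `with` block, before the pool closes.
        return merge(tqdm(results, total=len(chunks)))
```

parallel_counts would be called under the if __name__ == '__main__': guard, as in the attempt above. Whether it is actually faster than a single-process count is not certain: for about 100k rows, the cost of starting processes and pickling the chunks may outweigh the gain.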
Here is the original working function, which does not use multiprocessing:
def CheckDuplications(dlist):
    print(get_yellow("========= CHECK CROSS DUPLICATIONS ========="))
    duplicates = []
    for r in tqdm(dlist):
        matches = [x for x in dlist if r.pn == x.pn and r.comp == x.comp]
        duplicates.append(Duplication(r.pn, r.comp, len(matches)))
    results = [d for d in duplicates if d.cnt > 1]
    results = set(results)
    return results
From FindDuplications I get back a list of DsRef objects (a simple copy), but it should return a list of Duplication objects, so something is wrong.
Thank you