I have a simple piece of code that reads a CSV file, looks for duplicates based on the first two columns, writes the duplicates to a second CSV, and keeps the unique values in a third CSV.
I am using a set:
import csv

def my_func():
    area = "W09"
    inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
    out = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
    out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"
    #i = 0
    seen = set()

    with open(inf, 'r') as infile, open(out, 'w') as outfile1, open(out2, 'w') as outfile2:
        reader = csv.reader(infile, delimiter=" ")
        writer1 = csv.writer(outfile1, delimiter=" ")
        writer2 = csv.writer(outfile2, delimiter=" ")
        for row in reader:
            x, y = row[0], row[1]
            x = float(x)
            y = float(y)
            if (x, y) in seen:
                # coordinate pair already seen -> duplicates file
                writer2.writerow(row)
                continue
            seen.add((x, y))
            # first occurrence -> unique values file
            writer1.writerow(row)
    seen.clear()
I thought a set would be the best choice, but the set ends up roughly seven times the size of the input file (input files range from 140 MB to 50 GB of CSV), with RAM usage from 1 GB to almost 400 GB (I am using a server with 768 GB of RAM).
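For scale, a rough back-of-the-envelope check (assuming 64-bit CPython; exact sizes vary by version) suggests that blow-up is expected: each boxed float is ~24 bytes, the 2-tuple wrapper adds ~56 bytes more, and the set's hash table keeps its own per-slot overhead, so a row that is ~18 bytes of text can cost well over 100 bytes as a (x, y) entry. A minimal sketch to measure this yourself:

import sys

x, y = 475596.0, 101832.0

# Each boxed float and the tuple wrapper are separate heap objects:
print(sys.getsizeof(x))        # ~24 bytes per float on 64-bit CPython
print(sys.getsizeof((x, y)))   # ~56 bytes for the 2-tuple, not counting the floats

# The set's internal hash table adds its own per-slot cost on top:
seen = {(float(i), float(i + 1)) for i in range(10**6)}
print(sys.getsizeof(seen))     # table only; the million tuples are extra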
I also ran a memory profiler on a small sample:
Line #    Mem usage     Increment   Line Contents
     8     21.289 MiB    21.289 MiB   @profile
     9                                def my_func():
    10     21.293 MiB     0.004 MiB       area = "W10"
    11
    12     21.293 MiB     0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
    13     21.293 MiB     0.000 MiB       out = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
    14     21.297 MiB     0.004 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"
    15
    16
    17
    18                                    #i = 0
    19     21.297 MiB     0.000 MiB       seen = set()
    20
    21     21.297 MiB     0.000 MiB       with open(inf, 'r') as infile, open(out,'w') as outfile1, open(out2, 'w') as outfile2:
    22     21.297 MiB     0.000 MiB           reader = csv.reader(infile, delimiter=" ")
    23     21.297 MiB     0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
    24     21.297 MiB     0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
    25   1089.914 MiB    -9.008 MiB           for row in reader:
    26   1089.914 MiB    -7.977 MiB               x, y = row[0], row[1]
    27
    28   1089.914 MiB    -6.898 MiB               x = float(x)
    29   1089.914 MiB   167.375 MiB               y = float(y)
    30
    31   1089.914 MiB   166.086 MiB               if (x, y) in seen:
    32                                                #z = line.split(" ",3)[-1]
    33                                                #if z == "5284":
    34                                                #    print X, Y, z
    35
    36   1089.914 MiB     0.004 MiB                   writer2.writerow(row)
    37   1089.914 MiB     0.000 MiB                   continue
    38   1089.914 MiB   714.102 MiB               seen.add((x, y))
    39   1089.914 MiB    -9.301 MiB               writer1.writerow(row)
    40
    41
    42
    43    690.426 MiB  -399.488 MiB       seen.clear()
What could be the issue? Is there a faster way to filter out the duplicates, or one that uses less RAM?
Sample of the CSV (it is a GeoTIFF converted to CSV, so the columns are X, Y, Value):
475596 101832 4926
475626 101832 4926
475656 101832 4926
475686 101832 4926
475716 101832 4926
475536 101802 4926
475566 101802 4926
475596 101802 4926
475626 101802 4926
475656 101802 4926
475686 101802 4926
475716 101802 4926
475746 101802 4926
475776 101802 4926
475506 101772 4926
475536 101772 4926
475566 101772 4926
475596 101772 4926
475626 101772 4926
475656 101772 4926
475686 101772 4926
475716 101772 4926
475746 101772 4926
475776 101772 4926
475806 101772 4926
475836 101772 4926
475476 101742 4926
475506 101742 4926
EDIT: So I tried the solution offered by Jean: https://stackoverflow.com/a/49008391/9418396
The result is that on my small 140 MB CSV the size of the set is now halved, which is a good improvement. I will try to run it on the bigger data and see what it does. I can't really tie it to the profiler, because the profiler increases the execution time by a huge amount.
Line #    Mem usage     Increment   Line Contents
     8     21.273 MiB    21.273 MiB   @profile
     9                                def my_func():
    10     21.277 MiB     0.004 MiB       area = "W10"
    11
    12     21.277 MiB     0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
    13     21.277 MiB     0.000 MiB       out = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
    14     21.277 MiB     0.000 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"
    15
    16
    17     21.277 MiB     0.000 MiB       seen = set()
    18
    19     21.277 MiB     0.000 MiB       with open(inf, 'r') as infile, open(out,'w') as outfile1, open(out2, 'w') as outfile2:
    20     21.277 MiB     0.000 MiB           reader = csv.reader(infile, delimiter=" ")
    21     21.277 MiB     0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
    22     21.277 MiB     0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
    23    451.078 MiB  -140.355 MiB           for row in reader:
    24    451.078 MiB  -140.613 MiB               hash = float(row[0])*10**7 + float(row[1])
    25                                            #x, y = row[0], row[1]
    26
    27                                            #x = float(x)
    28                                            #y = float(y)
    29
    30                                            #if (x, y) in seen:
    31    451.078 MiB    32.242 MiB               if hash in seen:
    32    451.078 MiB     0.000 MiB                   writer2.writerow(row)
    33    451.078 MiB     0.000 MiB                   continue
    34    451.078 MiB    78.500 MiB               seen.add((hash))
    35    451.078 MiB  -178.168 MiB               writer1.writerow(row)
    36
    37    195.074 MiB  -256.004 MiB       seen.clear()
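For reference, here is the reworked loop pulled out of the profiler dump above into a standalone sketch (the function name and parameters are mine, not from the original). The idea is to pack both coordinates into a single float key, so each set entry is one heap object instead of a tuple plus two boxed floats. Note the 10**7 multiplier assumes y is always below 10 000 000; otherwise two distinct points could collide on the same key.

import csv

def dedup(inf, out, out2):
    seen = set()
    with open(inf, 'r') as infile, open(out, 'w') as outfile1, open(out2, 'w') as outfile2:
        reader = csv.reader(infile, delimiter=" ")
        writer1 = csv.writer(outfile1, delimiter=" ")
        writer2 = csv.writer(outfile2, delimiter=" ")
        for row in reader:
            # Pack (x, y) into a single float: one object per entry
            # instead of three. Safe only while y < 10**7.
            key = float(row[0]) * 10**7 + float(row[1])
            if key in seen:
                writer2.writerow(row)
                continue
            seen.add(key)
            writer1.writerow(row)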