
I have a simple piece of code that reads a csv file, looks for duplicates based on the first two columns, writes the duplicates to one csv and keeps the unique rows in another...

I am using a set:

import csv

def my_func():
    area = "W09"

    inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
    out  = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
    out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"



    #i = 0
    seen = set()  # (x, y) pairs already written to the unique-rows file

    with open(inf, 'r') as infile, open(out, 'w') as outfile1, open(out2, 'w') as outfile2:
        reader = csv.reader(infile, delimiter=" ")
        writer1 = csv.writer(outfile1, delimiter=" ")
        writer2 = csv.writer(outfile2, delimiter=" ")
        for row in reader:
            x, y = row[0], row[1]

            x = float(x)
            y = float(y)

            if (x, y) in seen:

                writer2.writerow(row)
                continue
            seen.add((x, y))
            writer1.writerow(row)



    seen.clear()

I thought that a set would be the best choice, but the set ends up roughly seven times the size of the input file (input files range from 140 MB to 50 GB of csv), and RAM usage goes from 1 GB to almost 400 GB (I am running this on a server with 768 GB of RAM).
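
For a rough sense of where that factor comes from, here is a back-of-the-envelope estimate of what each (x, y) entry costs in CPython (a hypothetical pair taken from the sample data below; exact sizes vary by Python version and build):

    import sys

    x, y = 475596.0, 101832.0      # hypothetical coordinate pair
    pair = (x, y)

    print(sys.getsizeof(pair))     # the tuple object itself, ~64 bytes on 64-bit builds
    print(2 * sys.getsizeof(x))    # plus two separate float objects, 24 bytes each
    # on top of that, each set entry needs a hash-table slot (hash + key pointer,
    # plus resize slack), roughly another 30-60 bytes per key on average,
    # so a ~20-byte text line can easily end up costing ~150 bytes in RAM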

I also ran the memory profiler on a small sample:

Line #    Mem usage    Increment   Line Contents

 8   21.289 MiB   21.289 MiB   @profile
 9                             def my_func():
10   21.293 MiB    0.004 MiB       area = "W10"
11
12   21.293 MiB    0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
13   21.293 MiB    0.000 MiB       out  = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
14   21.297 MiB    0.004 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"
15
16
17
18                                 #i = 0
19   21.297 MiB    0.000 MiB       seen = set()
20
21   21.297 MiB    0.000 MiB       with open(inf, 'r') as infile, open(out,'w') as outfile1, open(out2, 'w') as outfile2:
22   21.297 MiB    0.000 MiB           reader = csv.reader(infile, delimiter=" ")
23   21.297 MiB    0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
24   21.297 MiB    0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
25 1089.914 MiB   -9.008 MiB           for row in reader:
26 1089.914 MiB   -7.977 MiB               x, y = row[0], row[1]
27
28 1089.914 MiB   -6.898 MiB               x = float(x)
29 1089.914 MiB  167.375 MiB               y = float(y)
30
31 1089.914 MiB  166.086 MiB               if (x, y) in seen:
32                                             #z = line.split(" ",3)[-1]
33                                             #if z == "5284":
34                                             #    print X, Y, z
35
36 1089.914 MiB    0.004 MiB                   writer2.writerow(row)
37 1089.914 MiB    0.000 MiB                   continue
38 1089.914 MiB  714.102 MiB               seen.add((x, y))
39 1089.914 MiB   -9.301 MiB               writer1.writerow(row)
40
41
42
43  690.426 MiB -399.488 MiB       seen.clear()

What could be the issue? Is there a faster way to filter out the duplicates, or a way that uses less RAM?

Sample of the csv (it is a GeoTIFF converted into a csv file, so the columns are X Y Value):

    475596 101832 4926
    475626 101832 4926
    475656 101832 4926
    475686 101832 4926
    475716 101832 4926
    475536 101802 4926
    475566 101802 4926
    475596 101802 4926
    475626 101802 4926
    475656 101802 4926
    475686 101802 4926
    475716 101802 4926
    475746 101802 4926
    475776 101802 4926
    475506 101772 4926
    475536 101772 4926
    475566 101772 4926
    475596 101772 4926
    475626 101772 4926
    475656 101772 4926
    475686 101772 4926
    475716 101772 4926
    475746 101772 4926
    475776 101772 4926
    475806 101772 4926
    475836 101772 4926
    475476 101742 4926
    475506 101742 4926

EDIT: So I tried the solution offered by Jean: https://stackoverflow.com/a/49008391/9418396

The result is that on my small 140 MB csv the size of the set is now halved, which is a good improvement. I will try to run it on the bigger data and see what it does. I can't really tie it to the profiler, because the profiler prolongs the execution time enormously.

Line #    Mem usage    Increment   Line Contents

 8   21.273 MiB   21.273 MiB   @profile
 9                             def my_func():
10   21.277 MiB    0.004 MiB       area = "W10"
11
12   21.277 MiB    0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
13   21.277 MiB    0.000 MiB       out  = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
14   21.277 MiB    0.000 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"
15
16
17   21.277 MiB    0.000 MiB       seen = set()
18
19   21.277 MiB    0.000 MiB       with open(inf, 'r') as infile, open(out,'w') as outfile1, open(out2, 'w') as outfile2:
20   21.277 MiB    0.000 MiB           reader = csv.reader(infile, delimiter=" ")
21   21.277 MiB    0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
22   21.277 MiB    0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
23  451.078 MiB -140.355 MiB           for row in reader:
24  451.078 MiB -140.613 MiB               hash = float(row[0])*10**7 + float(row[1])
25                                         #x, y = row[0], row[1]
26
27                                         #x = float(x)
28                                         #y = float(y)
29
30                                         #if (x, y) in seen:
31  451.078 MiB   32.242 MiB               if hash in seen:
32  451.078 MiB    0.000 MiB                   writer2.writerow(row)
33  451.078 MiB    0.000 MiB                   continue
34  451.078 MiB   78.500 MiB               seen.add((hash))
35  451.078 MiB -178.168 MiB               writer1.writerow(row)
36
37  195.074 MiB -256.004 MiB       seen.clear()
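
A rough explanation of why the peak only roughly halves rather than shrinking further: the key itself goes from a tuple plus two floats down to a single float, but the set's own hash-table slot (hash plus key pointer, plus resize slack) costs the same for either key type. A minimal sketch of the per-key payload, assuming 64-bit CPython (exact sizes vary by version):

    import sys

    x, y = 475596.0, 101832.0     # hypothetical coordinate pair
    old_key = (x, y)              # what the original code stored
    new_key = x * 10**7 + y       # the packed float from Jean's answer

    print(sys.getsizeof(old_key) + 2 * sys.getsizeof(x))  # tuple + two floats, ~112 bytes
    print(sys.getsizeof(new_key))                          # one float, 24 bytes
    # the ~30-60 bytes of set slot overhead per entry is identical in both cases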
Jan Doležal
  • Why are you converting the first two columns to `float`? Would they be better off as `int`? – quamrana Feb 27 '18 at 11:50
  • Because some files have coordinates with decimal values, let's say 2345641.5 – Jan Doležal Feb 27 '18 at 11:54
  • Possible duplicate of [Python very large set. How to avoid out of memory exception?](https://stackoverflow.com/questions/15555728/python-very-large-set-how-to-avoid-out-of-memory-exception) – mkrieger1 Feb 27 '18 at 12:04
  • You can save significant time by putting seen.add((x, y)) in an else statement. The value should be added to the set only when it is not already present; adding a value to a set is a costly operation. – gautamaggarwal Feb 27 '18 at 12:04
  • @gautamaggarwal `seen.add` is effectively in an `else` because of the `continue`. Also the issue here is not time, but memory. – mkrieger1 Feb 27 '18 at 12:06
  • Gosh... My bad... I missed it. Also maybe you could use bulk writing to the files? https://stackoverflow.com/questions/4961589/speed-up-writing-to-files. I agree time is not the issue, was just suggesting some optimizations. – gautamaggarwal Feb 27 '18 at 12:08

1 Answer


You could create your own hash function to avoid storing the tuple of floats, and instead store a single float value that combines the two floats in a unique way.

Let's say that coordinates cannot exceed 10 million (maybe you could even go down to 1 million); then you could do:

hash = x*10**7 + y

(this performs a kind of logical "OR" on your floats, and since values are limited, there's no mixing up between x and y)
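
As a quick sanity check that the packing really keeps the two values apart (a minimal sketch, assuming integer coordinates below 10**7 as in the sample data):

>>> x, y = 475596.0, 101832.0
>>> packed = x * 10**7 + y
>>> divmod(packed, 10**7)   # integer coordinates round-trip exactly, so distinct pairs get distinct hashes
(475596.0, 101832.0)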

Then put hash in your set instead of a tuple of floats. There's no risk of float absorption at values around 10**14, so it'd be worth a try:

>>> 10**14+1.5
100000000000001.5

The loop then becomes:

    for row in reader:
        hash = float(row[0])*10**7 + float(row[1])

        if hash in seen:
            writer2.writerow(row)
            continue
        seen.add(hash)
        writer1.writerow(row)

One float, even a big one (the size of a float is fixed), is at least 2 or 3 times smaller in memory than a tuple of 2 floats. On my machine:

>>> sys.getsizeof((0.44,0.2))
64
>>> sys.getsizeof(14252362*10**7+35454555.0)
24
Jean-François Fabre