
My process got Killed after running this code for a while.

Part one of the code is:

import sys

def load_data(distance_file):
    distance = {}
    min_dis, max_dis = sys.float_info.max, 0.0
    num = 0
    with open(distance_file, 'r', encoding='utf-8') as infile:
        for line in infile:
            content = line.strip().split()

            assert len(content) == 3
            idx1, idx2, dis = int(content[0]), int(content[1]), float(content[2])
            num = max(num, idx1, idx2)
            min_dis = min(min_dis, dis)
            max_dis = max(max_dis, dis)
            distance[(idx1, idx2)] = dis
            distance[(idx2, idx1)] = dis
        for i in range(1, num + 1):
            distance[(i, i)] = 0.0
        # no need to close the file explicitly; the with statement closes it automatically

    return distance, num, max_dis, min_dis

EDIT: I tried this solution:

bigfile = open(folder, 'r')
tmp_lines = bigfile.readlines(1024)
while tmp_lines:
    for line in tmp_lines:
        i, j, dis = line.strip().split()
        i, j, dis = int(i), int(j), float(dis)
        distance[(i, j)] = dis
        distance[(j, i)] = dis
        max_pt = max(i, j, max_pt)
    tmp_lines = bigfile.readlines(1024)  # read the next chunk of lines
for num in range(1, max_pt + 1):
    distance[(num, num)] = 0
return distance, max_pt

but got this error

   gap = distance[(i, j)] - threshold
KeyError: (1, 2)

from this method

def CutOff(self, distance, max_id, threshold):
        '''
        :rtype: list with Cut-off kernel values by desc
        '''
        cut_off = dict()
        for i in range(1, max_id + 1):
            tmp = 0
            for j in range(1, max_id + 1):
                gap = distance[(i, j)] - threshold
                print(gap)
                tmp += 0 if gap >= 0 else 1
            cut_off[i] = tmp
        sorted_cutoff = sorted(cut_off.items(), key=lambda k:k[1], reverse=True)
        return sorted_cutoff

I used print(gap) to see why this problem appeared, and got this value: -0.3

The rest of the code is here.

I have a file that contains 20,000 lines, and the code stopped at

['2686', '13856', '64.176689']
Killed

How can I change the code so it can handle more lines? Can I increase the available memory, and how? Or does the code itself need to change, for example storing the distances in a file instead of keeping them all in memory?
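For example, is something like this roughly what storing the distances in a file would look like? (This is only a sketch of the idea, I have not run it; the file names and the size are made up.)

import numpy as np

# Sketch only: keep the pairwise distances in a disk-backed array instead of a
# dict in RAM, so the OS can page the data in and out as needed.
n = 20000  # assumed upper bound on the largest ID
dist = np.memmap('distance.dat', dtype='float32', mode='w+', shape=(n + 1, n + 1))

with open('distance_file.txt', 'r', encoding='utf-8') as infile:
    for line in infile:
        i, j, d = line.strip().split()
        i, j, d = int(i), int(j), float(d)
        dist[i, j] = d
        dist[j, i] = d

dist.flush()  # write the values out to the file on disk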

I used dmesg and got

Out of memory: Killed process 24502 (python) total-vm:19568804kB, anon-rss:14542148kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:31232kB oom_score_adj:0

[  pid  ]   uid  tgid  total_vm      rss  pgtables_bytes  swapents  oom_score_adj  name
[  24502]  1000 24502   4892200  3585991        33763328    579936              0  python
user1
  • Welcome to the world of limited computing resources. Try coding something useful in 8KB. – TomServo Feb 06 '21 at 22:43
  • it worked with a smaller size, but the actual data I need is around 2.7 GB :-( – user1 Feb 06 '21 at 22:55
  • Yes indeed, I know your pain. Pandas and numpy are also notorious for being memory hogs. They are optimized for speed at the cost of size. You need to search for "out-of-core" solutions. At work my team used dask to process 600GB sets. – TomServo Feb 07 '21 at 00:16
  • I will, but can I use Google Colab for this problem? – user1 Feb 07 '21 at 00:30
  • I have no experience with that platform. My team used a 32-instance AWS EC2 cluster. Enough about that, this is all off-topic. – TomServo Feb 07 '21 at 01:46
  • Check out [this answer](https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python) since it doesn't look like any of your lines depend on any of the other lines. – d_kennetz Feb 08 '21 at 15:04
  • I tried one of their solutions and didn't get killed after some time, but got another error, in def CutOff(self, distance, max_id, threshold) – user1 Feb 08 '21 at 20:19
  • can you see the edited post please? – user1 Feb 08 '21 at 20:22
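For reference, a minimal sketch of the "out-of-core" approach suggested in the comments above, using dask; the file path and column names are placeholders, not code from the original post:

import dask.dataframe as dd

# Sketch only: parse the whitespace-separated "i j distance" file in blocks,
# so the whole file never has to sit in RAM at once.
df = dd.read_csv('distance_file.txt', sep=r'\s+', header=None,
                 names=['i', 'j', 'dist'])

# Aggregations are computed block by block.
num = max(df['i'].max().compute(), df['j'].max().compute())
min_dis = df['dist'].min().compute()
max_dis = df['dist'].max().compute()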

3 Answers


On a Linux system, check the output of dmesg. If the process is getting killed by the kernel, there will be an explanation there. The most probable reason: out of memory.

Belhadjer Samir
  • my laptop has CUDA.. can I use it to handle a bigger size? – user1 Feb 06 '21 at 23:38
  • this probably isn't what you're looking for. GPUs are good for doing matrix manipulation, not the explicit handling of large files. For that you really need just more memory or some method of handling it in chunks. – Belhadjer Samir Feb 06 '21 at 23:57
  • so if I buy extra RAM, like 16 GB, will that be a solution? – user1 Feb 07 '21 at 00:02

One reason you might hit a memory limit is this line in your auto_select_dc function:

   neighbor_percent = sum([1 for value in distance.values() if value < dc]) / num ** 2

The list comprehension inside sum() builds a full list in memory before it is summed; if your dictionary has a lot of entries, that list can get very big. A possible solution is to use a generator expression instead, which lets you iterate over the values with much less memory usage (note that dict.iteritems() only exists in Python 2; in Python 3 just iterate over distance.values()):

   neighbor_percent = sum(1 for value in distance.values() if value < dc) / num ** 2
Belhadjer Samir
  • thanks for helping. I think there is also a problem with the first block of code, where the values are read from the distance file. I tried to store those values in a file instead; would that be a solution? – user1 Feb 07 '21 at 19:33

The CutOff function checks every (i, j) pair, from 1 to max_id.

def CutOff(self, distance, max_id, threshold):
    for i in range(1, max_id + 1):
        for j in range(1, max_id + 1):

A sample data file provided in the GitHub link contains distance values for every ID pair from 1 to 2000 (so it has about 2M lines for the 2K IDs).

However, your data seems to be very sparse, because it has only 20,000 lines but contains large ID numbers such as 2686 and 13856. The error message 'KeyError: (1, 2)' tells you that there is no distance value between IDs 1 and 2.
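If the sparse input is actually what is intended, one option is to fall back to a default distance for missing pairs instead of indexing the dict directly. This is only an illustration; treating a missing pair as max_dis (the value returned by load_data) is an assumption, not part of the original code:

for j in range(1, max_id + 1):
    # dict.get avoids the KeyError; missing pairs are treated as "far apart"
    gap = distance.get((i, j), max_dis) - threshold
    tmp += 0 if gap >= 0 else 1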

Finally, it does not make sense to me that code loading only 20,000 lines of data (probably a few MB) raises an out-of-memory error. I guess your data is much larger, or the OOM error comes from another part of your code.
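A rough back-of-the-envelope estimate supports this; the bytes-per-entry figure below is an assumption for 64-bit CPython, not something measured from the post:

# 20,000 lines -> ~40,000 dict entries (each pair is stored twice), at very
# roughly 200 bytes per entry (tuple key, two ints, a float, dict overhead):
40_000 * 200          # ~8 MB, nowhere near the ~14.5 GB RSS shown by dmesg
# the 2.7 GB file mentioned in the comments is on the order of 100 million lines:
200_000_000 * 200     # ~40 GB, easily enough to trigger the OOM killer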

Purple