
I am running code that has always worked for me. This time I ran it on two .csv files: "data" (24 MB) and "data1" (475 MB). "data" has 3 columns of about 680,000 elements each, whereas "data1" has 3 columns of 33,000,000 elements each. When I run the code, I get just "Killed: 9" after some 5 minutes of processing. If this is a memory problem, how can I solve it? Any suggestion is welcome!

This is the code:

import csv
import numpy as np

from collections import OrderedDict # to save keys order

from numpy import genfromtxt
my_data = genfromtxt('data.csv', dtype='S',
                     delimiter=',', skip_header=1)
my_data1 = genfromtxt('data1.csv', dtype='S',
                      delimiter=',', skip_header=1)

d = OrderedDict((rows[2], rows[1]) for rows in my_data)
d1 = dict((rows[0], rows[1]) for rows in my_data1)

dset = set(d) # returns keys
d1set = set(d1)

d_match = dset.intersection(d1set) # returns matched keys

import sys  
sys.stdout = open("rs_pos_ref_alt.csv", "w") 

for row in my_data:
    if row[2] in d_match: 
        print [row[1], row[2]]

The first rows of "data" are:

    dbSNP RS ID Physical Position
0   rs4147951   66943738
1   rs2022235   14326088
2   rs6425720   31709555
3   rs12997193  106584554
4   rs9933410   82323721
5   rs7142489   35532970

The first rows of "data1" are:

    V2  V4  V5
10468   TC  T
10491   CC  C
10518   TG  T
10532   AG  A
10582   TG  T
albert
Lucas
  • Do you run this on your own computer or on some server? In case it runs on a server, maybe there's some script monitoring for processes that are "running amok", calling `kill -9` on them after some time. – tobias_k Dec 14 '15 at 13:43
  • Hi @tobias_k, I run it on my own laptop – Lucas Dec 14 '15 at 13:45
  • How do you "get" `Killed: 9"? In the standard output, or in an exception message? – Raimund Krämer Dec 14 '15 at 14:56
  • Hi @D.Everhard, in the standard output: Lucass-MacBook-Air:txt.long lucas$ python match_pos_snp.py Killed: 9 – Lucas Dec 14 '15 at 15:03

5 Answers


Most likely the kernel kills it because your script consumes too much memory. You need to take a different approach and try to minimize the amount of data held in memory at once.

You may also find this question useful: Very large matrices using Python and NumPy

In the following code snippet I try to avoid loading the huge data1.csv into memory, by processing it line by line. Give it a try.

import csv

from collections import OrderedDict # to save keys order

with open('data.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader) #skip header
    d = OrderedDict((rows[2], {"val": rows[1], "flag": False}) for rows in reader)

with open('data1.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader) #skip header
    for rows in reader:
        if rows[0] in d:
            d[rows[0]]["flag"] = True

import sys
sys.stdout = open("rs_pos_ref_alt.csv", "w")

for k, v in d.iteritems():
    if v["flag"]:
        print [v["val"], k]
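The snippet above is Python 2 (`iteritems` and the `print` statement). For reference, here is a roughly equivalent Python 3 sketch of the same line-by-line idea, writing matches with `csv.writer` instead of redirecting stdout; the function and file names are illustrative:

```python
import csv
from collections import OrderedDict

def match_keys(data_path, data1_path, out_path):
    # Build {key: value} from the small file; OrderedDict keeps
    # the original row order (plain dict also does on Python 3.7+)
    with open(data_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        d = OrderedDict((row[2], row[1]) for row in reader)

    # Stream the big file and record which keys appear in both
    matched = set()
    with open(data1_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            if row[0] in d:
                matched.add(row[0])

    # Write the matched (value, key) pairs in original order
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for key, val in d.items():
            if key in matched:
                writer.writerow([val, key])
```

Only the small file's key/value pairs and the set of matched keys are ever held in memory; the 475 MB file is read one row at a time.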
frizzby

First off, create a Python script and run the following code to find all Python processes (note that `wmic` is Windows-only):

import subprocess

wmic_cmd = """wmic process where "name='python.exe' or name='pythonw.exe'" get commandline,processid"""
wmic_prc = subprocess.Popen(wmic_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
wmic_out, wmic_err = wmic_prc.communicate()
pythons = [item.rsplit(None, 1) for item in wmic_out.splitlines() if item][1:]
pythons = [[cmdline, int(pid)] for [cmdline, pid] in pythons]
for line in pythons:
    cv = str(line).split('\\')
    fin = cv[-1]  # last path component plus arguments/pid
    if fin[:11] == 'pythonw.exe':
        print 'pythonw.exe', fin
    elif fin[:10] == 'python.exe':
        print 'python.exe', fin

After you have run it, paste the output here in the question, where I will see a notification.

EDIT:

To list all processes and post them in your answer, use the following:

import psutil
for process in psutil.process_iter():
    print process
ajsp
  • Thank you @ajsp, I run your code but I don't see any output, at least not in the terminal. The script seems to run fine though. – Lucas Dec 14 '15 at 15:12
  • There is process somewhere running -9 kill, if you find it you have found your culprit. Were you working on any other code recently? Did you write a script somewhere along the line that kills a PID number? – ajsp Dec 14 '15 at 15:16
  • Post the processes in your original answer! – ajsp Dec 14 '15 at 15:36

How much memory does your computer have?

You can add a couple of optimizations that will save some memory, and if that's not enough, you can trade off some CPU and I/O for better memory efficiency.

If you're only comparing the keys and don't really do anything with the values, you can extract only the keys:

d1 = {rows[0] for rows in my_data1}

Then, instead of OrderedDict, you can try using an ordered set, either from this answer (Does Python have an ordered set?) or the ordered-set module from PyPI.
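On Python 3.7+ a plain `dict` can also stand in for an ordered set, since dict keys preserve insertion order; a minimal sketch:

```python
def ordered_unique(items):
    # dict.fromkeys deduplicates while preserving first-seen order,
    # so the keys behave like an insertion-ordered set (Python 3.7+)
    return list(dict.fromkeys(items))

keys = ordered_unique(["rs2", "rs1", "rs2", "rs3"])
# keys == ["rs2", "rs1", "rs3"]
```

This avoids a third-party dependency and stores only one object per unique key.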

Once you've got all the intersecting keys, you can write another program that looks up all the matching values from the source csv.

If these optimizations aren't enough, you can extract all the keys from the bigger set, save them to a file, and then load the keys one by one from that file using a generator, so the program only keeps one set of keys plus a single key in memory, instead of two sets.
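That generator idea could be sketched as follows, assuming the larger file's keys were previously dumped one per line to a text file (the names here are illustrative):

```python
def iter_keys(path):
    # Generator: yields one key per line, so only a single
    # line is ever held in memory at a time
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def count_matches(keys_path, small_set):
    # Keeps one set (small_set) plus the current key in memory,
    # instead of materializing a second full set of keys
    return sum(1 for key in iter_keys(keys_path) if key in small_set)
```

The same loop shape works for collecting the matches instead of counting them.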

I'd also suggest using Python's pickle module for storing intermediate results.
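A minimal sketch of saving an intermediate key set with `pickle` (the paths and function names are illustrative):

```python
import pickle

def save_keys(keys, path):
    # Serialize the intermediate key set to disk
    with open(path, "wb") as f:
        pickle.dump(keys, f)

def load_keys(path):
    # Restore the key set in a later run, skipping the expensive parse
    with open(path, "rb") as f:
        return pickle.load(f)
```

This way the slow extraction step runs once, and subsequent runs start from the saved set.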

Alex Volkov

In my case there was some process called syspolicy (or something like that) consuming 90% CPU; once I killed that process, running my python3 command no longer returned "Killed: 9".

luky

Cleaning up some hard drive space worked for me.

Here is what I think was happening: like you, I was reading a huge numpy array; in my case it was over 10**10 numbers. Since it was too large to fit in RAM (I had 16 GB of RAM), the operating system had to swap some of it to the hard drive. When there was not enough hard drive space left, the program crashed with a "Killed: 9" error. Once I made some space, it ran fine, under otherwise identical conditions and data.

A better alternative, if one has time, is to rewrite the program so it does not read in so much data at once!
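Such a rewrite could process the file in fixed-size batches instead of loading it whole; a stdlib-only sketch, where the batch size and the per-batch work are placeholders:

```python
from itertools import islice

def process_in_batches(path, batch_size=100_000):
    # Read and reduce the file batch by batch, so at most
    # batch_size lines are in memory at any moment
    total = 0
    with open(path) as f:
        next(f)  # skip the header row
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                break
            total += len(batch)  # replace with the real per-batch work
    return total
```

The peak memory use is then bounded by the batch size, not the file size.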

Peter B