
I have a CSV file with over 750000 rows and 2 columns (SN, state).
SN is a serial number from 0 to ~750000 and state is 0 or 1.
I'm reading the CSV file with pandas, then for each row loading the .npy file whose name matches the SN, and appending the loaded arrays to two lists, x_train and x_val.
x_val should hold 2000 elements, of which 700 should have state = 1 and the rest state = 0. x_train should take everything else.
The problem is that after reading about ~190000 rows the process is killed and all the RAM is consumed (PC RAM = 32 GB):

x_train len=  195260
Size of list1: 1671784bytes
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

My code is:

import gc
import os
import sys

import numpy as np
import pandas

nodules_path = os.path.expanduser("~/cropped_nodules/")  # np.load does not expand "~" by itself
nodules_csv = pandas.read_csv("~/cropped_nodules_2.csv")

positive = 0
negative = 0
x_val = []
x_train = []
y_train = []
y_val = []

# iterrows() yields (index, row) pairs, so the row has to be unpacked
for _, nodule in nodules_csv.iterrows():

    if nodule.state == 1 and positive <= 700 and len(x_val) <= 2000:
        # validation set: positive samples
        positive += 1
        x_val_img = str(nodule.SN) + ".npy"
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)

    elif nodule.state == 0 and negative <= 1300 and len(x_val) <= 2000:
        # validation set: negative samples
        x_val_img = str(nodule.SN) + ".npy"
        negative += 1
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)

    else:
        # everything else goes to the training set
        if len(x_train) % 10000 == 0:
            gc.collect()
            print("gc done")
        x_train_img = str(nodule.SN) + ".npy"
        x_train.append(np.load(os.path.join(nodules_path, x_train_img)))
        y_train.append(nodule.state)
        print("x_train len= ", len(x_train))
        print("Size of list1: " + str(sys.getsizeof(x_train)) + "bytes")

I tried to use the following:

  1. calling gc manually
  2. using df.itertuples()
  3. using dataframe apply()

but the same problem happens after ~100000 rows.
I also tried pandas vectorization, but I didn't know how to write it, and I'm not sure these conditions can be expressed as vectorized operations (a rough sketch of what I mean is below).
Is there a better way to do this than the approach above?
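For reference, this is roughly what I imagine a vectorized split would look like (a sketch only; it builds the 700/1300 validation selection with boolean masks instead of row-by-row conditions, but the per-file np.load calls and the in-memory lists stay the same, so I don't expect it to change the memory behaviour):

# sketch: pick the validation rows with boolean masks instead of iterrows()
val_pos = nodules_csv[nodules_csv.state == 1].head(700)
val_neg = nodules_csv[nodules_csv.state == 0].head(1300)
val_df = pandas.concat([val_pos, val_neg])
train_df = nodules_csv.drop(val_df.index)

x_val = [np.load(os.path.join(nodules_path, str(sn) + ".npy")) for sn in val_df.SN]
y_val = val_df.state.tolist()
x_train = [np.load(os.path.join(nodules_path, str(sn) + ".npy")) for sn in train_df.SN]
y_train = train_df.state.tolist()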

I tried to implement chunked reading as @XtianP suggested:

with pandas.read_csv("~/cropped_nodules_2.csv", chunksize=chunksize) as reader:
    for chunk in reader:
        for index, nodule in chunk.iterrows():
            if nodule.state == 1 and positive <= 700 and len(x_val) <= 2000:
........

but the exact same problem happened! (Maybe my implementation was not correct.)
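In case my chunked implementation was the issue, here is a self-contained sketch of what I mean (the chunksize value is just an example). As far as I can tell, chunking only bounds the memory used by the CSV rows themselves; the loaded arrays still accumulate in x_train, which would explain why it made no difference:

chunksize = 10000  # example value
positive, negative = 0, 0
x_val, y_val, x_train, y_train = [], [], [], []

with pandas.read_csv("~/cropped_nodules_2.csv", chunksize=chunksize) as reader:
    for chunk in reader:
        for _, nodule in chunk.iterrows():
            # load the array once per row, then route it to validation or training
            arr = np.load(os.path.join(nodules_path, str(nodule.SN) + ".npy"))
            if nodule.state == 1 and positive <= 700 and len(x_val) <= 2000:
                positive += 1
                x_val.append(arr)
                y_val.append(nodule.state)
            elif nodule.state == 0 and negative <= 1300 and len(x_val) <= 2000:
                negative += 1
                x_val.append(arr)
                y_val.append(nodule.state)
            else:
                x_train.append(arr)
                y_train.append(nodule.state)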

Maybe the problem is not with pandas iterrows, but with the list growing too large!
But the sys.getsizeof(x_train) output is only: Size of list1: 1671784bytes
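If I understand sys.getsizeof correctly, it only measures the list object itself (essentially its array of pointers), not the numpy arrays those pointers reference, so the real footprint should be something like:

# getsizeof(x_train) counts only the list's pointer storage;
# the array payloads are counted via ndarray.nbytes
total_bytes = sum(arr.nbytes for arr in x_train)
print("data actually held by x_train:", total_bytes / 1024 ** 3, "GiB")

That would be consistent with the ~11.3 GiB that tracemalloc reports below for numpy/lib/format.py (the .npy loading code).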

I've traced the memory allocations as follows:

import tracemalloc

tracemalloc.start()

# ... run your application ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)

The results are:

[ Top 10 ]
/home/mustafa/.local/lib/python3.8/site-packages/numpy/lib/format.py:741: size=11.3 GiB, count=204005, average=58.2 KiB
/home/mustafa/.local/lib/python3.8/site-packages/numpy/lib/format.py:771: size=4781 KiB, count=102002, average=48 B
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py:4855: size=2391 KiB, count=102000, average=24 B
/home/mustafa/home/mustafa/project/LUNAMASK/nodule_3D_CNN.py:84: size=806 KiB, count=2, average=403 KiB
/home/mustafa/home/mustafa/project/LUNAMASK/nodule_3D_CNN.py:85: size=805 KiB, count=1, average=805 KiB
/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py:2056: size=78.0 KiB, count=2305, average=35 B
/usr/lib/python3.8/abc.py:102: size=42.5 KiB, count=498, average=87 B
/home/mustafa/.local/lib/python3.8/site-packages/numpy/core/_asarray.py:83: size=41.6 KiB, count=757, average=56 B
/usr/local/lib/python3.8/dist-packages/pandas/core/series.py:512: size=37.5 KiB, count=597, average=64 B
/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py:1880: size=16.5 KiB, count=5, average=3373 B
  • I suggest you have a look at [this post](https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas) – XtianP Apr 25 '21 at 19:05
  • @XtianP Thank you for your suggestion, I tried to implement it as follows: with pandas.read_csv("/home/mustafa/project/LUNA16/cropped_nodules_2.csv", chunksize=chunksize) as reader: for chunk in reader: for index, nodule in chunk.iterrows(): but still got the same problem x_train len= 190142 Size of list1: 1671784bytes Process finished with exit code 137 (interrupted by signal 9: SIGKILL) – Mustafa Mahmood Apr 25 '21 at 20:16
  • Is it an option to replace your `append`s with appending to a file? – Acccumulation Apr 25 '21 at 20:24
  • @Acccumulation no problem, but what type of file should I save it to? Because I need to load it again and feed it to CNN. I'm reading 3D imgs from npy files, and append the imgs to a list to create training and testing data. – Mustafa Mahmood Apr 25 '21 at 20:47
  • If I understand correctly, your nodules are already in separate files. Instead of loading them all to add them to x_val why not put in x_val just the path of the object? Then load the objects lazily, one by one, only when you need them? – XtianP Apr 26 '21 at 05:14
  • It's quite possible that I'm missing something, but I don't see why you would need all of your training data in memory at the same time. My understanding is that neural nets are trained by feeding the input layer; you should be able to compress each image to the features that layer looks at. – Acccumulation Apr 26 '21 at 18:15
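Following @XtianP's suggestion, I'm considering keeping only the file paths and labels in memory and loading the arrays lazily, batch by batch, only when they are fed to the network. A rough sketch of what I have in mind, using the train_df from the split sketch above and assuming all cropped nodules have the same shape so they can be stacked (the generator name and batch_size are just placeholders):

def lazy_batches(df, batch_size=32):
    # keep only SN/state in memory; load the .npy arrays one batch at a time
    for start in range(0, len(df), batch_size):
        rows = df.iloc[start:start + batch_size]
        x = np.stack([np.load(os.path.join(nodules_path, str(sn) + ".npy")) for sn in rows.SN])
        y = rows.state.to_numpy()
        yield x, y

# example: iterate over training batches without holding everything in RAM
for x_batch, y_batch in lazy_batches(train_df, batch_size=32):
    pass  # feed x_batch / y_batch to the CNN here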
