I have a CSV file with over 750,000 rows and 2 columns (SN, state).
SN is a serial number from 0 to ~750,000, and state is either 0 or 1.
I'm reading the CSV with pandas, then loading the .npy file whose name matches each row's SN, and appending the loaded arrays to two lists named x_train and x_val.
x_val should hold 2000 elements, of which 700 should have state = 1 and the rest state = 0.
x_train should take everything else.
The problem is that after reading about ~190,000 rows the process stops and all the RAM is consumed (the PC has 32 GB of RAM):
x_train len= 195260
Size of list1: 1671784bytes
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
My code is:
import os
import sys
import gc

import numpy as np
import pandas

# expand ~ so that np.load / open() can find the files
nodules_path = os.path.expanduser("~/cropped_nodules/")
nodules_csv = pandas.read_csv("~/cropped_nodules_2.csv")

positive = 0
negative = 0
x_val = []
x_train = []
y_train = []
y_val = []

# iterrows() yields (index, row) pairs, so unpack the index
for _, nodule in nodules_csv.iterrows():
    if nodule.state == 1 and positive <= 700 and len(x_val) <= 2000:
        # first ~700 positive nodules go to the validation set
        positive += 1
        x_val_img = str(nodule.SN) + ".npy"
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)
    elif nodule.state == 0 and negative <= 1300 and len(x_val) <= 2000:
        # first ~1300 negative nodules go to the validation set
        x_val_img = str(nodule.SN) + ".npy"
        negative += 1
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)
    else:
        # everything else goes to the training set
        if len(x_train) % 10000 == 0:
            gc.collect()
            print("gc done")
        x_train_img = str(nodule.SN) + ".npy"
        x_train.append(np.load(os.path.join(nodules_path, x_train_img)))
        y_train.append(nodule.state)
        print("x_train len= ", len(x_train))
        print("Size of list1: " + str(sys.getsizeof(x_train)) + "bytes")
I tried the following:
- calling gc manually
- using df.itertuples()
- using DataFrame.apply()
but the same problem happens after ~100,000 rows.
I also tried pandas vectorization, but I didn't know how to do it, and I'm not sure these conditions can be expressed that way (a rough attempt is sketched below).
Is there a better way to do this?
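For what it's worth, something like the following is what I had in mind for the vectorized split. It only builds the train/validation selection from the CSV (following the 700 positives / 1300 negatives described above); the .npy files would still have to be loaded afterwards, and I'm not sure it is really equivalent to the loop:
import pandas

df = pandas.read_csv("~/cropped_nodules_2.csv")

# running count of positives/negatives seen so far, in file order
pos_rank = (df.state == 1).cumsum()
neg_rank = (df.state == 0).cumsum()

# first 700 positives and first 1300 negatives -> validation, the rest -> training
val_mask = ((df.state == 1) & (pos_rank <= 700)) | ((df.state == 0) & (neg_rank <= 1300))

val_df = df[val_mask]      # 2000 rows for x_val / y_val
train_df = df[~val_mask]   # everything else for x_train / y_train

y_val = val_df.state.to_numpy()
y_train = train_df.state.to_numpy()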
I tried to implement chunking as @XtianP suggested:
with pandas.read_csv("~/cropped_nodules_2.csv", chunksize=chunksize) as reader:
    for chunk in reader:
        for index, nodule in chunk.iterrows():
            if nodule.state == 1 and positive <= 700 and len(x_val) <= 2000:
                ........
but the exact same problem happened (maybe my implementation was not correct).
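For reference, the skeleton I followed looks roughly like this; process_row and the chunksize value are just hypothetical stand-ins, with the body of process_row being the same if/elif/else logic as in the full loop above:
import pandas

chunksize = 10000  # hypothetical value; any chunk size shows the same pattern

def process_row(nodule):
    # stand-in for the if/elif/else routing logic from the full loop above
    pass

with pandas.read_csv("~/cropped_nodules_2.csv", chunksize=chunksize) as reader:
    for chunk in reader:
        for _, nodule in chunk.iterrows():
            process_row(nodule)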
Maybe the problem is not with pandas iterrows at all, but with the lists growing too large.
But the sys.getsizeof(x_train) output is only:
Size of list1: 1671784bytes
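As far as I understand, sys.getsizeof(x_train) only measures the list object itself (basically its array of pointers), not the numpy arrays it references, which would explain why it stays small while the RAM fills up. A quick check seems to show the gap (the (64, 64, 64) shape is just a made-up example, not my real nodule size):
import sys
import numpy as np

dummy = [np.zeros((64, 64, 64), dtype=np.float32) for _ in range(10)]

print(sys.getsizeof(dummy))            # size of the list's pointer storage only (a few hundred bytes)
print(sum(a.nbytes for a in dummy))    # actual array data: 10 * 64*64*64 * 4 bytes, about 10.5 MB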
I've traced memory allocations as follows:
import tracemalloc
tracemalloc.start()
# ... run your application ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)
The results are:
[ Top 10 ]
/home/mustafa/.local/lib/python3.8/site-packages/numpy/lib/format.py:741: size=11.3 GiB, count=204005, average=58.2 KiB
/home/mustafa/.local/lib/python3.8/site-packages/numpy/lib/format.py:771: size=4781 KiB, count=102002, average=48 B
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py:4855: size=2391 KiB, count=102000, average=24 B
/home/mustafa/home/mustafa/project/LUNAMASK/nodule_3D_CNN.py:84: size=806 KiB, count=2, average=403 KiB
/home/mustafa/home/mustafa/project/LUNAMASK/nodule_3D_CNN.py:85: size=805 KiB, count=1, average=805 KiB
/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py:2056: size=78.0 KiB, count=2305, average=35 B
/usr/lib/python3.8/abc.py:102: size=42.5 KiB, count=498, average=87 B
/home/mustafa/.local/lib/python3.8/site-packages/numpy/core/_asarray.py:83: size=41.6 KiB, count=757, average=56 B
/usr/local/lib/python3.8/dist-packages/pandas/core/series.py:512: size=37.5 KiB, count=597, average=64 B
/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py:1880: size=16.5 KiB, count=5, average=3373 B