I have 50 CSV files, 14.8 GB in total, which I converted to Parquet files. Each file contains sensor data with millions of rows collected from patients. I have to do multi-class classification with 5 classes. There are 20 columns including the class column. My task is to group the data by class and split each class into segments of 6000 rows, i.e. samples shaped like 6000x19 image data, and then train a CNN model with TensorFlow Keras. I read the data with pandas: I read 4 files and append them to an empty DataFrame.
import pandas as pd

data_dir = r'/content/drive/MyDrive/uyku_parquet'
files = ['1', '2', '3', '5']
data = pd.DataFrame()
for file in files:
    df = pd.read_parquet(data_dir + '/' + file + '.parquet')
    data = pd.concat([data, df], ignore_index=True)  # DataFrame.append was removed in pandas 2.x
    del df
Then I label-encode the class column.
# Build a mapping from integer code -> original class label, then encode the column
siniflar_kodlanmamis = data["SleepStaging"].unique()
siniflar_kodlanmis = pd.DataFrame(data.SleepStaging.factorize()[0])[0].unique()
siniflar_map = dict(zip(siniflar_kodlanmis, siniflar_kodlanmamis))
data["SleepStaging"] = data.SleepStaging.factorize()[0]
After that I group rows of the same class and turn them into 6000x19 shaped samples.
import numpy as np

siniflar_map = {0: 'WK', 1: 'N1', 2: 'N2', 3: 'N3', 4: 'REM'}
data_tum = []        # all 6000x19 segments
data_siniflar = []   # class label of each segment
for k in siniflar_map:
    data_dizi = data[data.SleepStaging == k].to_numpy()
    # split each class into non-overlapping chunks of 6000 rows, keeping the 19 feature columns
    for i in range(data_dizi.shape[0] // 6000):
        dat = data_dizi[i * 6000:(i + 1) * 6000, 1:20]
        data_tum.append(dat)
        data_siniflar.append(k)
X = np.array(data_tum)
y = np.array(data_siniflar)
del data
del data_tum
del data_siniflar
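For reference, the training part looks roughly like the sketch below. The exact layers and hyperparameters here are placeholders (a simple 1-D CNN over the 6000x19 segments), not my real architecture; the point is that X has shape (n_samples, 6000, 19) and y holds the 5 integer class codes.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(6000, 19)),
    layers.Conv1D(32, kernel_size=7, activation='relu'),
    layers.MaxPooling1D(4),
    layers.Conv1D(64, kernel_size=7, activation='relu'),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation='relu'),
    layers.Dense(5, activation='softmax'),   # 5 sleep stage classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # y contains integer labels
              metrics=['accuracy'])
model.fit(X.astype('float32'), y, epochs=10, batch_size=32, validation_split=0.2)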
Up to this point there is no problem. However, some classes have far fewer samples, so the model trains with low accuracy and I need to read more data. When I try to read more files, it crashes.
data_dir = r'/content/drive/MyDrive/uyku_parquet'
files = ['1', '2', '3', '5', '22', '37', '18']
data = pd.DataFrame()
for file in files:
    df = pd.read_parquet(data_dir + '/' + file + '.parquet')
    data = pd.concat([data, df], ignore_index=True)
    del df
Reading more files runs out of RAM. In Google Colab the runtime constantly crashes with an out-of-RAM message. I also tried it on my teacher's computer, which has 32 GB of RAM; there, Jupyter Notebook and the whole machine freeze when I try to train the model after reading that many files.
What should I do and how should I proceed?