
I have 50 CSV files, 14.8 GB in total, which I have converted to Parquet files. Each file contains millions of rows of sensor data collected from patients. I have to do multi-class classification with 5 classes. There are 20 columns, including the class column. My task is to group the data by class and then split each class into chunks of 6000 rows, i.e. turn it into 6000x19 image-like samples, and train a CNN model using TensorFlow Keras. I read the data with pandas: I read 4 files and append them to an empty DataFrame.

import pandas as pd
import numpy as np

data_dir = r'/content/drive/MyDrive/uyku_parquet'
files = ['1', '2', '3', '5']
data = pd.DataFrame()
for file in files:
    df = pd.read_parquet(data_dir + '/' + file + '.parquet')
    data = data.append(df)  # DataFrame.append is deprecated, see comments
    del df

Then I label-encode the class column.

# Factorize the class column once: codes are the integer labels,
# uniques are the original class names in order of first appearance
kodlar, siniflar_kodlanmamis = data["SleepStaging"].factorize()
siniflar_kodlanmis = pd.unique(kodlar)

# Map each integer code back to its original label
siniflar_map = dict(zip(siniflar_kodlanmis, siniflar_kodlanmamis))

data["SleepStaging"] = kodlar

After that, I group the rows of each class and reshape them into 6000x19 samples.

siniflar_map = {0: 'WK', 1: 'N1', 2: 'N2', 3: 'N3', 4: 'REM'}

data_tum = []
data_siniflar = []
for k in siniflar_map:
    data_dizi = data[data.SleepStaging == k].to_numpy()
    # Cut this class's rows into non-overlapping 6000-row chunks,
    # keeping columns 1..19 as the 19 feature columns
    for i in range(data_dizi.shape[0] // 6000):
        dat = data_dizi[i * 6000:(i + 1) * 6000, 1:20]
        data_tum.append(dat)
        data_siniflar.append(k)

X = np.array(data_tum)
y = np.array(data_siniflar)
del data
del data_tum
del data_siniflar
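
The training code itself is not shown here; as a minimal sketch, a 1D-CNN along these lines is trained on the (6000, 19) segments (the layers, filter sizes and hyperparameters below are placeholders, not the actual architecture):

import tensorflow as tf

# Minimal sketch only: layer choices are placeholders, not the real model
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(6000, 19)),
    tf.keras.layers.Conv1D(32, kernel_size=7, activation='relu'),
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(64, kernel_size=7, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(5, activation='softmax'),  # 5 sleep stage classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)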

No problems so far. However, when training the model, accuracy is low because some classes have much less data than others, so I need to read more data.
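
The imbalance is easy to see by counting how many 6000x19 segments each class produced, for example:

# Segment count per class (siniflar_map from above)
values, counts = np.unique(y, return_counts=True)
for k, n in zip(values, counts):
    print(siniflar_map[k], n)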

It crashes when I try to read more data.

data_dir = r'/content/drive/MyDrive/uyku_parquet'
files = ['1', '2', '3', '5', '22', '37', '18']
data = pd.DataFrame()
for file in files:
    df = pd.read_parquet(data_dir + '/' + file + '.parquet')
    data = data.append(df)
    del df

When I try to read more files, it runs out of RAM. I tried it in Google Colab and the runtime keeps crashing with an out-of-RAM message. I also tried it on my teacher's computer, which has 32 GB of RAM, but it also crashes when I read too many files: the Jupyter notebook and the whole computer freeze when I try to train the model.
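
For reference, the in-memory size of the concatenated DataFrame can be checked like this; it is usually much larger than the compressed Parquet files on disk:

# Total RAM used by the DataFrame, in GB
print(data.memory_usage(deep=True).sum() / 1e9)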

What should I do and how should I proceed?

  • Where exactly is it crashing and with what message? Please add the error tracebacks. (You're also saying "no problems so far" – you should show the code that _is_ having the problem.) – AKX Jul 20 '22 at 11:09
  • @AKX I've edited the question, can you look at it again? – AbuMuhandisAlTurki Jul 20 '22 at 11:19
  • Have you tried `pd.concat(..., copy=False)` instead of the deprecated `data.append()`? If the issue is loading the raw data, it might make sense to put it in an SQL database where Pandas doesn't need to do as much in-memory manipulation itself and you could just query for the partition you need. – AKX Jul 20 '22 at 11:27
  • You might be able to avoid the costly concatenation too with `ParquetDataset` and a list of filenames, see https://stackoverflow.com/a/56586590/51685 – AKX Jul 20 '22 at 11:32
  • @AKX I tried pd.concat and it does the same as append: RAM crash. I read the link you sent, but I did not understand how to implement it, because they are reading the data from S3 and I am reading it from Google Drive. – AbuMuhandisAlTurki Jul 20 '22 at 12:14
  • Just don't do anything S3-specific: `filenames = [f'/content/drive/MyDrive/uyku_parquet/{x}.parquet' for x in ['1', '2', '3', '5', '22', '37', '18']]; df = pq.ParquetDataset(filenames).read_pandas().to_pandas()` ? – AKX Jul 20 '22 at 12:18
  • @AKX Thank you, I tried it and it allowed me to read 3-4 more datasets, but after that I also got a crash because of RAM. I need to read more datasets. What should I do? – AbuMuhandisAlTurki Jul 20 '22 at 12:26
  • Well, the easiest is, of course, to buy more memory. The harder option would be to forego Pandas altogether and just work with Numpy arrays, which should probably be more efficient. ([Someone](https://betterprogramming.pub/i-forked-asyncpg-and-it-parses-database-records-to-numpy-20x-faster-e71024a84bff) recently made it possible to directly load Numpy data from PostgreSQL, too...) – AKX Jul 20 '22 at 12:32
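
Putting the suggestions from the comments together, a minimal sketch of both loading approaches (paths and file names taken from the question; the result still has to fit in RAM):

import pandas as pd
import pyarrow.parquet as pq

data_dir = '/content/drive/MyDrive/uyku_parquet'
files = ['1', '2', '3', '5', '22', '37', '18']

# Option 1: collect the frames in a list and concatenate once, avoiding repeated copies
frames = [pd.read_parquet(f'{data_dir}/{f}.parquet') for f in files]
data = pd.concat(frames, ignore_index=True, copy=False)

# Option 2: let pyarrow read all files as one dataset, then convert to pandas
filenames = [f'{data_dir}/{f}.parquet' for f in files]
data = pq.ParquetDataset(filenames).read_pandas().to_pandas()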

0 Answers