I have a large CSV file (~14 GB, 124 columns) and I get a memory error while reading it:

df = pd.read_csv(r'C:\Users\AdamPer\Desktop\Python\Magisterka\test2.csv', encoding="utf_8_sig")

I tried setting low_memory=False and error_bad_lines=False, but that doesn't help, so I decided to set dtype explicitly and ran into a problem with that. Here is what I've done.

I made a smaller CSV file (~16 MB), read it into a DataFrame, and checked the column types with df.info(max_cols=200):

Soft              39347 non-null object
Hand_ID           39347 non-null int64
Table_Name        39345 non-null object
SmallBlind        39347 non-null float64
BigBlind          39347 non-null float64
Currency          39347 non-null object
Day               39347 non-null object
Hour              39347 non-null object
Seat_1            39347 non-null object
Seat_2            39347 non-null object
Seat_3            39347 non-null object
Seat_4            39347 non-null object
Seat_5            39347 non-null object
Seat_6            39347 non-null object
Stack_1           39347 non-null float64
Stack_2           39347 non-null float64
Stack_3           39347 non-null float64
Stack_4           39347 non-null float64
Stack_5           39347 non-null float64
Stack_6           39347 non-null float64
Raise_Pre_S1      39347 non-null object
Raise_Pre_S2      39347 non-null object
Raise_Pre_S3      39347 non-null object
Raise_Pre_S4      39347 non-null object
Raise_Pre_S5      39347 non-null object
Raise_Pre_S6      39347 non-null object
Call_Pre_S1       39347 non-null object
Call_Pre_S2       39347 non-null object
Call_Pre_S3       39347 non-null object
Call_Pre_S4       39347 non-null object
Call_Pre_S5       39347 non-null object
Call_Pre_S6       39347 non-null object
Flop_Bet_S1       39347 non-null float64
Flop_Bet_S2       39347 non-null float64
Flop_Bet_S3       39347 non-null float64
Flop_Bet_S4       39347 non-null float64
Flop_Bet_S5       39347 non-null float64
Flop_Bet_S6       39347 non-null float64
Flop_Raise_S1     39347 non-null object
Flop_Raise_S2     39347 non-null object
Flop_Raise_S3     39347 non-null object
Flop_Raise_S4     39347 non-null object
Flop_Raise_S5     39347 non-null object
Flop_Raise_S6     39347 non-null object
Flop_Call_S1      39347 non-null object
Flop_Call_S2      39347 non-null object
Flop_Call_S3      39347 non-null object
Flop_Call_S4      39347 non-null object
Flop_Call_S5      39347 non-null object
Flop_Call_S6      39347 non-null object
Saw_Flop_S1       39347 non-null int64
Saw_Flop_S2       39347 non-null int64
Saw_Flop_S3       39347 non-null int64
Saw_Flop_S4       39347 non-null int64
Saw_Flop_S5       39347 non-null int64
Saw_Flop_S6       39347 non-null int64
Turn_Bet_S1       39347 non-null float64
Turn_Bet_S2       39347 non-null float64
Turn_Bet_S3       39347 non-null float64
Turn_Bet_S4       39347 non-null float64
Turn_Bet_S5       39347 non-null float64
Turn_Bet_S6       39347 non-null float64
Turn_Raise_S1     39347 non-null object
Turn_Raise_S2     39347 non-null object
Turn_Raise_S3     39347 non-null object
Turn_Raise_S4     39347 non-null object
Turn_Raise_S5     39347 non-null object
Turn_Raise_S6     39347 non-null object
Turn_Call_S1      39347 non-null object
Turn_Call_S2      39347 non-null object
Turn_Call_S3      39347 non-null object
Turn_Call_S4      39347 non-null object
Turn_Call_S5      39347 non-null object
Turn_Call_S6      39347 non-null object
Saw_Turn_S1       39347 non-null int64
Saw_Turn_S2       39347 non-null int64
Saw_Turn_S3       39347 non-null int64
Saw_Turn_S4       39347 non-null int64
Saw_Turn_S5       39347 non-null int64
Saw_Turn_S6       39347 non-null int64
River_Bet_S1      39347 non-null float64
River_Bet_S2      39347 non-null float64
River_Bet_S3      39347 non-null float64
River_Bet_S4      39347 non-null float64
River_Bet_S5      39347 non-null float64
River_Bet_S6      39347 non-null float64
River_Raise_S1    39347 non-null object
River_Raise_S2    39347 non-null object
River_Raise_S3    39347 non-null object
River_Raise_S4    39347 non-null object
River_Raise_S5    39347 non-null object
River_Raise_S6    39347 non-null object
River_Call_S1     39347 non-null object
River_Call_S2     39347 non-null object
River_Call_S3     39347 non-null object
River_Call_S4     39347 non-null object
River_Call_S5     39347 non-null object
River_Call_S6     39347 non-null object
Saw_River_S1      39347 non-null int64
Saw_River_S2      39347 non-null int64
Saw_River_S3      39347 non-null int64
Saw_River_S4      39347 non-null int64
Saw_River_S5      39347 non-null int64
Saw_River_S6      39347 non-null int64
S1_shows?         39347 non-null int64
S2_shows?         39347 non-null int64
S3_shows?         39347 non-null int64
S4_shows?         39347 non-null int64
S5_shows?         39347 non-null int64
S6_shows?         39347 non-null int64
Winner?_S1        39347 non-null int64
Winner?_S2        39347 non-null int64
Winner?_S3        39347 non-null int64
Winner?_S4        39347 non-null int64
Winner?_S5        39347 non-null int64
Winner?_S6        39347 non-null int64
W/L_amount_S1     39347 non-null float64
W/L_amount_S2     39347 non-null float64
W/L_amount_S3     39347 non-null float64
W/L_amount_S4     39347 non-null float64
W/L_amount_S5     39347 non-null float64
W/L_amount_S6     39347 non-null float64
Pot               39347 non-null float64
Rake              39347 non-null float64

Based on that, I set the dtypes:

import numpy as np
import pandas as pd

# np.object was removed in NumPy 1.24; the builtin object is the equivalent dtype.
dtypes = {'Soft': object,
          'Hand_ID': np.int64,
          'Table_Name': object,
          'SmallBlind': np.float64,
          'BigBlind': np.float64,
          'Currency': object,
          'Day': object,
          'Hour': object,
          'Seat_1': object, 'Seat_2': object, 'Seat_3': object, 'Seat_4': object, 'Seat_5': object, 'Seat_6': object,
          'Stack_1': np.float64, 'Stack_2': np.float64, 'Stack_3': np.float64, 'Stack_4': np.float64, 'Stack_5': np.float64, 'Stack_6': np.float64,
          'Raise_Pre_S1': object, 'Raise_Pre_S2': object, 'Raise_Pre_S3': object, 'Raise_Pre_S4': object, 'Raise_Pre_S5': object, 'Raise_Pre_S6': object,
          'Call_Pre_S1': object, 'Call_Pre_S2': object, 'Call_Pre_S3': object, 'Call_Pre_S4': object, 'Call_Pre_S5': object, 'Call_Pre_S6': object,
          'Flop_Bet_S1': np.float64, 'Flop_Bet_S2': np.float64, 'Flop_Bet_S3': np.float64, 'Flop_Bet_S4': np.float64, 'Flop_Bet_S5': np.float64, 'Flop_Bet_S6': np.float64,
          'Flop_Raise_S1': object, 'Flop_Raise_S2': object, 'Flop_Raise_S3': object, 'Flop_Raise_S4': object, 'Flop_Raise_S5': object, 'Flop_Raise_S6': object,
          'Flop_Call_S1': object, 'Flop_Call_S2': object, 'Flop_Call_S3': object, 'Flop_Call_S4': object, 'Flop_Call_S5': object, 'Flop_Call_S6': object,
          'Saw_Flop_S1': np.int64, 'Saw_Flop_S2': np.int64, 'Saw_Flop_S3': np.int64, 'Saw_Flop_S4': np.int64, 'Saw_Flop_S5': np.int64, 'Saw_Flop_S6': np.int64,
          'Turn_Bet_S1': np.float64, 'Turn_Bet_S2': np.float64, 'Turn_Bet_S3': np.float64, 'Turn_Bet_S4': np.float64, 'Turn_Bet_S5': np.float64, 'Turn_Bet_S6': np.float64,
          'Turn_Raise_S1': object, 'Turn_Raise_S2': object, 'Turn_Raise_S3': object, 'Turn_Raise_S4': object, 'Turn_Raise_S5': object, 'Turn_Raise_S6': object,
          'Turn_Call_S1': object, 'Turn_Call_S2': object, 'Turn_Call_S3': object, 'Turn_Call_S4': object, 'Turn_Call_S5': object, 'Turn_Call_S6': np.float64,
          'Saw_Turn_S1': np.int64, 'Saw_Turn_S2': np.int64, 'Saw_Turn_S3': np.int64, 'Saw_Turn_S4': np.int64, 'Saw_Turn_S5': np.int64, 'Saw_Turn_S6': np.int64,
          'River_Bet_S1': np.float64, 'River_Bet_S2': np.float64, 'River_Bet_S3': np.float64, 'River_Bet_S4': np.float64, 'River_Bet_S5': np.float64, 'River_Bet_S6': np.float64,
          'River_Raise_S1': object, 'River_Raise_S2': object, 'River_Raise_S3': object, 'River_Raise_S4': object, 'River_Raise_S5': object, 'River_Raise_S6': object,
          'River_Call_S1': object, 'River_Call_S2': object, 'River_Call_S3': object, 'River_Call_S4': object, 'River_Call_S5': object, 'River_Call_S6': object,
          'Saw_River_S1': np.int64, 'Saw_River_S2': np.int64, 'Saw_River_S3': np.int64, 'Saw_River_S4': np.int64, 'Saw_River_S5': np.int64, 'Saw_River_S6': np.int64,
          'S1_shows?': np.int64, 'S2_shows?': np.int64, 'S3_shows?': np.int64, 'S4_shows?': np.int64, 'S5_shows?': np.int64, 'S6_shows?': np.int64,
          'Winner?_S1': np.int64, 'Winner?_S2': np.int64, 'Winner?_S3': np.int64, 'Winner?_S4': np.int64, 'Winner?_S5': np.int64, 'Winner?_S6': np.int64,
          'W/L_amount_S1': np.float64, 'W/L_amount_S2': np.float64, 'W/L_amount_S3': np.float64, 'W/L_amount_S4': np.float64, 'W/L_amount_S5': np.float64, 'W/L_amount_S6': np.float64,
          'Pot': np.float64,
          'Rake': np.float64}

and tried to read the same CSV with this code:

df = pd.read_csv(r'C:\Users\AdamPer\Desktop\Python\Magisterka\test2.csv', encoding= "utf_8_sig", dtype=dtypes)

and it raises an error:

ValueError: could not convert string to float: '[]'

Any ideas how to solve this? Link to smaller csv file
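
One thing that may be worth checking is whether every entry in the dtype dict matches the df.info() listing. For example, Turn_Call_S6 is reported as object there but is declared as float64 in the dict. A minimal sketch of how such a mismatch could be located is shown below: it reads every column as text and then reports the values that fail numeric conversion. The in-memory sample data and the names sample_csv, declared and problems are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: column names are borrowed from the
# question, the values are invented for illustration only.
sample_csv = io.StringIO(
    "Hand_ID,Turn_Call_S5,Turn_Call_S6,Pot\n"
    "1,[],[],0.5\n"
    "2,[0.25],[],1.0\n"
    "3,[],[0.5],2.5\n"
)

# Declared dtypes, mirroring the Turn_Call_S6 mismatch described above.
declared = {
    "Hand_ID": "int64",
    "Turn_Call_S5": "object",
    "Turn_Call_S6": "float64",
    "Pot": "float64",
}

# Read everything as text so that no conversion can fail during parsing.
raw = pd.read_csv(sample_csv, dtype=str)

# For each column declared as numeric, collect the values that are present
# in the file but cannot be converted to a number.
problems = {}
for column, dtype in declared.items():
    if dtype == "object":
        continue
    converted = pd.to_numeric(raw[column], errors="coerce")
    bad = raw.loc[converted.isna() & raw[column].notna(), column]
    if not bad.empty:
        problems[column] = sorted(bad.unique())

print(problems)
```

Any column that shows up in problems either needs an object dtype in the dict or needs cleaning before it can be read as a number.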

  • So how much physical RAM do you have? Will this even fit into memory? – EdChum May 21 '19 at 08:16
  • I have 12 GB of RAM and I'm using 64-bit Python. – PerczynskiAdam May 21 '19 at 08:20
  • @Ciamciaramcia Do you get the same error even if you use a small chunk of the original .csv file (let's say 100MB)? – balkon16 May 21 '19 at 08:22
  • Well, that should answer your question then: you either read in chunks, work on those chunks and write the result out, or look at PyTables, see: https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas. Basically, trying to read this whole file into memory will not work. Besides, judging by your last error, you have some duff data which you need to sort out. – EdChum May 21 '19 at 08:22
  • When I read the 16 MB CSV file (with data similar to the ~14 GB one) without the dtype argument, there is no error. When I read either the 16 MB or the 14 GB CSV with dtype, I get the same ValueError. – PerczynskiAdam May 21 '19 at 08:25
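
The chunked approach suggested in the comments could look roughly like the sketch below. It passes chunksize to pd.read_csv so that only a few rows are held in memory at a time, and keeps a running aggregate instead of the whole DataFrame. The in-memory sample data, the chunk size of 2 and the aggregated quantity (total pot) are assumptions chosen for illustration. For the real file, a much larger chunk size and the actual path would be used.

```python
import io

import pandas as pd

# Small in-memory stand-in for the 14 GB file; the values are invented.
sample_csv = io.StringIO(
    "Hand_ID,Currency,Pot,Rake\n"
    "1,USD,0.5,0.0\n"
    "2,USD,1.0,0.0\n"
    "3,EUR,2.5,0.25\n"
    "4,USD,4.0,0.25\n"
    "5,EUR,2.0,0.0\n"
)

dtypes = {"Hand_ID": "int64", "Currency": "object", "Pot": "float64", "Rake": "float64"}

total_pot = 0.0
rows_seen = 0

# With chunksize set, read_csv returns an iterator of DataFrames, so only
# one chunk is materialised in memory at any time.
for chunk in pd.read_csv(sample_csv, dtype=dtypes, chunksize=2):
    total_pot += float(chunk["Pot"].sum())
    rows_seen += len(chunk)

print(rows_seen, total_pot)
```

Each chunk is an ordinary DataFrame, so per-chunk filtering or writing results out to disk fits in the same loop.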
