I am working with a dataset that is roughly a 6 GB CSV file. Loading it straight into pandas in one go obviously throws memory errors, so I tried to process it with a chunksize of 1000000 instead, but then the run just ends with "Process finished with exit code -1073741819 (0xC0000005)" (which I understand is a Windows access-violation error) whenever I try to access any of the information.
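The basic pattern I'm trying to follow is the standard chunked read: filter each chunk, keep the filtered pieces, and concatenate them at the end. Stripped down to a minimal sketch (the filter here is just a dummy stand-in for my real preprocessing; "raw_visit_counts" is one of the columns in my file), it would be something like:

import pandas as pd
from tqdm import tqdm

CHUNK_SIZE = 1_000_000

def dummy_filter(chunk):
    # stand-in for my real per-chunk preprocessing:
    # keep only rows that actually have a visit count
    return chunk[chunk["raw_visit_counts"].notna()]

filtered_chunks = []
for chunk in tqdm(pd.read_csv("march_test.csv", chunksize=CHUNK_SIZE)):
    filtered_chunks.append(dummy_filter(chunk))

result = pd.concat(filtered_chunks, ignore_index=True)
print(result.shape)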
I will post the code I have below:
Code:
import pandas as pd
import numpy as np
import os
import dask.dataframe as dd  # not used yet, imported while experimenting
import psutil
from tqdm import tqdm

f_path = r"..\..\..\Desktop\march_test.csv"

# check available RAM against the file's size on disk
svmem = psutil.virtual_memory()
print(svmem.available)
df_size = os.path.getsize(f_path)
print(df_size)

# read a 10-row sample to gauge per-row memory usage
df_sample = pd.read_csv(f_path, nrows=10)
df_sample_size = df_sample.memory_usage(index=True).sum
print(df_sample_size)

# try a large (but not full) read
df = pd.read_csv(f_path, sep=',', nrows=5000000)
print(df.head())

# chunked read: filter each chunk, then concatenate the results
df_chunk = pd.read_csv(f_path, header=0, chunksize=1000000)
chunk_list = []  # append each filtered chunk here
for chunk in tqdm(df_chunk):
    # perform data filtering
    chunk_filter = chunk_preprocessing(chunk)
    # once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)

df_concat = pd.concat(chunk_list)
print(chunk_list)
Output:
25124765696
3952642752
<bound method Series.sum of Index 64
safegraph_place_id 40
location_name 40
street_address 40
city 40
region 40
postal_code 80
brands 40
naics_code 80
date_range_start 40
date_range_end 40
raw_visit_counts 80
raw_visitor_counts 80
visits_by_day 40
visits_by_each_hour 40
visitor_home_cbgs 40
visitor_country_of_origin 40
distance_from_home 80
median_dwell 80
bucketed_dwell_times 40
related_same_day_brand 40
related_same_week_brand 40
device_type 40
iso_country_code 40
dtype: int64>
Process finished with exit code -1073741819 (0xC0000005)
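One note on the output above: the third print came out as a bound method rather than a number, so I clearly forgot the parentheses on sum(). What I was trying to do with the 10-row sample was a rough back-of-the-envelope estimate of how much RAM the full file would need, along these lines (the row count here is made up, since I can't count the rows without reading the whole file):

import pandas as pd

sample = pd.read_csv(r"..\..\..\Desktop\march_test.csv", nrows=10)
# deep=True so string/object columns are measured, not just their pointers
sample_bytes = sample.memory_usage(index=True, deep=True).sum()
bytes_per_row = sample_bytes / len(sample)
estimated_rows = 20_000_000  # made-up number, I don't know the real row count
print(bytes_per_row * estimated_rows, "bytes estimated for the whole file")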
Thanks in advance for any help!