I am working with a dataset that is roughly a 6 GB CSV file. Loading it straight into pandas in one go obviously throws memory errors, so I tried to process it with a chunksize of 1000000 instead, but then the run just ends with "Process finished with exit code -1073741819 (0xC0000005)" (which I understand is a Windows access-violation error) whenever I try to access any of the information.
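The basic pattern I'm trying to follow is the standard chunked read: filter each chunk, keep the filtered pieces, and concatenate them at the end. Stripped down to a minimal sketch (the filter here is just a dummy stand-in for my real preprocessing; "raw_visit_counts" is one of the columns in my file), it would be something like:

import pandas as pd
from tqdm import tqdm

CHUNK_SIZE = 1_000_000

def dummy_filter(chunk):
    # stand-in for my real per-chunk preprocessing:
    # keep only rows that actually have a visit count
    return chunk[chunk["raw_visit_counts"].notna()]

filtered_chunks = []
for chunk in tqdm(pd.read_csv("march_test.csv", chunksize=CHUNK_SIZE)):
    filtered_chunks.append(dummy_filter(chunk))

result = pd.concat(filtered_chunks, ignore_index=True)
print(result.shape)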
I will post the code I have below:
Code:
import pandas as pd
import numpy as np
import os
import dask.dataframe as dd  # not used yet, imported while experimenting
import psutil
from tqdm import tqdm

f_path = r"..\..\..\Desktop\march_test.csv"

# check available RAM against the file's size on disk
svmem = psutil.virtual_memory()
print(svmem.available)
df_size = os.path.getsize(f_path)
print(df_size)

# read a 10-row sample to gauge per-row memory usage
df_sample = pd.read_csv(f_path, nrows=10)
df_sample_size = df_sample.memory_usage(index=True).sum
print(df_sample_size)

# try a large (but not full) read
df = pd.read_csv(f_path, sep=',', nrows=5000000)
print(df.head())

# chunked read: filter each chunk, then concatenate the results
df_chunk = pd.read_csv(f_path, header=0, chunksize=1000000)
chunk_list = []  # append each filtered chunk here
for chunk in tqdm(df_chunk):
    # perform data filtering
    chunk_filter = chunk_preprocessing(chunk)
    # once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk_filter)

df_concat = pd.concat(chunk_list)
print(chunk_list)
Output:
25124765696
3952642752
<bound method Series.sum of Index 64
safegraph_place_id 40
location_name 40
street_address 40
city 40
region 40
postal_code 80
brands 40
naics_code 80
date_range_start 40
date_range_end 40
raw_visit_counts 80
raw_visitor_counts 80
visits_by_day 40
visits_by_each_hour 40
visitor_home_cbgs 40
visitor_country_of_origin 40
distance_from_home 80
median_dwell 80
bucketed_dwell_times 40
related_same_day_brand 40
related_same_week_brand 40
device_type 40
iso_country_code 40
dtype: int64>
Process finished with exit code -1073741819 (0xC0000005)
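One note on the output above: the third print came out as a bound method rather than a number, so I clearly forgot the parentheses on sum(). What I was trying to do with the 10-row sample was a rough back-of-the-envelope estimate of how much RAM the full file would need, along these lines (the row count here is made up, since I can't count the rows without reading the whole file):

import pandas as pd

sample = pd.read_csv(r"..\..\..\Desktop\march_test.csv", nrows=10)
# deep=True so string/object columns are measured, not just their pointers
sample_bytes = sample.memory_usage(index=True, deep=True).sum()
bytes_per_row = sample_bytes / len(sample)
estimated_rows = 20_000_000  # made-up number, I don't know the real row count
print(bytes_per_row * estimated_rows, "bytes estimated for the whole file")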
Thanks in advance for any help!