I have a very large text file (around 80 GB). The file contains only numbers (integers and floats) and has 20 columns. I have to analyze each column; by "analyze" I mean some basic calculations on each column, like finding the mean, plotting histograms, checking whether a condition is satisfied, etc. I am reading the file like this:
import numpy as np

with open(filename) as original_file:
    all_rows = [[float(value) for value in line.split()] for line in original_file]
all_rows = np.asarray(all_rows)
After this I do all the analysis on specific columns. I run the program on a reasonably well-equipped server/workstation (32 GB RAM), but I am not able to finish the job: I waited almost a full day and the program was still running, so I eventually had to kill it manually. I know the script itself is correct, because I have tried the same script on smaller files (around 1 GB) and it worked nicely.
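For concreteness, the per-column work looks roughly like this (the column index and the condition here are just examples):

import numpy as np
import matplotlib.pyplot as plt

col = all_rows[:, 8]                      # 9th column, as an example
print("mean:", col.mean())                # basic statistic
plt.hist(col, 100)                        # histogram with 100 bins
print("any negative:", (col < 0).any())   # example of a condition check
plt.show()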
My initial guess is that it is a memory problem. Is there any way I can run such a job, with a different method or some other approach?
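For example, would a chunked read that accumulates the statistics incrementally be viable, so the whole file is never in memory at once? A rough, untested sketch of what I have in mind, assuming pandas is available (the chunk size, bin range, and column index are placeholders):

import numpy as np
import pandas as pd

n = 0
col_sums = np.zeros(20)
bins = np.linspace(-1000.0, 1000.0, 101)   # assumed data range; I would need the real min/max
hist9 = np.zeros(100, dtype=np.int64)      # running histogram of the 9th column

# read the file in chunks of one million rows instead of all at once
for chunk in pd.read_csv(filename, sep=r'\s+', header=None,
                         dtype=np.float64, chunksize=1_000_000):
    data = chunk.to_numpy()
    n += len(data)
    col_sums += data.sum(axis=0)
    counts, _ = np.histogram(data[:, 8], bins=bins)
    hist9 += counts

means = col_sums / n                       # per-column means over the whole file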
I also tried splitting the file into smaller pieces and then analyzing them individually in a loop, like this:
pre_name = "split_file"
for k in range(11): #There are 10 files with almost 8G each
filename = pre_name+str(k).zfill(3) #My files are in form "split_file000, split_file001 ..."
with open(filename) as original_file:
all_rows = [[float(digit) for digit in line.split()] for line in original_file]
all_rows = np.asarray(all_rows)
#Some analysis here
plt.hist(all_rows[:,8],100) #Plotting histogram for 9th Column
all_rows = None
I have tested the above code on a bunch of smaller files and it works fine. However, I ran into the same problem again when I used it on the big files. Any suggestions? Is there a cleaner way to do this?
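One more idea I considered, but have not tried on the full data: loading only the column I actually need, so that at most one of the 20 columns is in memory at a time. Something along these lines (untested; whether np.loadtxt is fast enough at this scale is exactly what I am unsure about):

import numpy as np
import matplotlib.pyplot as plt

# load just the 9th column of one split file instead of all 20 columns
col9 = np.loadtxt("split_file000", usecols=(8,))
plt.hist(col9, 100)
plt.show()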