
I have a very large text file (around 80 GB). The file contains only numbers (integers and floats) and has 20 columns. Now I have to analyze each column. By analyze I mean I have to do some basic calculations on each column, like finding the mean, plotting histograms, checking whether a condition is satisfied, etc. I am reading the file like this:

    import numpy as np

    with open(filename) as original_file:
        all_rows = [[float(digit) for digit in line.split()] for line in original_file]
    all_rows = np.asarray(all_rows)

After this I do all the analysis on specific columns. I use a 'good' configuration server/workstation (with 32 GB RAM) to execute my program. The problem is that I am not able to finish the job: I waited almost a day, but the program was still running and I had to kill it manually. I know my script is correct, without any errors, because I have tried the same script on smaller files (around 1 GB) and it worked nicely.

My initial guess is that it is some kind of memory problem. Is there any way I can run such a job? Some different method, or some other way?

I tried splitting the file into smaller files and then analyzing them individually in a loop, like this:

    import numpy as np
    import matplotlib.pyplot as plt

    pre_name = "split_file"
    for k in range(10):  # there are 10 files of almost 8 GB each
        filename = pre_name + str(k).zfill(3)  # files are named "split_file000", "split_file001", ...
        with open(filename) as original_file:
            all_rows = [[float(digit) for digit in line.split()] for line in original_file]
        all_rows = np.asarray(all_rows)
        # Some analysis here
        plt.hist(all_rows[:, 8], 100)  # plotting a histogram for the 9th column
    all_rows = None  # drop the reference so the memory can be freed

I have tested the above code on a bunch of smaller files and it works fine. However, I ran into the same problem again when I used it on the big files. Any suggestions? Is there any other, cleaner way to do this?

asked by Dexter, edited by moooeeeep
  • If you only need a `hist`, there is no need to keep all of the `80G` of data in memory. – luoluo Sep 30 '15 at 08:20
  • `hist` is just an example. I have to do other calculations as well, like finding the mean, searching for a specific value/condition, etc. – Dexter Sep 30 '15 at 08:22
  • Well, what I want to say is that you should figure out whether you actually need to keep all of the `80G` of data in memory to do those kinds of calculations. – luoluo Sep 30 '15 at 08:25
  • Most of the time I need to check whether a value is above or below a threshold and produce a `hist` or `scatter` with two or more columns. I don't know whether this requires all 80G or not. However, even if I don't want to keep 80G in memory, how can I achieve this? – Dexter Sep 30 '15 at 08:27
  • Take `hist` as an example: you only need to care about the number of occurrences of each value. So there is no need to store every copy of a given value in memory; you just need one copy and its count. That's just one example. – luoluo Sep 30 '15 at 08:33
  • You would probably get quite far using some basic unix tools (`split`, `cut`) and numpy's direct input routines ([loadtxt](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html)) to load the data, instead of creating lists of lists. With numpy, you can calculate the memory requirements as 'number of lines times number of columns times size of float'. – liborm Sep 30 '15 at 13:24
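
As a rough sketch of that last suggestion (the helper function and the split-file name below are made up for illustration, not from the thread), the memory estimate and a single-column load could look like:

    import numpy as np

    def estimate_bytes(n_rows, n_cols=20, itemsize=8):
        """Memory needed to hold the full table as float64: rows * columns * bytes per value."""
        return n_rows * n_cols * itemsize

    # Load only the 9th column (index 8) of one split file instead of building lists of lists.
    col9 = np.loadtxt("split_file000", usecols=(8,))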

2 Answers


For such lengthy operations on data that do not fit in memory, it can be useful to use a library like dask (http://dask.pydata.org/en/latest/), in particular `dask.dataframe.read_csv` to read the data, and then perform your operations as you would with the pandas library (another useful package worth mentioning).
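
A minimal sketch of what that could look like (the file name, threshold, and use of integer column labels are illustrative assumptions, not part of the answer):

    import dask.dataframe as dd

    # header=None gives integer column labels 0..19; "data.txt" and 100.0 are placeholders.
    df = dd.read_csv('data.txt', sep=r'\s+', header=None)

    mean_col9 = df[8].mean().compute()          # mean of the 9th column
    n_above = (df[8] > 100.0).sum().compute()   # how many values exceed the threshold

Calling `compute()` triggers the actual chunked read, so the 80 GB file never has to fit in RAM at once.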

answered by honza_p
  • Yes, a few of my friends suggested the pandas library. I tried following [this](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) post. That analysis is in progress; it is not successful yet, there are some bugs in the code. I didn't know about `dask`, though. I will check it out. Thanks. – Dexter Sep 30 '15 at 08:31
  • Dask uses some "magic" to operate on pandas data in chunks while providing the same API. However, not all methods have been ported yet, so be aware! – honza_p Sep 30 '15 at 08:33
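
For comparison, a hedged sketch of the plain-pandas chunked workflow mentioned in these comments (file name, chunk size, and the running-mean bookkeeping are assumptions for illustration):

    import pandas as pd

    total, count = 0.0, 0
    # header=None gives integer column labels 0..19; the chunk size is arbitrary.
    for chunk in pd.read_csv("data.txt", sep=r'\s+', header=None, chunksize=10**6):
        col = chunk[8]               # 9th column
        total += col.sum()
        count += len(col)
    mean_col9 = total / count        # mean computed without holding the whole file in memory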

Two alternatives come to my mind:

  • You should consider performing your computation with online algorithms

    In computer science, an online algorithm is one that can process its input piece-by-piece in a serial fashion, i.e., in the order that the input is fed to the algorithm, without having the entire input available from the start.

    It is possible to compute the mean, the variance, and a histogram with pre-specified bins in this way with constant memory complexity (a short sketch follows this list).

  • You should throw your data into a proper database and make use of that database system's statistical and data handling capabilities, e.g., aggregate functions and indexes.

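A minimal sketch of the first suggestion for a single column (Welford's running mean/variance plus a fixed-bin histogram; the column index and bin edges below are placeholder assumptions):

    import numpy as np

    def column_stats(filename, col=8, bin_edges=None):
        """One pass over the file: running mean/variance (Welford) and a fixed-bin histogram."""
        if bin_edges is None:
            bin_edges = np.linspace(0.0, 1000.0, 101)  # arbitrary bins; must be chosen up front
        count, mean, m2 = 0, 0.0, 0.0
        hist = np.zeros(len(bin_edges) - 1, dtype=np.int64)
        with open(filename) as f:
            for line in f:
                x = float(line.split()[col])
                count += 1
                delta = x - mean
                mean += delta / count
                m2 += delta * (x - mean)                        # Welford's update
                idx = np.searchsorted(bin_edges, x, side='right') - 1
                if 0 <= idx < len(hist):                        # ignore values outside the bins
                    hist[idx] += 1
        variance = m2 / (count - 1) if count > 1 else 0.0
        return mean, variance, hist

Each line of the file is touched exactly once, so memory use stays constant regardless of file size.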

answered by moooeeeep