
I have a large data file with 7000 lines (not very large though!) which looks like this:

    # data can be obtained from pastebin
    # filename = input.csv
    # lots of comments
    #           wave           flux            err
            0.807172    7.61973e-11    1.18177e-13
            0.807375    7.58666e-11    1.18288e-13
            0.807577    7.62136e-11    1.18504e-13
             0.80778    7.64491e-11    1.19389e-13
            0.807982    7.62858e-11    1.18685e-13
            0.808185    7.63852e-11    1.19324e-13
            0.808387    7.60547e-11    1.18952e-13
             0.80859    7.52287e-11    1.18016e-13
            0.808792    7.53114e-11    1.18979e-13
            0.808995    7.58247e-11    1.20198e-13
    # lots of other lines

Link to the input data: http://pastebin.com/KCW9phzX

I want to extract the rows whose wavelength lies between 0.807375 and 0.807982, so that the output looks like this:

    # filename = output.csv
    0.807375    7.58666e-11    1.18288e-13
    0.807577    7.62136e-11    1.18504e-13
    0.80778     7.64491e-11    1.19389e-13
    0.807982    7.62858e-11    1.18685e-13

Here are some similar links:

https://stackoverflow.com/questions/8956832/python-out-of-memory-on-large-csv-file-numpy/8964779#=
efficient way to extract few lines of data from a large csv data file in python
What is the most efficient way to match list items to lines in a large file in Python?
Extract specific lines from file and create sections of data in python
how to extract elements from a list in python?
How to use numpy.genfromtxt when first column is string and the remaining columns are numbers?
genfromtxt and numpy

BhishanPoudel
  • If you are comfortable with numpy/pandas, you could take a look at Dask which is part of the newer Blaze project - it's specifically designed to work with datasets that don't fit into memory. http://matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2 – Jon Jun 04 '16 at 18:32
  • How large is the file? How many lines? Do you want a ```numpy``` solution or a non-```numpy``` solution? – wwii Jun 04 '16 at 18:32
  • Parse as few lines of the file as possible while using binary search to find the range of lines you want. Numpy won't help you much here (see the sketch after these comments). – Quinchilion Jun 04 '16 at 18:35
  • @wwii Right now the file is 6700 lines, but I have to deal with astronomical data (I am an astronomy student), so both numpy and non-numpy answers are welcome! – BhishanPoudel Jun 04 '16 at 19:32
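
A minimal sketch of the binary-search idea from the comment above, not taken from any of the answers below. It assumes the data lines are sorted by wavelength (as in the sample), that comment lines start with '#', and the helper name offset_of_first_at_or_above is made up for illustration:

    import numpy as np

    def offset_of_first_at_or_above(f, target):
        # Byte-offset binary search over a text file whose data lines are sorted
        # by their first column. Returns an offset such that, after seeking there
        # and discarding the (possibly partial) line containing it, every remaining
        # data line has first column >= target. Only O(log n) lines are ever parsed.
        f.seek(0, 2)                      # jump to the end of the file
        lo, hi = 0, f.tell()
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            if mid:
                f.readline()              # drop the partial line we landed inside
            line = f.readline()
            while line and (not line.strip() or line.lstrip().startswith(b'#')):
                line = f.readline()       # skip blank and comment lines
            if line and float(line.split()[0]) < target:
                lo = mid + 1              # first qualifying line lies to the right
            else:
                hi = mid                  # it lies at or before mid (or we hit EOF)
        return lo

    lower, upper = 0.807375, 0.807982
    rows = []
    with open('input.csv', 'rb') as f:
        start = offset_of_first_at_or_above(f, lower)
        f.seek(start)
        if start:
            f.readline()                  # discard the partial line at the seek point
        for line in f:
            if not line.strip() or line.lstrip().startswith(b'#'):
                continue
            wave, flux, err = (float(x) for x in line.split())
            if wave > upper:
                break                     # the file is sorted, nothing further can match
            if wave >= lower:
                rows.append((wave, flux, err))

    np.savetxt('output.csv', np.array(rows), fmt='%g')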

3 Answers

4

You could call np.genfromtxt(f, max_rows=chunksize) in a loop to read the file in chunks. This way you can retain the convenience and speed of NumPy arrays while controlling the amount of memory required by adjusting chunksize.

import numpy as np
import warnings
# genfromtxt warns if it encounters an empty file. Silence that warning, since
# the code below handles the end-of-file case itself.
warnings.filterwarnings("ignore", message='genfromtxt', category=UserWarning)

# This reads 2 lines at a time
chunksize = 2
with open('data', 'rb') as fin, open('out.csv', 'w+b') as fout:
    while True:
        arr = np.genfromtxt(fin, max_rows=chunksize, usecols=(0,1,2), 
                            delimiter='', dtype=float)
        if not arr.any(): break
        arr = np.atleast_2d(arr)
        mask = (arr[:, 0] >= 0.807375) & (arr[:, 0] <= 0.807982)
        arr = arr[mask]

        # uncomment this print statement to confirm the file is being read in chunks
        # print('{}\n{}'.format(arr, '-'*80))
        np.savetxt(fout, arr, fmt='%g')

writes to out.csv:

0.807375 7.58666e-11 1.18288e-13
0.807577 7.62136e-11 1.18504e-13
0.80778 7.64491e-11 1.19389e-13
0.807982 7.62858e-11 1.18685e-13

For a large data file you'll of course want to increase chunksize to some integer much larger than 2. Generally you'll get the best performance by choosing chunksize as large as possible while still operating on arrays that fit in RAM.
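
As a rough illustration of that sizing rule (the 3 GB budget below is an assumed number for a machine with 6 GB of RAM, echoing the arithmetic in the comments; it is not part of the answer's code):

    # Each row of the array holds 3 float64 values = 24 bytes.
    ram_budget_bytes = 3 * 10**9                 # assume roughly half of a 6 GB machine
    ncols, itemsize = 3, 8
    chunksize = ram_budget_bytes // (ncols * itemsize)
    print(chunksize)                             # 125000000 rows per chunk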


The code above is intended for large files. For a file with only 7000 lines,

import numpy as np
with open('data', 'rb') as fin, open('out.csv', 'w+b') as fout:
    arr = np.genfromtxt(fin, usecols=(0,1,2), delimiter='', dtype=float)
    mask = (arr[:, 0] >= 0.807375) & (arr[:, 0] <= 0.807982)
    arr = arr[mask]
    np.savetxt(fout, arr, fmt='%g')

suffices.

unutbu
  • @Merlin: Doh! thanks for the correction :) -- I added `arr = arr[mask]`. – unutbu Jun 04 '16 at 18:48
  • @unutbu, I got this error: Traceback (most recent call last): File "read_large_file.py", line 24, in mask = ((arr[:, 0] >= 0.807375) & (arr[:, 0] <= 0.807982)) IndexError: too many indices for array – BhishanPoudel Jun 04 '16 at 19:40
  • You might be getting that error if you added `unpack=True` to the call to `np.genfromtxt` since in that case you would get three 1D arrays, `col0`, `col1`, `col2` instead of one 2D array, `arr`. In that case, use `mask = ((col0 >= 0.807375) & (col0 <= 0.807982))`. – unutbu Jun 04 '16 at 20:12
  • @unutbu I tried your solution with the data from pastebin (I edited the question and gave the link) but it showed Error: too many indices for array! – BhishanPoudel Jun 04 '16 at 20:28
  • @BhishanPoudel: Ah, you found a bug. If there is only one row left, then `np.genfromtxt` returns a 1D array. Since the data file has an odd number of rows, on the very last iteration, `arr` gets assigned to a 1-dimensional array and `arr[:, 0]` then raises `IndexError: too many indices for array`. To fix this, I added `arr = np.atleast_2d(arr)` to ensure `arr` is always 2-dimensional (see the small demonstration after these comments). – unutbu Jun 04 '16 at 20:41
  • @unutbu A million thanks to you, now it works with flying colors! – BhishanPoudel Jun 04 '16 at 20:51
  • @unutbu I have 6 GB RAM on my laptop (with Ubuntu 15.1) and the data file has about 7000 lines; what would be a more suitable chunksize than 2? Any suggestions? – BhishanPoudel Jun 04 '16 at 20:53
  • If `arr` has dtype `float64` -- i.e. 8-byte floats -- then you could (in theory) fit an array of shape (250000000, 3) in RAM, (since 6*10**9 / (3*8) = 250 million). This corresponds to a file with 250 million lines of data. Of course some memory is being used by the OS for other purposes, and many NumPy functions require enough space to make at least one copy of the array so in practice it is only practicable to load an array that is half as large (or less). So `chunksize` could be on the order of `125*10**6`. – unutbu Jun 04 '16 at 21:06
  • @BhishanPoudel, on chunksize: do nothing, leave the flag off and let numpy handle it. FYI: 7000 rows is not large! I am dealing with a 3-million-row csv file that's 60 columns wide and I leave it alone (on 16 GB boxes). You have 6 GB; the file can't be more than 1 MB. Leave it alone. You gave the impression that the file was large; it is not. – Merlin Jun 04 '16 at 21:14
  • @Merlin: You put your hand over the part you don't like and click the button ;) – unutbu Jun 04 '16 at 21:43
  • @Merlin, your attempt still seems to have a bug: raise ValueError("Usecols do not match names.") ValueError: Usecols do not match names. – BhishanPoudel Jun 06 '16 at 12:58
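
To see the one-row behaviour described in the comments above, here is a quick standalone check (io.BytesIO simply stands in for a file with a single data line):

    import io
    import numpy as np

    one_row = np.genfromtxt(io.BytesIO(b'0.807375 7.58666e-11 1.18288e-13'))
    print(one_row.shape)                    # (3,)   -- 1-D, so one_row[:, 0] raises IndexError
    print(np.atleast_2d(one_row).shape)     # (1, 3) -- columns can now be indexed safely
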
1

Try this:

import pandas as pd 

df         = pd.read_csv('large_data.csv', usecols=(0,1,2), skiprows=57)
df.columns = [ 'wave', 'flux' , 'err']
df         = df[(df['wave'] >=  0.807375) & (df['wave'] <=  0.807982) ]
print df 

     wave           flux              err
1   0.807375    7.586660e-11    1.182880e-13
2   0.807577    7.621360e-11    1.185040e-13
3   0.807780    7.644910e-11    1.193890e-13
4   0.807982    7.628580e-11    1.186850e-13

Since you have some lines of unneeded text at the top of the file, you can use the `skiprows` flag on import. Also, pandas is built on top of numpy, so `read_csv` has a `chunksize` flag as well.
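
For completeness, a hedged sketch of that chunksize idea; this is not the code from this answer, and the sep=r'\s+', comment='#' and chunksize=1000 arguments are assumptions about the whitespace-delimited pastebin file:

    import pandas as pd

    lower, upper = 0.807375, 0.807982
    pieces = []
    # chunksize makes read_csv return an iterator of DataFrames instead of one frame
    for chunk in pd.read_csv('input.csv', sep=r'\s+', comment='#',
                             names=['wave', 'flux', 'err'], chunksize=1000):
        in_range = (chunk['wave'] >= lower) & (chunk['wave'] <= upper)
        pieces.append(chunk[in_range])

    pd.concat(pieces).to_csv('output.csv', sep=' ', header=False, index=False)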

Merlin
  • I have pandas installed for Python 2, but I got this error: AttributeError: 'module' object has no attribute 'read' – BhishanPoudel Jun 04 '16 at 19:52
  • Just edited the above; it should get you started. 7000 rows isn't that big. A rule of thumb that works for me: size on disk, doubled, is the size in memory. And don't change chunksize; pandas/Python is pretty efficient. I have wasted a lot of time trying, never really fruitful. – Merlin Jun 04 '16 at 20:56
  • @BhishanPoudel Please use the data from pastebin and try to eliminate the error! I am not a hired dev for you, just trying to help you out. – Merlin Jun 04 '16 at 21:02
  • You need to learn how to debug the code and reuse it elsewhere. Walk through the code; it's pretty simple. Use print statements. Stop using "!", it suggests action is needed. – Merlin Jun 04 '16 at 21:16
  • The pandas method looks short, easy and interesting; however, it throws the following error: raise ValueError("Usecols do not match names.") ValueError: Usecols do not match names. – BhishanPoudel Jun 06 '16 at 13:15
0

Reading the answers by @unutbu and @Merlin, the following could also be a good solution.

Note: The answer given by @unutbu works absolutely fine.

The answer given by @Merlin does not work as posted and is incomplete, but it is a good template to start with.

Note: the input file input.csv can be obtained from pastebin:
http://pastebin.com/KCW9phzX

Using numpy:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author    : Bhishan Poudel
# Date      : May 23, 2016


# Imports
import numpy as np


# using numpy
infile = 'input.csv'
outfile = 'output.csv'
lower_value = 0.807375
upper_value = 0.807982

print('Reading file    : {}'.format(infile))
print('Writing to file : {}'.format(outfile))

with open(infile, 'rb') as fin, open(outfile, 'w+b') as fout:
    arr = np.genfromtxt(fin, usecols=(0,1,2), delimiter='', dtype=float)
    mask = (arr[:, 0] >= lower_value) & (arr[:, 0] <= upper_value )
    arr = arr[mask]
    np.savetxt(fout, arr, fmt='%g')

Using pandas:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author    : Bhishan Poudel
# Date      : May 23, 2016


# Imports
import pandas as pd


# extract range
infile = 'input.csv'
outfile = 'output.csv'
lower_value = 0.807375
upper_value = 0.807982


print('Reading file      : {}'.format(infile))
print('Writing to a file : {}'.format(outfile))
# header=None keeps the first data row from being consumed as column names
df         = pd.read_csv(infile, usecols=(0,1,2), skiprows=57, header=None, sep=r'\s+')
df.columns = [ 'col0', 'col1' , 'col2']
df         = df[(df['col0'] >=  lower_value) & (df['col0'] <=  upper_value) ]
df.to_csv(outfile, header=None, index=None, mode='w', sep=' ')
BhishanPoudel