I have code that reads a text file with headers and appends another file with the same headers to it. Because the main file is very large, I only want to read part of it to get the column headers. I get the error below if the only line in the file is the header, and I have no idea how many rows the file has. What I would like to achieve is to read the file and get its column headers. Because I want to append another file to it, I am trying to ensure that the columns are correct.

    import csv  # needed for csv.QUOTE_ALL below
    import pandas as pd
    main = pd.read_csv(main_input, nrows=1)
    data = pd.read_csv(file_input)
    data = data.reindex_axis(main.columns, axis=1)
    data.to_csv(main_input,
                quoting=csv.QUOTE_ALL,
                mode='a', header=False, index=False)

Here is the stack trace:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 420, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 221, in _read
        return parser.read(nrows)
      File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 626, in read
        ret = self._engine.read(nrows)
      File "C:\Users\gohm\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1070, in read
        data = self._reader.read(nrows)
      File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas\parser.c:7110)
      File "parser.pyx", line 774, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7671)
    StopIteration
Kenneth Goh
  • Can you provide the first couple of rows from an example main_input file and some details about the file_input? Do you get exactly the same behaviour when running _only_ the `main = pd.read_csv(main_input, nrows=1)` line? You might also try `pd.read_csv(main_input, nrows=1, header=None)`, as this will read *only* the header row rather than the header and the first row of data. – Laurence Billingham Aug 27 '14 at 06:51
  • You may well already know this, but using main as the name of a `DataFrame` is incompatible with the common `if __name__ == '__main__': main()` idiom for modules which is useful for re-use and unit-testing. – Laurence Billingham Aug 27 '14 at 09:23

1 Answer

It seems that the whole file may be read into memory. You can specify `chunksize=` in `read_csv(...)`, as discussed in the docs here.

I think that `read_csv`'s memory usage was overhauled in version 0.10, so your pandas version makes a difference too; see this answer from @WesMcKinney and the associated comments. The changes were also discussed a while ago on Wes' blog.

For example, to read only the first line:

    import pandas as pd
    from io import StringIO  # on Python 2: from cStringIO import StringIO

    csv_data = """\
    header, I want
    0.47094534,  0.40249001,
    0.45562164,  0.37275901,
    0.05431775,  0.69727892,
    0.24307614,  0.92250565,
    0.85728819,  0.31775839,
    0.61310243,  0.24324426,
    0.669575  ,  0.14386658,
    0.57515449,  0.68280618,
    0.58448533,  0.51793506,
    0.0791515 ,  0.33833041,
    0.34361147,  0.77419739,
    0.53552098,  0.47761297,
    0.3584255 ,  0.40719249,
    0.61492079,  0.44656684,
    0.77277236,  0.68667805,
    0.89155627,  0.88422355,
    0.00214914,  0.90743799
    """

    # chunksize=1 returns a reader that yields one row at a time;
    # header=None makes the header line itself the first row.
    tfr = pd.read_csv(StringIO(csv_data), header=None, chunksize=1)
    main = tfr.get_chunk()  # DataFrame holding only the header line
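
If the goal is only to check the column order before appending, one possible alternative to `chunksize` is to read the header line with the standard `csv` module and leave pandas to load the new file. The sketch below is not the approach described above; the helper names `read_header` and `append_with_matching_columns` and the sample data are made up for illustration, and it assumes the main file ends with a newline and that both files use a comma delimiter.

```python
import csv
import os
import tempfile

import pandas as pd


def read_header(path):
    """Return the column names from the first line of a CSV file.

    Only one line is read, so a file that contains nothing but the
    header is handled the same way as a very large one.
    """
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


def append_with_matching_columns(main_path, new_path):
    """Append new_path to main_path, using main_path's column order."""
    columns = read_header(main_path)
    data = pd.read_csv(new_path)
    missing = [name for name in columns if name not in data.columns]
    if missing:
        raise ValueError("columns missing from new file: %s" % missing)
    data = data[columns]  # reorder columns to match the main file
    data.to_csv(main_path, mode="a", header=False, index=False,
                quoting=csv.QUOTE_ALL)
    return columns


# Demo with throwaway files (hypothetical data).
workdir = tempfile.mkdtemp()
main_path = os.path.join(workdir, "main.csv")
new_path = os.path.join(workdir, "new.csv")
with open(main_path, "w", newline="") as handle:
    handle.write("a,b\n")        # header only, the case from the question
with open(new_path, "w", newline="") as handle:
    handle.write("b,a\n2,1\n")   # same columns in a different order

append_with_matching_columns(main_path, new_path)
result = pd.read_csv(main_path)
print(result)
```

Raising an error on missing columns is a design choice for this sketch; reindexing instead would silently fill absent columns with NaN.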