
With Python's readlines() method I can retrieve a list of all the lines in a file:

with open('dat.csv', 'r') as dat:
    lines = dat.readlines()

I am working on a problem involving a very large file, and this method produces a memory error. Is there a pandas equivalent to Python's readlines() method? The chunksize option of pd.read_csv() seems to prepend index numbers to my lines, which is far from ideal.

Minimal example:

In [1]: lines = []

In [2]: for df in pd.read_csv('s.csv', chunksize = 100):
   ...:     lines.append(df)
In [3]: lines
Out[3]: 
[   hello here is a line
 0  here is another line
 1  here is my last line]

In [4]: with open('s.csv', 'r') as dat:
   ...:     lines = dat.readlines()
   ...:     

In [5]: lines
Out[5]: ['hello here is a line\n', 'here is another line\n', 'here is my last line\n']

In [6]: cat s.csv
hello here is a line
here is another line
here is my last line
    `pd.read_csv('dat.csv')`? – Woody Pride Mar 15 '16 at 19:46
  • read_csv returns a data frame separated by columns, not an array of the lines in the file – kilojoules Mar 15 '16 at 19:51
    Possible duplicate of [pandas read csv file line by line](http://stackoverflow.com/questions/29334463/pandas-read-csv-file-line-by-line) – Munir Mar 15 '16 at 19:53
  • you can pass a `chunksize` param, this will return an iterable chunk of the file read into a df – EdChum Mar 15 '16 at 19:56
  • @munircontractor I included an example showing how these questions are different – kilojoules Mar 15 '16 at 19:56
    you can append `df.values.tolist()` but why would you do this? – EdChum Mar 15 '16 at 20:04
  • My file is not a csv file - it doesn't make sense to store it as a dataframe. I am looking for a very specific pattern, line by line. – kilojoules Mar 15 '16 at 20:05
    So why are you using pandas then, there is no performance gain with what you're intending to do by using pandas – EdChum Mar 15 '16 at 20:07
  • I am trying to use memory more efficiently in the context of an mpi-related issue https://stackoverflow.com/questions/36019498/mpi4py-comm-bcast-causes-memory-error-for-large-objects – kilojoules Mar 15 '16 at 20:10
    So how does this approach solve this problem? Why does reading a single line at a time improve performance as opposed to reading N lines processing them and then reading a further N lines? – EdChum Mar 15 '16 at 20:17
  • Doesn't pandas read the csv using precompiled code, decreasing the memory used? – kilojoules Mar 15 '16 at 20:18
    You're still reading text into some kind of memory structure, you're not intending to use dataframes anyway, just making a list of strings so where does pandas get involved in that step? – EdChum Mar 15 '16 at 20:21
  • My reasoning was that pandas should be able to capitalize on the mpi communicator and would use memory more intelligently. I do not have proof that this will work, but this strategy seemed like it could work. However, I am very open to other solutions :) – kilojoules Mar 15 '16 at 20:22
  • Your question as stated mentioned nothing about mpi, nor what you intended to achieve, nor proof that using pandas to simply produce a list of strings helps in any way – EdChum Mar 15 '16 at 20:25
  • @EdChum I think I see what you're saying. If I use the `chunksize` parameter will my memory be used more efficiently? – kilojoules Mar 15 '16 at 20:35
    it will read N lines into a df, you can then process that df, the additional numbers you see are the index values, you can ignore these if required using `df.values.tolist()` – EdChum Mar 15 '16 at 20:38
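The workaround EdChum describes in the comments (chunked reading plus `df.values.tolist()` to discard the index numbers) can be sketched like this. This is a minimal sketch, using an in-memory `StringIO` with made-up contents in place of the large file:

```python
import io

import pandas as pd

# Stand-in for the large file on disk (hypothetical contents).
text = "hello here is a line\nhere is another line\nhere is my last line\n"

lines = []
# header=None stops the first line being consumed as column names;
# chunksize=2 makes read_csv yield a 2-row DataFrame per iteration.
for chunk in pd.read_csv(io.StringIO(text), header=None, chunksize=2):
    # values.tolist() returns plain nested lists, dropping the index
    # numbers that chunked reading displays; row[0] is the line text.
    lines.extend(row[0] for row in chunk.values.tolist())

print(lines)
# ['hello here is a line', 'here is another line', 'here is my last line']
```

Only one chunk's worth of rows is held as a DataFrame at a time, though the accumulated `lines` list still grows with the file.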

1 Answer


You should try to use the chunksize option of pd.read_csv(), as mentioned in some of the comments.

This forces pd.read_csv() to read in a defined number of lines at a time, instead of trying to read the entire file in one go. It would look like this:

>>> df = pd.read_csv(filepath, chunksize=1, header=None, encoding='utf-8')

In the above example the file will be read line by line.

In fact, according to the documentation of pandas.read_csv, what is returned here is not a pandas.DataFrame object but a TextFileReader object instead:

  • chunksize : int, default None

Return TextFileReader object for iteration. See IO Tools docs for more information on iterator and chunksize.

Therefore, in order to complete the exercise, you would need to put this in a loop like this:

In [385]: cat data_sample.tsv
This is a new line
This is another line of text
And this is the last line of text in this file

In [386]: lines = []

In [387]: for line in pd.read_csv('./data_sample.tsv', encoding='utf-8', header=None, chunksize=1):
   .....:     lines.append(line.iloc[0, 0])
   .....:

In [388]: print(lines)
['This is a new line', 'This is another line of text', 'And this is the last line of text in this file']
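That said, if (as the comments note) the goal is just a pattern search over plain strings rather than DataFrames, iterating the open file object directly is the standard memory-efficient alternative. A sketch, using a small temporary stand-in file and a hypothetical `'another'` substring in place of the real pattern:

```python
import os
import tempfile

# A small stand-in file; the real file would be too large for readlines().
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('hello here is a line\nhere is another line\nhere is my last line\n')
    path = f.name

matches = []
# Iterating the file object streams one line at a time, so memory use
# stays constant regardless of the file's size.
with open(path, 'r') as dat:
    for line in dat:
        if 'another' in line:  # stand-in for the real line-by-line pattern check
            matches.append(line.rstrip('\n'))

os.remove(path)
print(matches)
# ['here is another line']
```

Unlike readlines(), this never materialises the whole file as a list, which is exactly what avoids the memory error in the question.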

I hope this helps!
