
I'm dealing with large files that don't fit in memory, so I'm using the chunked iterator that pandas' read_csv provides and processing a single chunk at a time.

import pandas as pd

reader = pd.read_csv(csv_file_name, encoding='utf-8', chunksize=chunk_size,
                     iterator=True, engine='c', error_bad_lines=False,
                     low_memory=False)

While processing I'd like to print the number of rows processed so far and what percentage of the total row count that is.
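Roughly, the reporting I have in mind looks like this (just a sketch; total_rows is the number I don't know how to get without a full pass over the file):

processed = 0
for chunk in reader:
    # ... process the chunk ...
    processed += len(chunk)
    print('%d rows processed (%.1f%%)' % (processed, 100.0 * processed / total_rows))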

To get the total number of rows in a pandas DataFrame I'm using

len(df.index)

But when I try that on the iterator I get

AttributeError: 'TextFileReader' object has no attribute 'index'

Is there any way to do that, without iterating over every chunk first?

Lior Magen
    You won't know about bad lines until you process the chunk, so at best you're only going to get an estimate of the final total. If an estimate is good enough, you might as well just count the number of lines in the CSV; see https://stackoverflow.com/q/41553467/2750819 if you need help with that. – Kent Shikama Oct 28 '19 at 10:56

1 Answer


Two possible workarounds I would use:

  1. Use the usecols option and read the file in with just one column. The result may be small enough to read in one go, but if not, iterate over it to count the number of rows (see the sketch after the code below).

  2. Use the Linux command wc -l to count the number of lines. If the file has a header you need to subtract one from the count, e.g.

import subprocess

# capture_output/text are needed so wc_output.stdout is a string
wc_output = subprocess.run(['wc', '-l', csv_file_name],
                           capture_output=True, text=True)
# wc_output.stdout will be of format ' N_lines filename'
# subtract 1 to remove the header
n_rows = int(wc_output.stdout.split()[0]) - 1
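For the first option, a minimal sketch (assuming the csv_file_name and chunk_size variables from the question; usecols=[0] keeps only the first column, so each chunk stays small):

import pandas as pd

# count data rows by streaming a single column; pandas excludes the header automatically
n_rows = 0
for chunk in pd.read_csv(csv_file_name, usecols=[0], chunksize=chunk_size):
    n_rows += len(chunk)

With n_rows known up front, you can print the running percentage while consuming the real iterator, as in the loop sketched in the question.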
Robert King
    Kent Shikama's comment has a link to a question with some better suggestions than mine :-) I have upvoted his comment. – Robert King Oct 28 '19 at 10:59