
I have to find a list of strings in a .txt file.

The file has 200k+ lines

This is my code:

with open(txtfile, 'rU') as csvfile:
    tp = pd.read_csv(csvfile, iterator=True, chunksize=6000, error_bad_lines=False,
                     header=None, skip_blank_lines=True, lineterminator="\n")
    for chunk in tp:
        if string_to_find in chunk:
            print "hurrà"

The problem is that with this code, only the first 9k lines are analyzed. Why?

  • Hope this helps: http://stackoverflow.com/questions/11622652/large-persistent-dataframe-in-pandas – BAE Nov 12 '15 at 14:46
  • Shouldn't it be `for chunk in pd.read_csv(csvfile, iterator=True, chunksize=6000, error_bad_lines=False, header=None, skip_blank_lines=True, lineterminator="\n"):`? – EdChum Nov 12 '15 at 14:46

1 Answer


Do you really need to open the file yourself before handing it to pandas? If it's an option, you can let pandas read the file directly and then concatenate the chunks.

To do that, call read_csv with a chunksize, concat the resulting chunks into one DataFrame, then loop through it.

import pandas as pd

# read_csv with iterator=True/chunksize returns a reader that
# yields one DataFrame per chunk of 6000 rows
df = pd.read_csv('data.csv', iterator=True, chunksize=6000, error_bad_lines=False,
                 header=None, skip_blank_lines=True)

# stitch all the chunks back together into a single DataFrame
df = pd.concat(df)

# start the for loop

What comes next depends on your for loop; for most tasks pandas has a vectorized function that makes an explicit loop unnecessary, and looping row by row is slow on data this large.
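For example, here is a minimal sketch of the membership test itself done without a Python loop (the file name and search string are placeholders, and exact cell matching is assumed; with header=None the first column is labelled 0):

import pandas as pd

df = pd.read_csv('data.csv', iterator=True, chunksize=6000, error_bad_lines=False,
                 header=None, skip_blank_lines=True)
df = pd.concat(df)

string_to_find = "some value"  # hypothetical search term

# vectorized exact match against every cell of the first column
if (df[0] == string_to_find).any():
    print("hurrà")

# for a substring match instead of an exact match, something like:
# df[0].str.contains(string_to_find, regex=False, na=False).any()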
