
I am creating a number of pandas DataFrames from a CSV file, each in excess of 50k lines, with 45 fields per line. In the process, I occasionally come across a line with more than 45 fields. To use the data, the only option I have found is to skip those lines with error_bad_lines=False, i.e.

devdata = pd.read_csv(devfile, sep="|", error_bad_lines=False, names=devcolnames, usecols=[0, 5, 6, 8, 25])

I am only interested in five fields, the last of which is column 25, which is not affected by the extra length of the problem lines. Is there anything I can do with a pandas DataFrame to read in even those bad lines, or must I resort to a list?

Thanks in advance!

Edit after Dan's assistance:

One thing that I found after experimenting with Dan's direction: if you use the iterator/chunk method from this post,

large persistent dataframe in pandas

but also want to use usecols (in my case because of memory concerns), the columns can instead be selected in the pd.concat line:

txtdata = pd.concat([chunk[txtcolnames] for chunk in tdata1], ignore_index=True)
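
For reference, a minimal sketch of the full approach; the file name, chunk size, and column names below are placeholders for my real ones, and error_bad_lines is the keyword from the pandas releases of this era (newer versions use on_bad_lines instead):

import pandas as pd

# placeholder names -- substitute the real 45-field list and the five fields wanted
devcolnames = ["col%d" % i for i in range(45)]
txtcolnames = ["col0", "col5", "col6", "col8", "col25"]

# read the pipe-delimited file in chunks to limit memory use;
# too-long lines are still skipped here, as in the original call
tdata1 = pd.read_csv("devfile.txt", sep="|", names=devcolnames,
                     error_bad_lines=False, chunksize=10000)

# keep only the wanted columns while concatenating the chunks
txtdata = pd.concat([chunk[txtcolnames] for chunk in tdata1], ignore_index=True)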
  • Just to elaborate on the answer I linked, you should determine the length of the longest line. (Running read_csv with ``warn_bad_lines=True`` would tell you.) Then provide ``read_csv`` with enough column names to cover the longest line, so it knows how many columns to expect. Lines that come up short (e.g., 45) will fill in NaNs. Errors only come from overly *long* lines, of which there will now be none. – Dan Allan Sep 09 '13 at 22:36
  • Dan-- thanks for your prompt, usable, and clear response! – M Hernandez Sep 10 '13 at 16:38
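
To make Dan's suggestion concrete, a minimal sketch, assuming the longest bad line has 50 fields (the file name, column names, and the padded width of 50 are placeholders): with enough names supplied, no line is too long, short lines are padded with NaN, and usecols still picks out the five fields of interest.

import pandas as pd

# 45 real column names plus padding out to the longest observed line (assumed 50 fields here)
devcolnames = ["col%d" % i for i in range(45)]
padded_names = devcolnames + ["extra%d" % i for i in range(5)]

# no lines are skipped: long lines now fit, and short lines fill the extra columns with NaN
devdata = pd.read_csv("devfile.txt", sep="|", names=padded_names,
                      usecols=[0, 5, 6, 8, 25])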
