I have a CSV file with no header row, and its records vary in length from line to line.

Each record can have up to 398 fields, but I want to keep only 256 fields in my dataframe, since those are the only ones I need to process.

Below is a slim version of the file.

1,2,3,4,5,6
12,34,45,65
34,34,24

In the example above, I would like to keep only 3 fields from each line (analogous to the 256 above) when calling read_csv.

I tried the following:

import pandas as pd
df = pd.read_csv('sample.csv',header=None)

I get the following error, because pandas uses the first line to work out the number of columns.

  File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 4, saw 10

The only solution I can think of is using

names = ['column1','column2','column3','column4','column5','column6']

while creating the data frame.

But the real files can be up to 50 MB, and I don't want to do that because it takes a lot of memory. I am running this on AWS Lambda, where that will incur more cost, and I have to process a large number of files daily.

My question is: can I create a dataframe with only the slimmer set of 256 fields directly while reading the CSV? Can that be my step one?

I am very new to pandas, so kindly bear with my ignorance. I looked for a solution for a long time but could not find one.

  • Try using `usecols` (read more in the [docs](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)). But since CSVs are just text files, pandas still has to load and read the full file to identify the columns; `usecols` only controls what is parsed into the dataframe. – RichieV Sep 02 '20 at 20:41
  • Consider using [.to_hdf](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_hdf.html#pandas.DataFrame.to_hdf) for quick columnar access with a binary file. – RichieV Sep 02 '20 at 20:43

1 Answer

import pandas as pd

# only 3 columns
df = pd.read_csv('sample.csv', header=None, usecols=range(3))
print(df)
#     0   1   2
# 0   1   2   3
# 1  12  34  45
# 2  34  34  24

So just change the `range` value to the number of fields you need, e.g. `range(256)` for the real files.
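For anyone who wants to try this without creating a file first, here is a minimal self-contained sketch of the same call. It assumes the three sample lines from the question; `io.StringIO` stands in for `sample.csv`, and `N_FIELDS` is a hypothetical name for the number of fields to keep (it would be 256 for the real files).

```python
import io

import pandas as pd

# The three sample lines from the question; StringIO stands in for 'sample.csv'.
raw = "1,2,3,4,5,6\n12,34,45,65\n34,34,24\n"

# Hypothetical constant: number of leading fields to keep (256 for the real files).
N_FIELDS = 3

# header=None: the file has no header row.
# usecols=range(N_FIELDS): only the first N_FIELDS columns are parsed into the dataframe.
df = pd.read_csv(io.StringIO(raw), header=None, usecols=range(N_FIELDS))

print(df.shape)
print(df)
```

The printed frame should match the output shown above. As noted in the comments on the question, pandas still reads the whole file; `usecols` only limits which columns end up in the dataframe.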

Danila Ganchar