Is there a way, without reading the file twice, to check whether a file has a header row and, if it does not, fall back to column names that I pass in? I have files with the same structure, but some of them are missing the header for some reason.

Example with header:

Field1 Field2 Field3
data1  data2  data3

Example without header:

data1  data2  data3

With the call below, a file that has a header ends up with that header as its first data row instead of having it replaced.

pd.read_csv('filename.csv', names=col_names)

With the call below, the first row of data is dropped if there is no header in the file.

pd.read_csv('filename.csv', header=0, names=col_names)

My current workaround is to load the file, check whether the expected column exists, and if it doesn't, read the file again.

df = pd.read_csv('filename.csv')
if 'Field1' not in df.columns:
    del df
    df = pd.read_csv('filename.csv', names=col_names)

Is there a better way to handle this data set that doesn't involve potentially reading the file twice?

pyCthon
magic_frank
  • 1
    Using Python or your shell language, you could more easily read only the first line without invoking Pandas. Example: [here](https://stackoverflow.com/a/1767589/8508004). – Wayne Nov 29 '21 at 20:37
  • 2
    I think it is better to read the file twice. You can set `nrows=0, header=None` in `pd.read_csv` (or whatever you are using) to tell pandas you want to read only the first line (saving memory) – Tarifazo Nov 29 '21 at 20:43
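
A minimal sketch of the first comment's suggestion: peek at the first line with plain Python, then call pandas once. The helper name `header_kwargs`, the `expected_field` parameter, and the comma delimiter are assumptions for illustration, not part of the original question.

```python
import csv


def header_kwargs(path, col_names, expected_field, sep=","):
    """Build keyword arguments for pd.read_csv from the file's first line.

    Hypothetical helper: `expected_field` is a name that appears in the
    first line only when the file has a header (e.g. 'Field1').
    """
    with open(path, newline="") as f:
        # Parse just the first line; an empty file yields an empty list.
        first_row = next(csv.reader(f, delimiter=sep), [])

    has_header = expected_field in [cell.strip() for cell in first_row]

    # header=0 tells pandas to discard the file's own header row and use
    # `names` instead; header=None tells it that every row is data.
    return {"names": col_names, "header": 0 if has_header else None, "sep": sep}
```

Usage would then be a single pandas call, `df = pd.read_csv('filename.csv', **header_kwargs('filename.csv', col_names, 'Field1'))`. Only the first line is read outside pandas, so the full file is parsed once.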

2 Answers

Just modify your logic so the first time through only reads the first row:

# Load first row and setup keyword args if necessary
kw_args = {}
first = pd.read_csv('filename.csv', nrows=1)
if 'Field1' not in first.columns:
    kw_args["names"] = col_names
# Load data
df = pd.read_csv('filename.csv', **kw_args)

You can do this with the `seek` method of the file object:

with open('filename.csv') as csvfile:
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # return file pointer to the beginning of the file

    # do stuff here
    if 'Field1' in headers:
        ...
    else:
        ...

    df = pd.read_csv(csvfile, ...)

The file is opened only once; after the `seek(0)`, only the header line is parsed a second time.
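
For reference, one way the two branches above might be filled in, as a sketch rather than the answerer's exact code. The function name `read_with_optional_header` and the `expected_field` parameter are hypothetical, and `col_names` is assumed to be the list from the question.

```python
import pandas as pd


def read_with_optional_header(path, col_names, expected_field="Field1"):
    """Load a CSV that may or may not start with a header row."""
    with open(path, newline="") as csvfile:
        # nrows=0 asks pandas for the column labels only, no data rows.
        first_labels = pd.read_csv(csvfile, nrows=0).columns.tolist()
        csvfile.seek(0)  # rewind so the full read starts at the top

        if expected_field in first_labels:
            # Header present: skip it and apply our own names.
            return pd.read_csv(csvfile, header=0, names=col_names)
        # No header: treat every line as data.
        return pd.read_csv(csvfile, header=None, names=col_names)
```

Either way the result has the columns in `col_names` and no header row mixed into the data, for example `df = read_with_optional_header('filename.csv', col_names)`.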

Corralien