Pandas: use passed column names only if the file has no header

Question

Is there a way, without reading the file twice, to check whether a file has a header row and, if it does not, use the column names I pass in? I have files with the same structure, but some of them do not contain a header for some reason.

Example with header:

Field1 Field2 Field3
data1  data2  data3

Example without header:

data1  data2  data3

When I use the call below on a file that has a header, the header line becomes the first row of data instead of being replaced.

pd.read_csv('filename.csv', names=col_names)

When I use the call below instead, the first row of data is dropped if there is no header in the file.

pd.read_csv('filename.csv', header=0, names=col_names)
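
Both behaviours can be reproduced in memory. This is a minimal sketch, assuming comma-separated data and using `io.StringIO` in place of real files; the sample contents and the `col_names` list are made up for illustration.

```python
import io

import pandas as pd

col_names = ["Field1", "Field2", "Field3"]  # assumed column names
with_header = "Field1,Field2,Field3\ndata1,data2,data3\n"  # illustrative content
without_header = "data1,data2,data3\n"                      # illustrative content

# names= alone: no line is treated as a header, so the header line
# of a file that has one ends up as the first data row.
a = pd.read_csv(io.StringIO(with_header), names=col_names)

# header=0 together with names=: the first line is always discarded,
# so a headerless file loses its first row of data.
b = pd.read_csv(io.StringIO(without_header), header=0, names=col_names)

print(a)
print(b)
```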

My current workaround is to load the file, check whether the expected column exists, and if it doesn't, read the file again.

df = pd.read_csv('filename.csv')
if 'Field1' not in df.columns:
    # No header row: the first data row was consumed as column names, so re-read
    del df
    df = pd.read_csv('filename.csv', names=col_names)

Is there a better way to handle this data set that doesn't involve potentially reading the file twice?

Using Python or your shell language, you could more easily read only the first line without invoking Pandas. Example: [here](https://stackoverflow.com/a/1767589/8508004). — Wayne, Nov 29 '21 at 20:37
I think it is better to read the file twice. You can set `nrows=0, header=None` in `pd.read_csv` (or whatever you are using) to tell pandas you want to read only the first line (saving memory) — Tarifazo, Nov 29 '21 at 20:43
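
The suggestion in the first comment, reading just the first line with plain Python before calling pandas, could look roughly like this. It is a sketch under assumptions: the data is comma-separated, the header (when present) matches `col_names` exactly, and `read_with_optional_header` and `_write_tmp` are hypothetical helper names, not pandas functions.

```python
import os
import tempfile

import pandas as pd

col_names = ["Field1", "Field2", "Field3"]  # assumed column names


def read_with_optional_header(path, col_names, sep=","):
    """Hypothetical helper: peek at the first line, then read the file once."""
    with open(path, newline="") as f:
        first_line = f.readline().rstrip("\r\n")
    # Naive split: good enough when header fields contain no quoted separators.
    has_header = [c.strip() for c in first_line.split(sep)] == col_names
    if has_header:
        return pd.read_csv(path, sep=sep, header=0)
    return pd.read_csv(path, sep=sep, names=col_names)


def _write_tmp(text):
    """Write text to a temporary CSV file and return its path (demo only)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w") as f:
        f.write(text)
    return path


p1 = _write_tmp("Field1,Field2,Field3\ndata1,data2,data3\n")
p2 = _write_tmp("data1,data2,data3\n")
df1 = read_with_optional_header(p1, col_names)
df2 = read_with_optional_header(p2, col_names)
os.remove(p1)
os.remove(p2)
```

Only one line is read outside pandas, so the cost of the check does not grow with the size of the file.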

score 1 · Answer 1 · answered Nov 29 '21 at 21:09

Modify your logic so that the first pass reads only the first row:

import pandas as pd

# Load the first row and set up keyword args if necessary
kw_args = {}
first = pd.read_csv('filename.csv', nrows=1)
if 'Field1' not in first.columns:
    # No header in the file: supply the column names explicitly
    kw_args["names"] = col_names
# Load data
df = pd.read_csv('filename.csv', **kw_args)

score 1 · Answer 2 · answered Nov 29 '21 at 21:19

You can do this with the `seek` method of the file object:

with open('filename.csv') as csvfile:
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # return file pointer to the beginning of the file

    # do stuff here
    if 'Field1' in headers:
        ...
    else:
        ...

    df = pd.read_csv(csvfile, ...)

The file is opened only once; only the header line is parsed a second time.
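
A filled-in version of this pattern might look like the following. It is a sketch rather than the answerer's exact code: the branch bodies are one assumed way to complete the elided parts, `load_with_optional_header` is a hypothetical helper name, and `io.StringIO` stands in for an open file so the example runs on its own.

```python
import io

import pandas as pd

col_names = ["Field1", "Field2", "Field3"]  # assumed column names


def load_with_optional_header(handle, col_names):
    """Hypothetical helper: handle is any seekable text file object."""
    # nrows=0 parses only the first line, which pandas treats as the header.
    first = pd.read_csv(handle, nrows=0).columns.tolist()
    handle.seek(0)  # rewind so the next read starts at the top again
    if col_names[0] in first:
        # The file has its own header: let pandas use it.
        return pd.read_csv(handle)
    # No header: the first line is data, so supply the names ourselves.
    return pd.read_csv(handle, names=col_names)


df_a = load_with_optional_header(
    io.StringIO("Field1,Field2,Field3\ndata1,data2,data3\n"), col_names
)
df_b = load_with_optional_header(io.StringIO("data1,data2,data3\n"), col_names)
```

With a real file, the same helper would be called inside `with open('filename.csv') as csvfile:`.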
