What is the best way to take a data file that contains a header row and read this row into a named tuple so that the data rows can be accessed by header name?

I was attempting something like this:

import csv
from collections import namedtuple

with open('data_file.txt', mode="r") as infile:
    reader = csv.reader(infile)
    Data = namedtuple("Data", ", ".join(i for i in reader[0]))
    next(reader)
    for row in reader:
        data = Data(*row)

The reader object is not subscriptable, so the above code throws a TypeError. What is the Pythonic way to read a file header into a namedtuple?


3 Answers

Use:

Data = namedtuple("Data", next(reader))

and omit the line:

next(reader)

Combining this with an iterative version based on martineau's comment below, the example becomes, for Python 2:

import csv
from collections import namedtuple
from itertools import imap

with open("data_file.txt", mode="rb") as infile:
    reader = csv.reader(infile)
    Data = namedtuple("Data", next(reader))  # get names from column headers
    for data in imap(Data._make, reader):
        print data.foo
        # ...further processing of a line...

and for Python 3:

import csv
from collections import namedtuple

with open("data_file.txt", newline="") as infile:
    reader = csv.reader(infile)
    Data = namedtuple("Data", next(reader))  # get names from column headers
    for data in map(Data._make, reader):
        print(data.foo)
        # ...further processing of a line...
Sven Marnach
  • drbunsen: After doing this you can change the processing loop to: `for data in map(Data._make, reader):`. – martineau Jul 12 '15 at 15:56
  • What if the csv data lacks a header? Is there a way to assign a name to a column? (If the CSV data lacks a named header, and you want to assign column names, then it looks to me like my only option is to read it in as a sequence of dictionaries.) – Scott Prive Oct 04 '16 at 11:59
  • @Crossfit_and_Beer I don't really understand your comment. If you want to read the CSV file as a series of dictionaries, you would still need column names as keys, so where is the difference? If you want to use `namedtuple`s, you can simply declare the `namedtuple` type statically with fixed field names instead of `next(reader)`. The rest of the code remains the same. – Sven Marnach Oct 04 '16 at 12:05
  • @Jean-FrançoisFabre I reverted your change because the resulting code was wrong for both Python 2 and Python 3. In Python 2, `mode="rb"` is required, while in Python 3 `newline=""` is required. – Sven Marnach May 06 '17 at 15:42
  • @SvenMarnach you're right for writing, but not for reading. Do the test, you'll see. `newline=""` is only useful for some old versions of Python 3 which insert 1 blank line after each row (same thing for the latest 2.7 releases, where `"rb"` isn't required). Check my Q&A: http://stackoverflow.com/questions/38808284/portable-way-to-write-csv-file-in-python-2-or-python-3 and test it by yourself. `open("data_file.txt")` (when reading) works for any version of Python. Writing is something else, but seems to be ok without newline or wb in later versions of either the 2 or 3 branch. – Jean-François Fabre May 06 '17 at 15:49
  • @Jean-FrançoisFabre I can't try it out, since I don't have access to a platform where `b` actually makes a difference, and I don't think it's necessary. Both the latest Python 2 and Python 3 documentation for the `csv` module state these requirements, so even if you found that it happens to work on some platforms for some inputs, you are still using the API in an undocumented way, which might break at any time. – Sven Marnach May 06 '17 at 16:22
  • you're right about the documentation. I may ask some question about this sometimes. Raymond Hettinger is lurking on SO, he may have some say. – Jean-François Fabre May 06 '17 at 16:59
  • @SvenMarnach I think better will be to use `lambda` here, instead of protected method: `map(lambda i: Data(*i), reader)` – drjackild Oct 26 '17 at 14:57
  • @drjackild The method isn't _actually_ private. Quote from the [documentation](https://docs.python.org/3/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields): "To prevent conflicts with field names, the method and attribute names start with an underscore." And using `map()` with a lambda is frowned upon in Python. It's unnecessarily slow and better expressed as a list comprehension. – Sven Marnach Oct 26 '17 at 15:05
  • Is there an efficient way to specify data types for the read here? Everything seems to be coming through as string for me. – Brendan Oct 10 '18 at 20:21
  • @Brendan The easiest solution is to use the `pandas` CSV reader instead of the one from the standard library. If you want to use the one from the standard lib, you can manually convert the data, e.g. `converters = [int, str, float]; row = [conv(x) for conv, x in zip(converters, row)]`. You will need some error handling, of course. – Sven Marnach Oct 10 '18 at 21:02
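
Expanding on the manual-conversion idea in the comment above, a minimal sketch (the converter list and the assumed column types are illustrative, not taken from the question):

import csv
from collections import namedtuple

converters = [int, str, float]  # assumed column types, for illustration only

with open("data_file.txt", newline="") as infile:
    reader = csv.reader(infile)
    Data = namedtuple("Data", next(reader))  # get names from column headers
    for raw in reader:
        # apply one converter per column before building the named tuple
        data = Data._make(conv(x) for conv, x in zip(converters, raw))
        # ...further processing of a typed row...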

Please have a look at csv.DictReader. Basically, it gets the column names from the first row, as you're looking for, and after that lets you access each column in a row by name using a dictionary.
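
For example, plain DictReader access might look like this (a minimal sketch; the column name `foo` is assumed for illustration and is not part of the question):

import csv

with open('data_file.txt') as infile:
    reader = csv.DictReader(infile)  # first row supplies the dictionary keys
    for row in reader:
        print(row['foo'])            # access a column in the row by header name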

If for some reason you still need to access the rows as a collections.namedtuple, it should be easy to transform the dictionaries to named tuples as follows:

import csv
import collections

with open('data_file.txt') as infile:
    reader = csv.DictReader(infile)
    Data = collections.namedtuple('Data', reader.fieldnames)
    tuples = [Data(**row) for row in reader]
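
Usage then looks the same as in the other answers (assuming one of the header names is `foo`):

for t in tuples:
    print(t.foo)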
jcollado
  • Problem with this solution is that every row is converted to a dictionary, and then converted to the named tuple. Inefficient if the intermediate dictionary is not required. – Chris Cogdon Sep 15 '15 at 20:48
  • This doesn't preserve order, so the first column in your csv becomes a random one in your namedtuple. At that point, might as well use a dict. – hraban Feb 28 '17 at 10:04

I'd suggest this approach:

import csv
from collections import namedtuple

with open("data.csv", 'r') as f:
    reader = csv.reader(f, delimiter=',')
    Row = namedtuple('Row', next(reader))
    rows = [Row(*line) for line in reader]

If you work with Pandas, the solution becomes even more elegant:

import pandas as pd
from collections import namedtuple

data = pd.read_csv("data.csv")
Row = namedtuple('Row', data.columns)
rows = [Row(*row) for index, row in data.iterrows()]

In both cases you can interact with the records by field names:

for row in rows:
    print(row.foo)
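
As a side note, pandas can also produce named tuples directly via DataFrame.itertuples(), without defining a namedtuple class by hand (a minimal sketch; `foo` is again an assumed column name):

import pandas as pd

data = pd.read_csv("data.csv")
for row in data.itertuples(index=False):  # yields named tuples built from the column names
    print(row.foo)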
Roman
  • I don't think the `Row = namedtuple('Row', next(reader))` will work the way you have it because the second argument to `namedtuple` is supposed to be the fieldnames of the tuple subclass, which "are a sequence of strings such as `['x', 'y']`" according to the [documentation](https://docs.python.org/3/library/collections.html#collections.namedtuple). You're also repeatedly creating the `reader` in the loop. – martineau Apr 27 '20 at 15:34