
When reading a table while specifying duplicate column names - let's say two different names - pandas 0.16.1 will copy the last two columns of the data over and over again.

In [1]:
df = pd.read_table('Datasets/tbl.csv', header=0, names=['one','two','one','two','one']) 
df

tbl.csv contains a table with 5 different columns, yet only the last two are repeated in the result instead of all five columns appearing.

Out[1]:
one two one two one
0   0.132846    0.120522    0.132846    0.120522    0.132846
1   -0.059710   -0.151850   -0.059710   -0.151850   -0.059710
2   0.003686    0.011072    0.003686    0.011072    0.003686
3   -0.220749   -0.029358   -0.220749   -0.029358   -0.220749

The actual table has different values in every column. Here, the same two columns (corresponding to the last two in the file) are repeated, and no error or warning is given.

Do you think this is a bug or is it intended? I find it very dangerous to silently change an input like that. Or is it my ignorance?

Jens
  • I made it more explicit in the text now. The output consists of two columns that are repeated. The original file has 5 different columns. – Jens Jul 03 '15 at 13:18
  • Thanks, I got that. My question was why it is handled in this way instead of giving me an error. – Jens Jul 03 '15 at 13:25
  • I think that this is designed to work like this as it'll assign the names to each column and in `pandas\io\parsers.py` it builds a dict from these values `data = dict((k, v) for k, (i, v) in zip(names, data))` so you overwrite the names with the last column assignment – EdChum Jul 03 '15 at 13:26
  • I think this is a pretty entertaining bug: well found. (Confirmed on Python 2.7 too) It occurs for `read_csv` too. I think the overhead to check for duplicate column names is probably worth the total weirdness that this situation demonstrates. – LondonRob Jul 03 '15 at 13:27
  • Nowadays we get `ValueError: Duplicate names are not allowed.` – Armali Feb 28 '21 at 12:27
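The dict overwrite that EdChum's comment points at can be reproduced in plain Python, independent of pandas. A minimal sketch with made-up stand-in column data: building a dict from zipped (name, column) pairs silently keeps only the last column bound to each repeated key, which is exactly the Out[1] pattern above.

```python
# Five distinct columns, but only two distinct names. Building a dict
# from the zipped pairs keeps only the LAST column assigned to each
# repeated name -- columns 1, 2 and 3 are silently lost.
names = ['one', 'two', 'one', 'two', 'one']
columns = [[0.1], [0.2], [0.3], [0.4], [0.5]]  # stand-ins for the file's columns

data = dict((k, v) for k, (i, v) in zip(names, enumerate(columns)))

# Reassembling by name repeats the two surviving columns.
print(data)                      # {'one': [0.5], 'two': [0.4]}
print([data[n] for n in names])  # [[0.5], [0.4], [0.5], [0.4], [0.5]]
```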

1 Answer


Using duplicate values in indexes is inherently problematic: it leads to ambiguity. Code that you think works fine can suddenly fail on DataFrames with a non-unique index. argmax, for instance, can lead to a similar pitfall when a DataFrame has duplicates in its index.

It's best to avoid putting duplicate values in (row or column) indexes if you can. If you need to use a non-unique index, use them with care. Double-check the effect duplicate values have on the behavior of your code.
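For instance (a small sketch, assuming pandas is importable), label-based selection on a Series with a duplicated index returns a sub-Series where code may expect a scalar:

```python
import pandas as pd

# A Series whose index repeats the label 'a'.
s = pd.Series([10, 20, 30], index=['a', 'a', 'b'])

print(s['b'])  # unique label -> a scalar: 30
print(s['a'])  # duplicated label -> a two-element sub-Series, not a scalar
```

Any downstream code that assumes `s['a']` is a single number will break only when the index happens to contain duplicates.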

In this case, you could use

df = pd.read_csv('Datasets/tbl.csv', header=0)  # header=0 drops the file's header row
df.columns = ['one','two','one','two','one']

instead.
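Here is a self-contained sketch of that workaround, using an in-memory stand-in for the real file (since tbl.csv isn't available): read the data first, then assign the duplicated labels afterwards, so every column of the file survives.

```python
import io
import pandas as pd

# Stand-in for the real file: five distinct, headerless columns.
csv = io.StringIO("0.1,0.2,0.3,0.4,0.5\n"
                  "1.1,1.2,1.3,1.4,1.5\n")

df = pd.read_csv(csv, header=None)
df.columns = ['one', 'two', 'one', 'two', 'one']

print(df.shape)         # (2, 5) -- all five columns preserved
print(df['one'].shape)  # (2, 3) -- the duplicate label selects three columns at once
```

Note that `df['one']` now returns a three-column sub-frame, which is exactly the ambiguity the answer warns about, but at least no data has been silently discarded.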

ChaimG
unutbu
  • Thanks for the answer. I understand the problem behind duplicate values. My question was more about the motivation of not letting the call fail in the first place. I thought there might be a deeper reason in letting you import the table but then giving you an output that is changed in a non-obvious way. I'm very new to python and pandas and I thought maybe I didn't get a basic principle. If you think this is a bug I would report it. – Jens Jul 03 '15 at 13:23