This related question points out a part of the ?read.table
documentation that explains your problem:
If there is a header and the first row contains one fewer field
than the number of columns, the first column in the input is used
for the row names. Otherwise if row.names is missing, the rows are numbered.
Your header row likely has one fewer field than the rest of the file, so read.table assumes that the first column holds the row.names (which must all be unique) rather than an ordinary column (which can contain duplicate values). You can fix this with one of the following two solutions:
- adding a delimiter (i.e. \t or ,) to the front or end of your header row in the source file, or
- removing any trailing delimiters in your data
The choice will depend on the structure of your data.
test.csv Example:
If your test.csv looks like this:
v1,v2,v3
a1,a2,a3,
b1,b2,b3,
By default, read.table interprets this file as having one fewer header column than the data because the delimiters don't match. This is how it is interpreted by default:
   v1,v2,v3  # 3 items!! (header row, shifted one column to the right)
a1,a2,a3,    # 4 items
b1,b2,b3,    # 4 items
The values in the first column (with no header) are interpreted as row.names: a1 and b1. If this column contains duplicate values, which is entirely possible, then you get the duplicate 'row.names' are not allowed error.
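To check whether this is what is happening with your file, you can count the fields on each line and then try the read. This is a minimal sketch: the inline text stands in for test.csv, with a1 repeated in the first column (my assumption, chosen so the error is triggered), and read.csv is used as the comma-separated wrapper around read.table.

```r
# Same layout as test.csv above, but with "a1" repeated in the first column
# (an assumed variation, chosen to trigger the error).
csv_text <- "v1,v2,v3\na1,a2,a3,\na1,b2,b3,\n"

# Count the fields on each line: the header has one fewer than the data rows.
con <- textConnection(csv_text)
n_fields <- count.fields(con, sep = ",")
close(con)
print(n_fields)

# read.csv() is read.table() with sep = "," and header = TRUE, so the first
# column is taken as row names and the duplicated "a1" raises an error.
msg <- tryCatch(
  {
    read.csv(text = csv_text)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

If the first count is exactly one less than the others, the first column is being used for the row names.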
If you set row.names = FALSE, the header row shift doesn't happen, but you still have a mismatching number of columns in the header and in the data because the delimiters don't match.
This is how it is interpreted with row.names = FALSE:
v1,v2,v3 # 3 items!! (header row)
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
Solution 1
Add trailing delimiter to header row:
v1,v2,v3, # 4 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
Or, add leading delimiter to header row:
,v1,v2,v3 # 4 items!!
a1,a2,a3, # 4 items
b1,b2,b3, # 4 items
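If you would rather not edit the file by hand, Solution 1 can be sketched in R by patching the header line before parsing. The raw_lines vector below is a hypothetical stand-in for readLines("test.csv"), again with a duplicated a1 in the first column.

```r
# Hypothetical stand-in for: raw_lines <- readLines("test.csv")
raw_lines <- c("v1,v2,v3", "a1,a2,a3,", "a1,b2,b3,")

# Append a trailing delimiter to the header so it has 4 fields like the data.
raw_lines[1] <- paste0(raw_lines[1], ",")

# The header and data now agree, so the first column stays an ordinary column.
df <- read.csv(text = paste(raw_lines, collapse = "\n"))
print(df)
```

The result carries an extra, empty fourth column, which you can drop afterwards if you don't need it.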
Solution 2
Remove excess trailing delimiter from non-header rows:
v1,v2,v3 # 3 items
a1,a2,a3 # 3 items!!
b1,b2,b3 # 3 items!!
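Solution 2 can also be done in R rather than in the source file. This is a rough sketch that assumes every unwanted delimiter is a single comma at the very end of a line; raw_lines is a hypothetical stand-in for readLines("test.csv"), with a duplicated a1 in the first column.

```r
# Hypothetical stand-in for: raw_lines <- readLines("test.csv")
raw_lines <- c("v1,v2,v3", "a1,a2,a3,", "a1,b2,b3,")

# Strip one trailing comma from each line; lines without one are unchanged.
clean_lines <- sub(",$", "", raw_lines)

# Every line now has 3 fields, so no column is promoted to row names.
df <- read.csv(text = paste(clean_lines, collapse = "\n"))
print(df)
```

Be careful with this approach if a trailing comma can legitimately mean an empty last value in your data, because that value would be silently dropped.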