
I am renaming the columns of a data frame (Data) in R, using names stored in a character array.

If two names in the character array (Names) are the same, e.g. ("JK","JK","test","hi"), then using


colnames(Data) <- Names
colnames(Data)

Output:

"JK" "JK.1" "test" "hi"

Desired output:

"JK" "JK" "test" "hi"

I am not able to figure out why .1 is appended to the second name.

Any suggestions on how to avoid this?

Ronak Shah
Natasha
  • In R, a `data.frame` should not have duplicate column names. This is felt strongly enough that de-duplicating them is the default behavior (`check.names=TRUE`). In your example, it is unclear which column `Data$JK` or `Data[["JK"]]` should return. However, you can always allow duplicates with `data.frame(a=1, a=2, check.names=FALSE)`. – r2evans Sep 03 '18 at 04:31
  • @r2evans Data$JK holds for my case. Could you please state what a=1 and a=2 means? – Natasha Sep 03 '18 at 04:53
  • *Which version of* `Data$JK`? My sample was a way to demonstrate making a data.frame with two columns with the same name. Assign that frame to (say) `dat` and then (1) see which value of `dat$a` you get, and then (2) how do you get to the second column named `dat$a`? There is a direct way, of course, but there exist functions that work on frames that do not always return the columns in exactly the same order as what you put in. This means you may not have assurance on which of the two identically-named columns you are referencing. Bottom line: bad idea. – r2evans Sep 03 '18 at 04:58
  • Frankly, your code doesn't make sense to me. If I make a fake data.frame with `dat <- data.frame(a=1,a=2)` (even with the default `check.names=TRUE`) and then do `colnames(dat) <- c("a","a")`, it silently and without warning names them identically. From this, I can only infer that `Names` has the names exactly as they are output, it's not the renaming that is failing. How are you generating `Names`? If you hope for more of an answer than you currently have, please make this question reproducible, good refs: https://stackoverflow.com/questions/5963269/ and https://stackoverflow.com/help/mcve. – r2evans Sep 03 '18 at 05:03
  • @r2evans That makes things clear. As you rightly pointed out, the index of the columns with the same name is not known a priori, and the number of columns in my real case is ~5000. The names in the character array come from parsing files. I'm afraid I wouldn't be able to add them here. – Natasha Sep 03 '18 at 05:06

2 Answers


The column names are changed because of the make.unique call in data.frame, which alters duplicate column names:

make.unique(c("JK", "JK", "JK", "test"))
#[1] "JK"   "JK.1" "JK.2" "test"
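As a rough illustration of where the renaming happens (and of the point raised in the comments that `colnames<-` itself does not de-duplicate), here is a minimal sketch. The frames `d1` and `d2` and their values are made up for the example; only `data.frame`, its `check.names` argument, `names` and `colnames<-` are base R.

```r
# Default: check.names = TRUE, so duplicate names are made unique at construction
d1 <- data.frame(JK = 1, JK = 2, test = 3, hi = 4)
names(d1)
#[1] "JK"   "JK.1" "test" "hi"

# check.names = FALSE keeps the names exactly as supplied
d2 <- data.frame(JK = 1, JK = 2, test = 3, hi = 4, check.names = FALSE)
names(d2)
#[1] "JK"   "JK"   "test" "hi"

# Assigning names afterwards does not de-duplicate either
colnames(d1) <- c("JK", "JK", "test", "hi")
names(d1)
#[1] "JK"   "JK"   "test" "hi"

# With duplicates, name-based extraction only reaches the first matching column
first_jk <- d2[["JK"]]
```

If the names already carry the `.1` suffix before the assignment, the suffix was probably introduced earlier, for example when the data were read in or the frame was built with the default `check.names = TRUE`.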

We can use sub to match a literal . (an unescaped . is a metacharacter that matches any character, so escape it with \\ to get the literal meaning) followed by one or more digits (\\d+) at the end ($) of the string, and replace the match with blank ("")

names(Data) <- sub("\\.\\d+$", "", names(Data))
names(Data)
#[1] "JK"   "JK"   "test" "hi"  

Or another option is str_remove

library(stringr)
names(Data) <- str_remove(names(Data), "\\.\\d+$")
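One caveat worth checking before relying on this pattern: it removes any trailing `.digits`, including ones that were part of a genuine name. The sketch below uses a made-up vector `original` (with a hypothetical real name `"v.1"`) and a placeholder frame to show the effect, and a possible alternative when the original names are still available.

```r
original <- c("JK", "JK", "v.1", "test")   # hypothetical parsed names
mangled  <- make.unique(original)          # "JK" "JK.1" "v.1" "test"

# The regex cannot tell an appended suffix from a real one
stripped <- sub("\\.\\d+$", "", mangled)
stripped
#[1] "JK"   "JK"   "v"    "test"

# If the original vector is still at hand, assigning it directly avoids the regex
Data <- data.frame(1, 2, 3, 4)             # placeholder frame with four columns
names(Data) <- original
names(Data)
#[1] "JK"   "JK"   "v.1"  "test"
```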

NOTE: It is better to have unique column names in a data frame than duplicated names

akrun

I am not able to figure out why .1 is appended to the second name.

This is because the column names of a data frame are expected to be unique. How would you select a column if two columns had the same name? To avoid .1 being appended to the column name, make sure your names array has all unique elements. You can write a function that checks for duplicates in the names array and replaces them with something meaningful.
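A minimal sketch of such a function, using only base R (`duplicated`, `ave`, `ifelse`). The helper name `label_duplicates` and the `_` separator are assumptions for illustration, not an established convention, and the sketch does not guard against a generated name such as `JK_1` colliding with an existing one.

```r
Names <- c("JK", "JK", "test", "hi", "JK")   # example input with repeats

# Which names occur more than once?
dups <- unique(Names[duplicated(Names)])
dups
#[1] "JK"

# Hypothetical helper: number every occurrence of a repeated name
label_duplicates <- function(x, sep = "_") {
  idx <- ave(seq_along(x), x, FUN = seq_along)  # running count within each name
  n   <- ave(seq_along(x), x, FUN = length)     # total occurrences of each name
  ifelse(n > 1, paste(x, idx, sep = sep), x)    # leave unique names untouched
}

labelled <- label_duplicates(Names)
labelled
#[1] "JK_1" "JK_2" "test" "hi"   "JK_3"
```

The result can then be assigned with `colnames(Data) <- labelled`, so every column stays individually addressable by name.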

someone