I have a large data.frame in R with thousands of rows and 4 columns. For example:

   Chromosome    Start      End Count
1 NC_031985.1 16255093 16255094     1
2 NC_031972.1 11505205 11505206     1
3 NC_031971.1 24441227 24441228     1
4 NC_031977.1 29030540 29030541     1
5 NC_031969.1   595867   595868     1
6 NC_031986.1 40147812 40147813     1

I also have a second data.frame that maps each chromosome name to its accession:

LG1     NC_031965.1
LG2     NC_031966.1
LG3a    NC_031967.1
LG3b    NC_031968.1
LG4     NC_031969.1
LG5     NC_031970.1
LG6     NC_031971.1
LG7     NC_031972.1
LG8     NC_031973.1
LG9     NC_031974.1
LG10    NC_031975.1
LG11    NC_031976.1
LG12    NC_031977.1
LG13    NC_031978.1
LG14    NC_031979.1
LG15    NC_031980.1
LG16    NC_031987.1
LG17    NC_031981.1
LG18    NC_031982.1
LG19    NC_031983.1
LG20    NC_031984.1
LG22    NC_031985.1
LG23    NC_031986.1

I want to replace the accessions in the Chromosome column of the large data.frame with the chromosome names listed above, to get:

   Chromosome    Start      End Count
1 LG22        16255093 16255094     1
2 LG7         11505205 11505206     1
3 LG6         24441227 24441228     1
4 LG12        29030540 29030541     1
5 LG4           595867   595868     1
6 LG23        40147812 40147813     1

Does anybody know the least painful way to do this? It might be easy (or not), but my experience in R is limited.

Many thanks!

Sotos
Ioannis
  • Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), including your desired output. – lmo Sep 28 '17 at 11:45
  • try biomart...but you might get more help at https://www.biostars.org/ – Roman Sep 28 '17 at 11:46
  • I have edited the question. – Ioannis Sep 28 '17 at 12:00
  • If you join your tables by "chromosome" you'll get what you want. See an example here: https://rpubs.com/NateByers/Merging – AntoniosK Sep 28 '17 at 12:27
  • Thanks for the link! A quick question, though: dplyr does not assign the result of "left_join" to an object. So if I run left_join(mydata, list, by = "chromosome"), will it keep the mydata object as it is and add the extra information from the list? – Ioannis Sep 28 '17 at 12:59
  • Actually, if I do not assign this command to an object it prints out the joined data.frame. If I assign this command it comes up with a warning message: Column `Chromosome` joining factors with different levels, coercing to character vector – Ioannis Sep 28 '17 at 13:04
  • The answer to my last comment is here: https://stackoverflow.com/questions/30468412/dplyr-join-warning-joining-factors-with-different-levels Thanks again – Ioannis Sep 28 '17 at 13:12
  • You can make sure that both factors have the same levels before merging, as that post suggests, or you can just work with character (and not factor) variables, where you don't have to change anything. – AntoniosK Sep 28 '17 at 13:26

1 Answer

As discussed in the comments, here is the dplyr solution for anyone looking:

library(dplyr)
df %>%
  inner_join(chromo_names, by = c("Chromosome" = "V2")) %>%
  select(Chromosome = V1, Start, End, Count) 

On R versions before 4.0.0, where read.table() creates factors by default, this gives a warning that the two join columns have different factor levels. You can either ignore it and work with characters, or convert the merged column to a factor:

df %>%
  inner_join(chromo_names, by = c("Chromosome" = "V2")) %>%
  select(Chromosome = V1, Start, End, Count) %>%
  mutate(Chromosome = as.factor(Chromosome))
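As suggested in the comments, the warning can also be avoided up front by joining on character columns rather than factors. The following is only a sketch under assumptions: the two tables are cut down to three rows and built as factors on purpose to mimic the older read.table() default, and left_join is used instead of inner_join so that rows without a match are kept (with NA) rather than dropped.

```r
library(dplyr)

# Small stand-ins for the question's tables (assumption: built with factor
# columns to mimic the pre-4.0 read.table() default).
df <- data.frame(
  Chromosome = factor(c("NC_031985.1", "NC_031972.1", "NC_031969.1")),
  Start = c(16255093, 11505205, 595867),
  End   = c(16255094, 11505206, 595868),
  Count = c(1, 1, 1)
)
chromo_names <- data.frame(
  V1 = factor(c("LG4", "LG7", "LG22")),
  V2 = factor(c("NC_031969.1", "NC_031972.1", "NC_031985.1"))
)

# Convert the key and label columns to character before joining, so the
# join compares plain strings and has no factor levels to reconcile.
df_chr    <- df %>% mutate(Chromosome = as.character(Chromosome))
names_chr <- chromo_names %>%
  mutate(V1 = as.character(V1), V2 = as.character(V2))

# left_join keeps every row of df_chr, in its original order.
renamed <- df_chr %>%
  left_join(names_chr, by = c("Chromosome" = "V2")) %>%
  select(Chromosome = V1, Start, End, Count)

print(renamed)
```

The result has to be assigned (here to `renamed`) to be kept; the join does not modify `df` in place.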

Here is a Base R solution:

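# Join on the accession column; sort = FALSE skips sorting by the key
merged = merge(df, chromo_names, 
               by.x = "Chromosome", 
               by.y = "V2", 
               sort = FALSE)

# Put the LG name (column 5, V1) first in place of the accession, then rename it
merged = merged[c(5, 2:4)]
names(merged)[1] = "Chromosome"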

Result:

  Chromosome    Start      End Count
1       LG22 16255093 16255094     1
2        LG7 11505205 11505206     1
3        LG6 24441227 24441228     1
4       LG12 29030540 29030541     1
5        LG4   595867   595868     1
6       LG23 40147812 40147813     1
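For a pure rename like this, a lookup with match() is another base R option. It is not the method used above, only a sketch: it leaves the rows of the large table in their original order and makes it easy to list accessions that have no entry in the name table. The tables are cut down to a few rows, and the accession "NC_000000.1" is made up for illustration to show what happens to an unmatched value.

```r
# Cut-down stand-ins for the question's tables; "NC_000000.1" is a
# hypothetical accession with no entry in the lookup table.
df <- data.frame(
  Chromosome = c("NC_031985.1", "NC_031972.1", "NC_031969.1", "NC_000000.1"),
  Start = c(16255093, 11505205, 595867, 100),
  End   = c(16255094, 11505206, 595868, 101),
  Count = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)
chromo_names <- data.frame(
  V1 = c("LG4", "LG7", "LG22"),
  V2 = c("NC_031969.1", "NC_031972.1", "NC_031985.1"),
  stringsAsFactors = FALSE
)

# For each row of df, find the position of its accession in the lookup table
# (NA where the accession is not listed).
idx <- match(df$Chromosome, chromo_names$V2)

# Record accessions that could not be translated before overwriting them.
unmatched <- unique(df$Chromosome[is.na(idx)])

# Replace accessions with LG names; unmatched rows become NA.
df$Chromosome <- chromo_names$V1[idx]

print(df)
print(unmatched)
```

Checking `unmatched` before going further is worthwhile on a table with thousands of rows, since an inner join or merge() would drop such rows without any message.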

Data:

df = read.table(text = "   Chromosome    Start  End Count
                1 NC_031985.1 16255093 16255094     1
                2 NC_031972.1 11505205 11505206     1
                3 NC_031971.1 24441227 24441228     1
                4 NC_031977.1 29030540 29030541     1
                5 NC_031969.1   595867   595868     1
                6 NC_031986.1 40147812 40147813     1", header = TRUE)

chromo_names = read.table(text = "LG1     NC_031965.1
                         LG2     NC_031966.1
                         LG3a    NC_031967.1
                         LG3b    NC_031968.1
                         LG4     NC_031969.1
                         LG5     NC_031970.1
                         LG6     NC_031971.1
                         LG7     NC_031972.1
                         LG8     NC_031973.1
                         LG9     NC_031974.1
                         LG10    NC_031975.1
                         LG11    NC_031976.1
                         LG12    NC_031977.1
                         LG13    NC_031978.1
                         LG14    NC_031979.1
                         LG15    NC_031980.1
                         LG16    NC_031987.1
                         LG17    NC_031981.1
                         LG18    NC_031982.1
                         LG19    NC_031983.1
                         LG20    NC_031984.1
                         LG22    NC_031985.1
                         LG23    NC_031986.1", header = FALSE)
acylam