0

Hello I have a seemingly very simple problem but I cannot figure out how to easily resolve it. I have have a dataset where I have 52 different sampling locations and they all have awkward names. As a short example here is a small column with the problem I ran into and what I would like it to read:

ProblemColumn:   Ideal Column:
Site1            1
Site1            1
Site1            1
Site2            2
Site2            2
Site3            3
Site3            3
Site3            3
Site4            4
Site5            5
Location1        6
Location1        6
Location2        7

I'm interested in converting each instance of Site1 into the number 1, Site 2 into number 2, etc. Is there a way to easily convert each unique value in this column to a numeric value? Ideally I wouldn't have to manually assign a number to each unique location since that would be 52 lines but that's fine if it's inevitable. I apologize if this is super basic but my class rarely deals with this kind of thing and I haven't been able to figure out an easy way to achieve this.

Anon
  • 3
  • 2
  • 1
    Try any of these: df %>% mutate(newcol = parse_number(ProblemColumn)) or df %>% mutate(newcol = str_remove(ProblemColumn, '[A-z]+')) or gsub('[A-z]+','',df$ProblemColumn) ... where df is your dataframe – Karthik S Nov 29 '20 at 05:19
  • Hey, I think these are setting me in the right track. I'll play around with them and see if they help and report back. Probably in a few hours since I have to eat. Thank you! – Anon Nov 29 '20 at 05:27
  • `df$Id <- match(df$ProblemColumn, unique(df$ProblemColumn))` – Ronak Shah Nov 29 '20 at 06:18
  • Ronak, thank you so much for this solution! That did wonders!!!!! Exactly what I was hoping to do!!! Thank you so much – Anon Nov 29 '20 at 12:04

2 Answers2

0

Instead of extracting the number from the column, you could also try assigning a unique numeric ID for each group using dplyr's cur_group_id():

library(dplyr)

data %>%
  group_by(site) %>%
  mutate(site_id = cur_group_id())

With sample data:

data <- tibble(site = c("Site 1", "Site 1", "Site 1", "Site 2", "Site 2", "Site 3", "Site 3", "Site 3", "Site 4", "Site 5"))

This produces:

# A tibble: 10 x 2
# Groups:   site [5]
   site   site_id
   <chr>    <int>
 1 Site 1       1
 2 Site 1       1
 3 Site 1       1
 4 Site 2       2
 5 Site 2       2
 6 Site 3       3
 7 Site 3       3
 8 Site 3       3
 9 Site 4       4
10 Site 5       5
semaphorism
  • 836
  • 3
  • 13
  • Is there a way to do this without repeating Sites each time? I have 1,500+ entries so that wouldn't be realistic. And for the record, I can't easily extract the number from the Site name from my actual data. They have names like Ann1 and Aba1 so extracting numbers wouldn't work – Anon Nov 29 '20 at 05:40
  • 1
    With semaphorism's answer here, you don't have to do anything manually. The "cur_group_id" just writes out for you the number, that the grouping-algorithm has assigned to each site. His second and third code snippets are just an example - the entire thing is in his first snippet. – ZKA Nov 29 '20 at 06:05
0

If you want to keep the number and remove the characters, You can you str_remove_all

library(stringr)
> df <- str_remove_all(df$site, "[A-z]+")
 df <- data.table(df)
> df
    df
 1:  1
 2:  1
 3:  1
 4:  2
 5:  2
 6:  3
 7:  3
 8:  3
 9:  4
10:  5



> df <- tibble(site = c("Aba1", "AFF 1", "Site 1", "Site 2", "Site 2", "Site 3", "Site 3", "Site 3", "Site 4", "Site 5"))
> df <- str_remove_all(df$site, "[A-z]+")
> df
 [1] "1"  " 1" " 1" " 2" " 2" " 3" " 3" " 3" " 4" " 5"
> df <- data.table(df)
> df
    df
 1:  1
 2:  1
 3:  1
 4:  2
 5:  2
 6:  3
 7:  3
 8:  3
 9:  4
10:  5
Kian
  • 110
  • 7
  • I should have been more clear. It can't be as simple as just removing the Strings because the actual data has repeated numbers for different sites. For example Aba1 and AFF1 so that won't work – Anon Nov 29 '20 at 05:53
  • It doesn't matter if there be any other letters. look at the second part. – Kian Nov 29 '20 at 06:02