R data frame organization

Question

I am a beginner data scientist working within an NCAA football program trying to learn R. We record repeated measures data during practice that needs to be organized then analyzed. I have this sample dataset that I am having trouble analyzing due to the variations/typos in the "Activity" column. I need a way to rename/organize the activities in the dataframe.

Raw Dataset looks like this:

Position	Activity	Max_Velocity (mph)
WR	Special Team 1	20
WR	special team 1	19
WR	Special team 1	18
WR	special Team 2	13
WR	ST 2	16
WR	St 2	19
WR	team 1	12
WR	team 1	20
WR	team 1	23
WR	team 2	21
WR	team 2	10
WR	team 2	3
WR	team 3	21
WR	team 3	11
WR	team 3	16
WR	Indy 1	20
WR	indy 2	21
WR	INDY 3	22

I need it to look like this:

Position	Activity	Max_Velocity (mph)	Activity_Catagory
WR	Special Team 1	20	Special Team
WR	special team 1	19	Special Team
WR	Special team 1	18	Special Team
WR	special Team 2	13	Special Team
WR	ST 2	16	Special Team
WR	St 2	19	Special Team
WR	team 1	12	Team
WR	team 1	20	Team
WR	team 1	23	Team
WR	team 2	21	Team
WR	team 2	10	Team
WR	team 2	3	Team
WR	team 3	21	Team
WR	team 3	11	Team
WR	team 3	16	Team
WR	Indy 1	20	Indy
WR	indy 2	21	Indy
WR	INDY 3	22	Indy

This way I can analyze the average of the dependent variable "Max_velocity" during each type of activity. I would rather not make individual lines of code correcting each typo in the "Activity" column. I think this can be done with loops but I am a beginner and don't know where to start. I would greatly appreciate any help to make this a simple fix.

You could probably start by [removing the number and space](https://stackoverflow.com/questions/52594952/r-gsub-numbers-and-space-from-variables) and then [upper case the first letters of words](https://stackoverflow.com/questions/6364783/capitalize-the-first-letter-of-both-words-in-a-two-word-string) — MrFlick, Jul 25 '21 at 04:33

score 2 · Answer 1 · answered Jul 25 '21 at 06:05

Here is another way to solve your problem

library(stringr)

df$Activity_Catagory <- str_to_title(df$Activity) |> 
                          str_remove(" \\d") |> 
                          str_replace("St", "Special Team")



#    Position       Activity Max_Velocity (mph) Activity_Catagory
# 1        WR Special Team 1                 20      Special Team
# 2        WR special team 1                 19      Special Team
# 3        WR Special team 1                 18      Special Team
# 4        WR special Team 2                 13      Special Team
# 5        WR           ST 2                 16      Special Team
# 6        WR           St 2                 19      Special Team
# 7        WR         team 1                 12              Team
# 8        WR         team 1                 20              Team
# 9        WR         team 1                 23              Team
# 10       WR         team 2                 21              Team
# 11       WR         team 2                 10              Team
# 12       WR         team 2                  3              Team
# 13       WR         team 3                 21              Team
# 14       WR         team 3                 11              Team
# 15       WR         team 3                 16              Team
# 16       WR         Indy 1                 20              Indy
# 17       WR         indy 2                 21              Indy
# 18       WR         INDY 3                 22              Indy

score 0 · Answer 2 · answered Jul 25 '21 at 04:33

You may need to first define the rules based on your data. Which values would be called as 'Special Team', which ones would be called as 'Team' etc.

Based on the example shared we can do case-insensitive pattern matching with grepl and list the conditions using case_when -

library(dplyr)

df %>%
  mutate(Activity_Category = case_when(grepl('Special Team', Activity, ignore.case = TRUE) | 
                                       grepl('ST ', Activity, ignore.case = TRUE) ~ 'Special Team', 
                                       grepl('Team', Activity, ignore.case = TRUE) ~ 'Team', 
                                       grepl('Indy', Activity, ignore.case = TRUE) ~ 'Indy'))

#   Position       Activity Max_Velocity..mph. Activity_Category
#1        WR Special Team 1                 20      Special Team
#2        WR special team 1                 19      Special Team
#3        WR Special team 1                 18      Special Team
#4        WR special Team 2                 13      Special Team
#5        WR           ST 2                 16      Special Team
#6        WR           St 2                 19      Special Team
#7        WR         team 1                 12              Team
#8        WR         team 1                 20              Team
#9        WR         team 1                 23              Team
#10       WR         team 2                 21              Team
#11       WR         team 2                 10              Team
#12       WR         team 2                  3              Team
#13       WR         team 3                 21              Team
#14       WR         team 3                 11              Team
#15       WR         team 3                 16              Team
#16       WR         Indy 1                 20              Indy
#17       WR         indy 2                 21              Indy
#18       WR         INDY 3                 22              Indy

R data frame organization

2 Answers2