1

I have a data frame like this:

levels<- c("level 1", "LEVEL 1", "Level 1 ", "Level I", "Level I ", 
"level one", "Level one", "Level One", "Level 1")
df<- as.data.frame(levels)
> df
 levels
1 level 1
2 LEVEL 1
3 Level 1 #this one has a space at the end. 
4 Level I
5 Level I #this one also has a space at the end. 
6 level one
7 Level one
8 Level One
9 Level 1 #this is the correct format I want. 

As you can see, some entries are in upper case, some have a trailing space, and the "1" is written variously as a digit, as a word, or as a Roman numeral.

I know I can just write multiple lines with gsub(), but I wanted to find a less tedious way to solve this problem.

This data frame has the same issue with level 2 and level 3 (for example "level 2", "level III ", "level II", "Level Two", "level three", "Level TWO"). It also contains strings that are not just "level #", such as "Level 1 with specifications", "Level 2 with specifications", "Level 3 with specifications", "Level 1 with others included", "Moderate", "Mild", "Severe", etc.

I do not want to replace strings such as "Level 1 with specifications", "Level 2 with specifications", "Level 3 with specifications", "Level 1 with others included", "Moderate", "Mild", "Severe", etc., but I do want to replace all of the oddly formatted levels with just "Level 1", "Level 2", "Level 3".

I tried this using apply() and for loops with gsub(), but none of them worked. Maybe this is because gsub() can't take a list as its pattern?

I also wanted to use a regular expression with str_replace() to match the pattern, but I can't figure out how. I have never used str_replace() and I am new to regular expressions.

Any ideas?

Julius Vainora
lalaboo
  • are you dealing with a real db problem where your table has several million rows and you are looking to group over a field that looks like the example? how many levels are there? – Stephen Dec 07 '18 at 22:25
  • Yes, it has over 1500 rows, and it doesn't only include the three different levels, but other strings, such as "Mid", "Moderate", "Severe", "other", etc. – lalaboo Dec 07 '18 at 22:31

2 Answers

0

If I'm understanding you, this should work.

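# Lower-case everything and strip leading/trailing whitespace
df$levels = trimws(tolower(df$levels))

# Do the replacements ("|" for OR); \\b limits matches to whole words,
# so the "i" in e.g. "mild" is left alone
df$levels = gsub("\\b(three|iii)\\b", "3", df$levels)
df$levels = gsub("\\b(two|ii)\\b", "2", df$levels)
df$levels = gsub("\\b(one|i)\\b", "1", df$levels)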

# Capitalize first letter
substr(df$levels, 1, 1) = toupper(substr(df$levels, 1, 1))
# Or to only capitalize the word "level"
#df$levels = gsub("level", "Level", df$levels)
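A variant that leaves every other string exactly as it is could look like the sketch below. It is only an illustration under stated assumptions: the sample vector `x`, the lookup `variants`, and the helper `normalize_levels()` are hypothetical names, and only levels 1 to 3 are covered. The pattern is anchored with `^` and `$`, so a row is rewritten only when the whole string (ignoring case and surrounding whitespace) is a level label.

```r
# Hypothetical sample mixing level labels with strings that must stay as-is
x <- c("level 1", "LEVEL 1", "Level 1 ", "Level I ", "level one",
       "level III ", "Level TWO", "Level 1 with specifications", "Mild")

# Accepted spellings per level (assumption: only levels 1 to 3 occur)
variants <- c("1" = "1|i|one", "2" = "2|ii|two", "3" = "3|iii|three")

normalize_levels <- function(x) {
  out <- x
  for (n in names(variants)) {
    # The whole string must be "level <variant>", case-insensitively
    pattern <- paste0("^\\s*level\\s+(", variants[[n]], ")\\s*$")
    out[grepl(pattern, x, ignore.case = TRUE)] <- paste("Level", n)
  }
  out
}

normalize_levels(x)
# [1] "Level 1"  "Level 1"  "Level 1"  "Level 1"  "Level 1"  "Level 3"
# [7] "Level 2"  "Level 1 with specifications"  "Mild"
```

Because nothing is lower-cased in place, "Mild" and "Level 1 with specifications" come back unchanged.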
mickey
  • Thank you! It worked for all levels with "three", "two", or "one" and with Roman numerals! However, the column also contains strings that are not levels, such as "Moderate" and "Mild", which is why the last line of code would not work for the data I have. It also doesn't solve the problem of "level 1 " (which has a space at the end) vs "level 1". Is there a way to pass the list `levels` as the first argument of gsub(), so that it goes through every row of the df$levels column and replaces each row that matches any element of `levels`? – lalaboo Dec 07 '18 at 22:26
  • To remove the trailing spaces, use the function `trimws`. – Rui Barradas Dec 07 '18 at 22:29
  • @lalaboo If there are other strings *not related* to the example dataset in the question, you should edit it with those strings. – Rui Barradas Dec 07 '18 at 22:31
  • @lalaboo See [this question](https://stackoverflow.com/questions/6364783/capitalize-the-first-letter-of-both-words-in-a-two-word-string) about capitalizing the first letter. And I'm not sure what you mean with the second half of your comment. Only the first argument of `gsub` is used. @RuiBarradas Thanks, edited. – mickey Dec 07 '18 at 22:32
  • @mickey I want something like `df$levels <- gsub(levels, "Level 1", df$levels)`, like an apply function or a for loop, so that the function goes through the column examining each row, then finds and replaces the strings that match the elements of the list levels. – lalaboo Dec 07 '18 at 22:39
  • @RuiBarradas Thank you so much for your help!! I really appreciate it (: – lalaboo Dec 07 '18 at 22:40
  • @lalaboo Hmm, I think you're saying you want something like `gsub(c("level i", "level one", "level 1 "), "Level 1", df$levels)` where you give it the possible patterns to find. This should be done with "|" as I have in my answer, `gsub` would only use first element "level i" if you gave it a vector like that. That's not to say there probably isn't some workaround. – mickey Dec 07 '18 at 22:44
  • @mickey Thank you for your kind answer! It really helped (: – lalaboo Dec 07 '18 at 23:26
  • @RuiBarradas, I didn't want to add the extra strings _not_ related to the example dataset in the question because I thought it would make the question more confusing, but I will definitely edit my question! Thanks. – lalaboo Dec 07 '18 at 23:27
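
The point made in the comments, that a vector of patterns has to be joined with "|" before it is handed to `gsub`, can be sketched as follows. The vectors `x` and `variants` are made-up illustrations, and `sub` is used instead of `gsub` because each string can match at most once when the pattern is anchored.

```r
# Made-up column values and the spellings that should become "Level 1"
x <- c("level i", "level one", "level 1 ", "Moderate",
       "Level 1 with specifications")
variants <- c("level i", "level one", "level 1 ")

# Join the variants into a single alternation and anchor it,
# so that only whole-string matches are replaced
pattern <- paste0("^(", paste(variants, collapse = "|"), ")$")
pattern
# [1] "^(level i|level one|level 1 )$"

sub(pattern, "Level 1", x)
# [1] "Level 1"  "Level 1"  "Level 1"  "Moderate"
# [5] "Level 1 with specifications"
```

The match is case-sensitive here; passing `ignore.case = TRUE` would also catch spellings such as "LEVEL ONE".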
0

Here's a general approach allowing for levels to be in English words, Arabic or Roman numerals. The final output is always of the format "Level (Arabic numeral)".

library(english)
givePattern <- function(i)
  paste0("( |^)(", paste(i, tolower(as.character(as.roman(i))), as.character(english(i)), sep = "|"), ")( |$)")
fixLevels <- function(x, lvls)
  Reduce(function(y, lvl) replace(y, grep(givePattern(lvl), y), paste("Level", lvl)), lvls, init = tolower(x))

levels <- c(" level vi  ", "LEVEL Three  ", "   level thirteen", 
            "Level XXI", "level CXXIII", "    level fifty")
fixLevels(levels, 1:150)
# [1] "Level 6"   "Level 3"   "Level 13"  "Level 21"  "Level 123" "Level 50"

The first argument of fixLevels is a character vector, while the second argument is a vector of all levels to check for in the specified vector.

The function uses grep to detect integer level i in any format, e.g.,

givePattern(132)
# [1] "( |^)(132|cxxxii|one hundred thirty two)( |$)"

meaning that we look for 132 or cxxxii or one hundred thirty two that is surrounded by spaces and/or the beginning/end of the string. Everything is done in lower case.
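
To see what such a pattern accepts, the sketch below applies the pattern string printed above to a few made-up inputs. It uses base R only; the vector `candidates` is a hypothetical example, and the pattern is typed in literally rather than produced by `givePattern`, so the `english` package is not needed.

```r
# Pattern copied from the givePattern(132) output above
pattern <- "( |^)(132|cxxxii|one hundred thirty two)( |$)"

# Made-up inputs, lower-cased before matching as in fixLevels
candidates <- c("Level 132", "level CXXXII ", "level 1320",
                "Level one hundred thirty two")
hits <- grepl(pattern, tolower(candidates))
hits
# [1]  TRUE  TRUE FALSE  TRUE

# The Roman-numeral alternative comes from utils::as.roman
roman <- tolower(as.character(as.roman(132)))
roman
# [1] "cxxxii"
```

"level 1320" is rejected because the character after "132" is neither a space nor the end of the string.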

Now fixLevels utilizes givePattern. The anonymous function

function(y, lvl) replace(y, grep(givePattern(lvl), y), paste("Level", lvl))

takes some vector y, finds its elements where some form of level lvl is present, and replaces those elements with "Level lvl". Call this function f(y, lvl). We pass to Reduce this function f, a vector of levels lvls, and an initial vector tolower(x). Suppose that lvls is 1:3. What happens then is the following: r1 := f(tolower(x), 1), r2 := f(r1, 2), r3 := f(r2, 3), and we are done: r3 is our final output where each of the levels was taken care of.
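
The chain r1, r2, r3 can be written out by hand and compared with `Reduce`. This is a minimal sketch, not the answer's exact code: `patterns` is a hypothetical hard-coded stand-in for `givePattern` covering only levels 1 to 3 (so the `english` package is not required), and `x` is a made-up input.

```r
# Hypothetical stand-in for givePattern(), hard-coded for levels 1 to 3
patterns <- c("( |^)(1|i|one)( |$)",
              "( |^)(2|ii|two)( |$)",
              "( |^)(3|iii|three)( |$)")

# Same shape as the anonymous function passed to Reduce
f <- function(y, lvl) replace(y, grep(patterns[lvl], y), paste("Level", lvl))

x  <- c("LEVEL Two", " level iii", "Level one", "Mild")
r0 <- tolower(x)   # the initial value given to Reduce
r1 <- f(r0, 1)     # "level one" becomes "Level 1"
r2 <- f(r1, 2)     # "level two" becomes "Level 2"
r3 <- f(r2, 3)     # " level iii" becomes "Level 3"
r3
# [1] "Level 2" "Level 3" "Level 1" "mild"

# Reduce performs exactly this left-to-right chain
identical(r3, Reduce(f, 1:3, r0))
# [1] TRUE
```

Strings that mention no level, such as "Mild", pass through every step untouched apart from the initial lower-casing.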

Julius Vainora
  • @lalaboo, does my output format fit your case? I'm not entirely sure what you want to be done with those "Mid" and "Moderate". – Julius Vainora Dec 07 '18 at 22:45
  • Thank you! This is the closest code I have seen! I also forgot to mention that there are some rows in the data frame that have "Level 1 with specifications", "Level 2 with other", "Level 3 with specifications", etc. I am having a hard time understanding the function `fixLevels()`. But when I run it I am left with "Level 1", "Level 2", "Level 3" only (which is great!! (: ). However, I wanted to keep those that had specifications for now. Is there a way to do that? If not, it's totally fine! I should have made my question clearer by explaining what kind of data I have... – lalaboo Dec 07 '18 at 23:23
  • And if you have time, could you kindly explain what is happening in the `fixLevels()` function? If I can figure this part out, I'm sure I can figure out those other strings such as "Mild", "Moderate", etc. Again, thank you! – lalaboo Dec 07 '18 at 23:23
  • @lalaboo, see the update for an explanation. It looks like my function doesn't do anything with "Mild" or "Moderate" since there are no levels mentioned in any format. I can add one more version of this function to keep "with specification", but you must clearly specify the situation... Is it possible that there will be " level III with specification " and so on? – Julius Vainora Dec 07 '18 at 23:58
  • For levels "with specification" etc., the only variations are upper case "LEVEL 1 with specification" or lower case "level 1 with specification" vs "Level 1 with specification". Again, thank you so much for editing your answer, and for your help! (: – lalaboo Dec 10 '18 at 17:03
  • @lalaboo, you could fix all those cases with `gsub("^level (\\d+) with specification$", "Level \\1 with specification", tolower(x))` and then `!grepl("^Level (\\d+) with specification$", x)` would show all the remaining cases; i.e., those that need to be fixed with the function in the answer. – Julius Vainora Dec 10 '18 at 17:16
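
The two calls from the last comment can be tried on a small made-up vector. In this sketch `x`, `fixed`, and `needs_fix` are hypothetical names; the patterns are the ones given in the comment.

```r
# Made-up inputs: two "with specification" rows and one plain level
x <- c("LEVEL 1 with specification", "level 2 with specification", "Level III")

# \\1 re-inserts the digits captured by (\\d+)
fixed <- gsub("^level (\\d+) with specification$",
              "Level \\1 with specification", tolower(x))
fixed
# [1] "Level 1 with specification" "Level 2 with specification"
# [3] "level iii"

# TRUE marks the rows that still need the function from the answer
needs_fix <- !grepl("^Level (\\d+) with specification$", fixed)
needs_fix
# [1] FALSE FALSE  TRUE
```

Note that `tolower(x)` also lower-cases the rows that do not match, so those rows reach the next step in lower case.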