Identify which of two vectors is numeric and which is strings in R (but more generally for other platforms as well)

Question

I need to write a function that identifies which of two vectors it receives is (the most likely to be) the numeric vector and which is (most likely to be) the character vector.

The two vectors might look something like this:

vec1 <- c("2", "3", "14", "7")
vec2 <- c("Arctic tern", "Blue tit", "bald eagle", "Cassowary")

But this is intended for use by people who are not necessarily computer literate so it may get the odd...

vec1 <- c("2", "3", "fourteen", "7")

...instead, so it has to be flexible.

The text could be full sentences or single characters, and it may have numeric digits mixed in too, like "2for1" or "world war 2", so this must be accounted for. That's why I'm looking for a function that picks what it thinks is the "most likely" numeric vector of the two.

Any ideas? I think the "Levenshtein distance" might be helpful but it's hard to say how. I'm working specifically in R but a general purpose algorithm / solution would be fine.

EDIT: The solution posed does not answer the question. Of course I am familiar with basic data formatting. The issue here is that there are two vectors and I need an algorithm (however rough) that will guess which is more likely to be the numeric of the two. But the data that goes into it could be quite messy and might not nicely fall into the bounds of a numeric vector and setting both vectors to "strings" is not an acceptable outcome. Please re-open my question.
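On the Levenshtein idea mentioned above, here is a minimal sketch of one way it could be used: flag entries that are within a small edit distance of a known number word, so that misspellings such as "fourten" still count as evidence for the numeric vector. It relies on `utils::adist()`, which computes generalized Levenshtein distances in base R. The names `number_words` and `near_number_word` and the threshold `max_dist` are assumptions made for illustration, not part of the original question.

```r
# Hypothetical lookup of number words; extend it as needed.
number_words <- c("one", "two", "three", "four", "five", "six", "seven",
                  "eight", "nine", "ten", "eleven", "twelve", "thirteen",
                  "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
                  "nineteen", "twenty")

# TRUE where an element is within `max_dist` edits of some number word.
# `max_dist` is an assumed tuning parameter, not a recommended value.
near_number_word <- function(x, max_dist = 1L) {
  # adist() returns a length(x) by length(number_words) distance matrix
  d <- utils::adist(tolower(x), number_words)
  apply(d, 1L, min) <= max_dist
}

near_number_word(c("fourten", "sevn", "giraffe"))
```

A fuzzy match like this will also accept short ordinary words that happen to sit one edit away from a number word (for example "tin" and "ten"), so it is probably best treated as weak evidence rather than a decision on its own.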

So what's the expected value for a vector like this: `vec3 <- c("word23 in 22 is like 234", "22", "word")` — Andre Wildberg, Apr 25 '23 at 12:54
@ismirhregal I'd argue the question is a little bit wider than just a `type.convert` answer. In the case of `c("two", "three", "fourteen", "seven")` for instance, this should be more likely than `c("hello", "three", "2", "seven")` — Maël, Apr 25 '23 at 12:54
there are packages to convert alphabetic to numeric numbers: https://stackoverflow.com/a/71108381/20513099 , which might be worth a try upstream of `type.convert` — I_O, Apr 25 '23 at 13:02
In the case of somebody writing "fourteen" then this entry in the vector will just be deleted. There's no need to convert it to a number. The point is that I won't just be able to use "as.numeric" and make sure it works. — Leonhard Euler, Apr 25 '23 at 13:03
@AndreWildberg it depends what the other vector is. Probably for the context that I am thinking of, your vec3 will be the more likely character vector. — Leonhard Euler, Apr 25 '23 at 13:04
Convert to numeric and see which has the fewest missing values. Use `as.numeric()` or `readr::parse_number()` which is a little more flexible allowing for commas and currency signs, etc. If you're worried about spelled out numbers then make a table of the spelled out numbers up to 100 or 1000 and use `gsub` to replace them before attempting a numeric conversion. — Gregor Thomas, Apr 25 '23 at 13:10
@LeonhardEuler OK, but to make the vectors comparable, each needs a value assigned by some metric. Once you have a metric based on solid rules, you win. I guess once you've decided which rules apply, you'll get good answers. Until then it's mostly wild guessing. Also, how should multiple numbers in a string be treated? — Andre Wildberg, Apr 25 '23 at 13:12
@AndreWildberg If I knew what metric I needed then I would write the function myself Andre. For multiple numbers "23" would be the same as 23. "2 3" is better than "two three" which is better than "giraffe" but will ultimately be deleted from the final dataset. It would just help to suggest that it is in the numeric vector. — Leonhard Euler, Apr 25 '23 at 13:34

score 3 · Accepted Answer · answered Apr 25 '23 at 13:09

Something like this:

library(english)

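foo <- function(...) {
  stopifnot("input vectors must have identical lengths" =
              length(unique(lengths(list(...)))) == 1L)
  # Lookup table from spelled-out numbers ("one" to "one hundred") to integers
  numwords <- setNames(1:100, english(1:100))
  # Replace elements that are spelled-out numbers with their numeric value
  nums <- lapply(list(...),
                 function(x) ifelse(unname(is.na(numwords[x])),
                                    x,
                                    numwords[x]))
  # Anything that is not a number becomes NA; the coercion warnings are expected
  nums <- suppressWarnings(lapply(nums, as.numeric))
  # Index of the vector with the fewest non-numeric elements
  which.min(vapply(nums, \(x) sum(is.na(x)), integer(1)))
}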

vec1 <- c("2", "3", "14", "7")
vec2 <- c("Arctic tern", "Blue tit", "bald eagle", "Cassowary")
foo(vec1, vec2)
#[1] 1

vec3 <- c("apple", "orange", "three", "moon")
foo(vec2, vec3)
#[1] 2

foo(vec1, vec2, vec3)
#[1] 1
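Note that `which.min()` returns the first index when several vectors share the same number of `NA`s, so a tie is silently resolved in favour of the earlier argument. The following is a minimal sketch of a variant that scores each vector by the fraction of parseable elements and reports a tie explicitly. The helper names `numeric_fraction` and `most_numeric` are hypothetical, and spelled-out numbers are left out to keep the sketch free of package dependencies.

```r
# Hypothetical helper: fraction of elements that as.numeric() can parse.
numeric_fraction <- function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  mean(!is.na(parsed))
}

# Index of the most numeric-looking vector, or NA when the top score is tied.
most_numeric <- function(...) {
  scores <- vapply(list(...), numeric_fraction, numeric(1))
  best <- which(scores == max(scores))
  if (length(best) > 1L) NA_integer_ else best
}

most_numeric(c("2", "3", "fourteen", "7"),
             c("Arctic tern", "Blue tit", "bald eagle", "Cassowary"))
most_numeric(c("apple", "moon"), c("tern", "eagle"))  # tie: neither parses
```

Returning `NA` on a tie lets the caller decide what to do, for instance by falling back to a looser check such as counting elements that contain any digit.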

Maël · Answer 2 · 2023-04-25T13:13:48.913

A first step could be this. Basically, the function outputs the number of elements in your input vector that contain digits, including written-out numbers. To construct the `number` vector, you can use a built-in function (as seen here)

number <- setNames(as.character(0:20), c("zero", "one", "two", "three", "\\bfour\\b", "five", "\\bsix\\b", "\\bseven\\b",
                                         "\\beight\\b", "\\bnine\\b", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty"))

f <- 
  function(x){
    sapply(as.list(x), function(x){
      x <- stringr::str_replace_all(x, number)
      x <- as.numeric(gsub("\\D", "", x))
      complete.cases(x)
    }) |> sum()
  }

vec1 <- c("2", "three gloves", "fourteen", "7")
vec2 <- c("Arctic tern", "Blue tit", "bald eagle", "Cassowary")
f(vec1)
#[1] 4

And then, comparing them:

f_compare <- function(v1, v2) which.max(c(f(v1), f(v2)))
f_compare(vec1, vec2)
#[1] 1
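One caveat with the lookup above: patterns without `\\b` anchors, such as "one" and "ten", also match inside longer words, so "Stonechat" or "Kitten" would be counted as containing a number. The following is a minimal base R sketch of a stricter variant that only accepts whole-word matches. The names `number_words` and `count_numeric_like` are hypothetical, and unlike the code above it lowercases the input first. Elements containing any digit (for example "2for1") are still counted.

```r
# Hypothetical lookup of whole number words, "zero" to "twenty".
number_words <- c("zero", "one", "two", "three", "four", "five", "six",
                  "seven", "eight", "nine", "ten", "eleven", "twelve",
                  "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
                  "eighteen", "nineteen", "twenty")

# Counts elements that contain a digit or a whole number word.
count_numeric_like <- function(x) {
  # Split each element into lowercase alphanumeric tokens
  tokens <- strsplit(tolower(x), "[^a-z0-9]+")
  has_number <- vapply(tokens, function(tok) {
    any(grepl("[0-9]", tok)) || any(tok %in% number_words)
  }, logical(1))
  sum(has_number)
}

count_numeric_like(c("2", "three gloves", "fourteen", "7"))
count_numeric_like(c("Stonechat", "Kitten", "bald eagle"))
```

The first call counts all four elements and the second counts none, because "stonechat" and "kitten" are compared as whole tokens rather than searched for substrings.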
