
I have a data frame with a column of strings representing full names. Some are complete, normal full names; others are not 'complete' or 'accurate', as indicated by the presence of non-letter characters.

Example of dataframe:

Full name
----------

Mikki Clancy
Hermsdorfer, Mark (retired)
CSP, PSECU Lan Unit (typo)
Clifton Gurlen
G�mez, Oscar Prieto
Sj�¶strand, Anders
Lisa Terry
Meloy, Wilson {old}
Gregory Stevens
Charles Gruenberg

df <- structure(list(Full_name = c("Jane Clancy",
                                       "Hermsdorfer, Mark (retired)",
                                       "CSP, PSECU Lan Unit (typo)",
                                       "Clif Gurlen",
                                       "G�mez, Oscar Prieto",
                                       "Sj�¶strand, Anders",
                                       "Liza Terry",
                                       "Meloy, Will {old}",
                                       "Garret Stevens",
                                       "Charly Ruenberg"), Group = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")), class = "data.frame", row.names = c(NA, -10L))

The ask is to subset the complete data frame to the strings that contain non-ASCII characters (for example, from the values above: '{}', '()', '&', '�').

The desired output is the column of names that contain those characters, plus the total count of such rows, so I can calculate what percentage of the complete data frame is 'not complete' or 'accurate'.

Not Complete Full name
----------------------

Hermsdorfer, Mark (retired)
CSP, PSECU Lan Unit (typo)
G�mez, Oscar Prieto
Sj�¶strand, Anders
Meloy, Wilson {old}
Gregor Thomas
Dinho
  • Can you show the `dput` of the example so that it can be tested – akrun Jul 08 '21 at 17:55
  • Do you mean non-alphanumeric characters? {, }, (, ), & are all [ASCII characters](https://theasciicode.com.ar/), not even needing to use the "extended" ASCII character set. – Gregor Thomas Jul 08 '21 at 18:00
  • @akrun I will add that shortly, thank you for reminding me. – Dinho Jul 08 '21 at 18:04
  • @GregorThomas My mistake - yes, essentially I would like to filter all the strings that contain non-alphanumeric characters but I also would like to filter out (), & since typically those are not associated to a Full Name string – Dinho Jul 08 '21 at 18:05
  • You don't need the "also" - alphanumeric means literally "letters and numbers". Punctuation is not alphanumeric. Sounds like perhaps you don't want numbers either, so you want to filter out strings that contain anything other than letters, right? – Gregor Thomas Jul 08 '21 at 18:11
  • @GregorThomas That is exactly right – Dinho Jul 08 '21 at 18:15

2 Answers


To take a broad view of letters, I've borrowed the regex from this question about matching letters.

library(dplyr)

# Flag names containing any character that is not a Unicode letter (\p{L}) or a space
df %>% mutate(
  has_non_letters = grepl("[^\\p{L} ]", names, perl = TRUE)
)
#                          names has_non_letters
# 1                 Mikki Clancy           FALSE
# 2  Hermsdorfer, Mark (retired)            TRUE
# 3   CSP, PSECU Lan Unit (typo)            TRUE
# 4               Clifton Gurlen           FALSE
# 5   G<U+FFFD>mez, Oscar Prieto            TRUE
# 6         Sj�¶strand, Anders            TRUE
# 7                   Lisa Terry           FALSE
# 8          Meloy, Wilson {old}            TRUE
# 9              Gregory Stevens           FALSE
# 10           Charles Gruenberg           FALSE

I'll leave additional summarizing to you - you can take the sum or the mean of the TRUE/FALSE values, as you prefer, to get the count or the proportion of flagged rows.
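As a minimal sketch of that summarizing step, the base R snippet below counts the flagged rows and turns the count into a percentage. The `sample_df` data frame and the names `n_flagged`, `pct_flagged`, and `flagged` are hypothetical, chosen only for illustration; the pattern is the same one used above.

```r
# Hypothetical sample data (a subset of the names in the question)
sample_df <- data.frame(names = c(
  "Mikki Clancy",
  "Hermsdorfer, Mark (retired)",
  "CSP, PSECU Lan Unit (typo)",
  "Clifton Gurlen",
  "Lisa Terry",
  "Meloy, Wilson {old}"
))

# TRUE where a name contains anything other than a letter or a space
sample_df$has_non_letters <- grepl("[^\\p{L} ]", sample_df$names, perl = TRUE)

# sum() of a logical vector counts the TRUEs; mean() gives their proportion
n_flagged   <- sum(sample_df$has_non_letters)
pct_flagged <- 100 * mean(sample_df$has_non_letters)

# Keep only the flagged names as a one-column data frame
flagged <- sample_df[sample_df$has_non_letters, "names", drop = FALSE]
```

Here `n_flagged` is the number of 'not complete' rows and `pct_flagged` is their share of all rows in `sample_df`.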


Using this data:

df = data.frame(names = c(
"Mikki Clancy",
"Hermsdorfer, Mark (retired)",
"CSP, PSECU Lan Unit (typo)",
"Clifton Gurlen",
"G�mez, Oscar Prieto",
"Sj�¶strand, Anders",
"Lisa Terry",
"Meloy, Wilson {old}",
"Gregory Stevens",
"Charles Gruenberg"
))
Gregor Thomas
  • This is great - I think I need to read up more on regex bc I think I would like to keep some names that contain ',' or a '-' since sometimes the last name is first then first name. Can I add those characters into the grep statement? – Dinho Jul 08 '21 at 18:27
  • Sure, the `"[^\\p{L} ]"` pattern matches any character except letters (`\\p{L}`) and spaces. (The `^` at the start is negation, that is, the "except" part.) If you want additional exceptions, just put them in there: `"[^\\p{L} ,-]"` will match strings that have anything that isn't a letter, space, comma, or dash. – Gregor Thomas Jul 08 '21 at 18:43
  • Brilliant! thank you - this is what I was hoping for. Easy enough then. – Dinho Jul 08 '21 at 18:46
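
For illustration, here is a small sketch of the extended pattern from the comment above applied to a few names. The vector `x` and the variable `flags` are hypothetical; the hyphen sits at the end of the character class so that it is read as a literal character rather than as part of a range.

```r
# Hypothetical test strings: the first two should pass, the last two should be flagged
x <- c(
  "Clancy, Mikki",
  "Anne-Marie Smith",
  "Meloy, Wilson {old}",
  "Hermsdorfer, Mark (retired)"
)

# Matches any character that is not a letter, space, comma, or hyphen
pattern <- "[^\\p{L} ,-]"

# TRUE where a string contains at least one disallowed character
flags <- grepl(pattern, x, perl = TRUE)
```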

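We can use `str_detect` from stringr inside `filter()` to keep the rows whose name contains any character other than ASCII letters, commas, and spaces: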

library(dplyr)
library(stringr)
df %>% 
   filter(str_detect(Full_name, "[^A-Za-z, ]+"))
                    Full_name Group
1 Hermsdorfer, Mark (retired)     b
2  CSP, PSECU Lan Unit (typo)     c
3         G�mez, Oscar Prieto     e
4        Sj�¶strand, Anders     f
5           Meloy, Will {old}     h
akrun
  • Is there any way to add this into a pipe command? I am subsetting the dataframe based on certain columns and wanted to add this in as a 'summarise()' command – Dinho Jul 08 '21 at 18:16
  • @Dinho try the update – akrun Jul 08 '21 at 18:20
  • This is another great solution, thank you! I will play around with str_detect as I would like to not filter strings that include ',' and '-' as well. – Dinho Jul 08 '21 at 18:46
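  • you can add the `-` to the character class in the regex as well; escape it (`"[^A-Za-z, \\-]"`) so that it is not read as a range between the `,` and the space – akrun Jul 08 '21 at 18:47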
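
As a hedged sketch of the two follow-ups in these comments (allowing a hyphen, and using the check inside `summarise()`), something along these lines should work. The `sample_df` data and the output column names `n_flagged` and `pct_flagged` are assumptions made for illustration. A bare hyphen placed between the comma and the space would be parsed as a range, so the hyphen is escaped here.

```r
library(dplyr)
library(stringr)

# Hypothetical sample data with the same column names as the question
sample_df <- data.frame(
  Full_name = c(
    "Jane Clancy",
    "Hermsdorfer, Mark (retired)",
    "Smith-Jones, Ann",
    "Meloy, Will {old}"
  ),
  Group = c("a", "b", "c", "d")
)

# Anything other than ASCII letters, commas, spaces, or hyphens is flagged
pattern <- "[^A-Za-z, \\-]"

# Rows whose name contains a disallowed character
flagged <- sample_df %>%
  filter(str_detect(Full_name, pattern))

# Count and percentage of flagged rows within a single pipe
summary_tbl <- sample_df %>%
  summarise(
    n_flagged   = sum(str_detect(Full_name, pattern)),
    pct_flagged = 100 * mean(str_detect(Full_name, pattern))
  )
```

Note that this ASCII-only class also flags legitimate accented letters such as `ó`; the `\\p{L}` approach in the other answer is the alternative if those should be accepted.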