I need help; I'm new to R and to programming in general. I have a large dataset and want to find any inconsistencies in the formatting of a specific column, whose data type is 'chr'. Consistent values should look like 'I-XXXXXXXXX', where each X is a random digit.

I tried this: length(df[!grepl('I', df$column1),]) but it didn't work.

  • Could you please provide a small section of your data frame? What doesn't work? See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for how to create a reproducible example that will make it easier to get help. – fmic_ Jun 06 '23 at 18:53
  • I'd suggest `grepl('^I-[0-9]+$', df$column1)` if the pattern you describe is supposed to be the entire string. And if there is supposed to be a consistent number of digits, then use an exact quantifier, like `[0-9]{9}` for 9 of them. – Gregor Thomas Jun 06 '23 at 19:01

1 Answer

We can use regular expressions (regex) to check if the values in a column conform to your specifications. The regular expression I use below can be explained as follows:

  • ^I- checks if the value starts with "I-"; the circumflex (^) character stands for the beginning of the line.
  • After the prefix we expect a digit, [0-9], repeated exactly 9 times: {9}.
  • To make sure that no additional characters follow the 9 digits, we add the end-of-line anchor $.
df1 <- data.frame(column1 = c("I-123456789", "P-888888888", "Q"))

# Tidyverse
library(tidyverse)
df1 |> 
  mutate(check = str_detect(column1, "^I-[0-9]{9}$")) 
#>       column1 check
#> 1 I-123456789  TRUE
#> 2 P-888888888 FALSE
#> 3           Q FALSE

# Base R
df1$check <- grepl("^I-[0-9]{9}$", df1$column1)
df1
#>       column1 check
#> 1 I-123456789  TRUE
#> 2 P-888888888 FALSE
#> 3           Q FALSE
Till