
My raw data has 3 columns; one of them is called First_Name. The First_Name column has actual first names such as Prabhat and Tony in it, but also a lot of invalid strings, i.e., strings that do not represent actual first names, such as email addresses like Prabhat@gmail.com or strings with numbers and special characters like aaa261. What I want to do is keep only the valid First_Name strings.

Here are the steps I am taking:

1st step:

c <- read.csv("Test_Data.csv", TRUE, ",")

2nd step:

First_Name <- pull(c, firstname) # pulling First_Name column from Raw Data. 

3rd step:

df[] <- lapply(df[], as.character)

4th step:

df$new <- ifelse(grepl("[^A-z]", df$First_Name), "NA", df$First_Name)

But it's not working and giving me an error:

Error in `$<-.data.frame`(`*tmp*`, new, value = logical(0)) : replacement has 0 rows, data has 50000
Chris Ruehlemann
  • Give a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), and show what you have tried. See [this post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) also. – R. Schifini Mar 20 '20 at 15:29
  • Hi @Prabhat Passi. I've edited your question. Is the edit okay, i.e., does it make your question clearer? – Chris Ruehlemann Mar 24 '20 at 08:05
  • You need to show what's in `c`. Apparently, you have in `c` a column entitled `firstname`--correct? If so, why do you say you have a column `First_Name` in your raw data--that's misleading. Second, I'm not familiar with the function `pull`--does it exist? Third, where on earth does dataframe `df` come from? If it's from an answer by @Chris, that nomenclature is just for a mock dataframe. Your dataframe will obviously have a different name, so you need to use *that* name! And if you don't have a dataframe called `df`, it is **inevitable** that step #4 throws an error! – Chris Ruehlemann Mar 24 '20 at 08:11
  • Here's what you need to do: show us **exactly** what `c` looks like--not the whole dataframe, just the first 5 rows or so, including the column names and the data in these first rows. Then people might be able to help you better. – Chris Ruehlemann Mar 24 '20 at 08:13

1 Answer


EDIT

Not quite sure what you want. Here are two solutions:

DATA:

df <- data.frame(
  First_Name = c("Prabhat", "Ray", "ben", "Tony", "Prabhat@gmail.com", "aaa261", "aa?w", "123asd", "Bruce", "Aston", "Passi@yahoo.com"))

df
          First_Name
1            Prabhat
2                Ray
3                ben
4               Tony
5  Prabhat@gmail.com
6             aaa261
7               aa?w
8             123asd
9              Bruce
10             Aston
11   Passi@yahoo.com

Convert to character:

df[] <- lapply(df[], as.character)

First SOLUTION:

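In this solution, you create a new column in which the first names are kept and the non-names are replaced by NA. The replacement is done with ifelse and grepl, using the pattern [^A-Za-z], which matches any string containing at least one character that is not a letter (a range written as A-z would also cover the punctuation characters that sit between Z and a in ASCII, such as the backslash and the underscore):

df$new <- ifelse(grepl("[^A-Za-z]", df$First_Name), "NA", df$First_Name)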

RESULT:

df
          First_Name     new
1            Prabhat Prabhat
2                Ray     Ray
3                ben     ben
4               Tony    Tony
5  Prabhat@gmail.com      NA
6             aaa261      NA
7               aa?w      NA
8             123asd      NA
9              Bruce   Bruce
10             Aston   Aston
11   Passi@yahoo.com      NA
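
One thing to be aware of: "NA" in quotes is an ordinary string, not a missing value, so is.na() will not detect it. Here is a minimal sketch of a variant that stores real missing values and uses a whitelist pattern (the whole string must consist of ASCII letters). The names is_valid and valid_rows are mine, and the shortened mock data is only for illustration:

```r
# Mock data in the same shape as the example above (shortened)
df <- data.frame(
  First_Name = c("Prabhat", "Ray", "Prabhat@gmail.com", "aaa261", "aa?w", "Bruce"),
  stringsAsFactors = FALSE
)

# TRUE only when the entire string consists of ASCII letters
is_valid <- grepl("^[A-Za-z]+$", df$First_Name)

# Store a real missing value (NA_character_) instead of the string "NA"
df$new <- ifelse(is_valid, df$First_Name, NA_character_)

# Keep only the rows that hold a valid name
valid_rows <- df[!is.na(df$new), , drop = FALSE]
valid_rows$new
```

With real NAs, the usual tools such as is.na() and na.omit() work on the new column. Note that this pattern also rejects legitimate names containing hyphens, apostrophes, spaces or accented letters, so it may need widening for real data.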

Second SOLUTION:

If you are just interested in 'fetching', as you say, the first names, which suggests you may want to collect them in a vector, then this can be done as follows:

grep("[^A-Za-z]", as.character(unlist(df$First_Name)), value = TRUE, invert = TRUE)

RESULT:

[1] "Prabhat" "Ray"     "ben"     "Tony"    "Bruce"   "Aston" 
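
Regarding the error in the question: my best guess, without having seen the real data, is that the data frame has no column named First_Name (the pull() call suggests the column is called firstname). In that case `$` returns NULL, grepl() and ifelse() return zero-length vectors, and the assignment fails with exactly this kind of message. A small sketch that reproduces it on a hypothetical stand-in for the CSV (the data frame raw and its columns id and city are made up):

```r
# Hypothetical stand-in for read.csv("Test_Data.csv"): the name column is
# assumed to be called `firstname`, as in the pull() call from the question
raw <- data.frame(
  id        = 1:4,
  firstname = c("Prabhat", "Tony", "Prabhat@gmail.com", "aaa261"),
  city      = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# `$` with a column name that does not exist returns NULL
is.null(raw$First_Name)

# grepl() on NULL gives logical(0), ifelse() passes that through, and
# assigning a zero-length vector as a new column raises the error
res <- tryCatch(
  {
    raw$new <- ifelse(grepl("[^A-Za-z]", raw$First_Name), NA_character_, raw$First_Name)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
res

# Check the real column names, then use the one that exists
names(raw)
raw$new <- ifelse(grepl("[^A-Za-z]", raw$firstname), NA_character_, raw$firstname)
raw$new
```

So before running either solution, check names() on the object returned by read.csv() and use that object and its actual column name in place of df and First_Name. Also note that pull() (from dplyr) returns a plain vector, not a data frame, so there is nothing to index with `$` afterwards.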

Hope one of these tips is helpful to you.

Chris Ruehlemann
  • I have a column called "First Name", and that column has many rows. It contains email IDs, special characters, and numbers. I want to fetch only the valid names (e.g. Ray, Bruce, Prabhat). Here is a sample of the raw data: First_Name Prabhat Ray ben Tony Prabhat@gmail.com aaa261 aa?w\ 123asd Bruce Aston Passi@yahoo.com – Prabhat Passi Mar 21 '20 at 16:52
  • Hi @PrabhatPassi I've adapted the two solutions to the data you sent. I had to remove the backslash manually, though. – Chris Ruehlemann Mar 21 '20 at 17:13
  • Hi, I tried, but both solutions give me an error. This is what I am doing: `c <- read.csv("Test_Data.csv",TRUE,",")`, `First_Name <- pull(c,firstname)`, `df[] <- lapply(df[], as.character)`, `df$new <- ifelse(grepl("[^A-z]", df$First_Name), "NA", df$First_Name)`, `dd <- grep("[^A-z]", as.character(unlist(df$First_Name)), value = T, invert = T)`. – Prabhat Passi Mar 23 '20 at 15:33
  • Which error message do you get exactly? And what does your data **REALLY** look like? It would be best if you edited your question to include as much of your actual data as possible! – Chris Ruehlemann Mar 23 '20 at 15:45
  • Error in `$<-.data.frame`(`*tmp*`, new, value = logical(0)) : replacement has 0 rows, data has 50000. This is the error message that I'm getting. – Prabhat Passi Mar 23 '20 at 19:49
  • My raw data has 3 columns. I am pulling the First_Name column from it and then applying conditions to that column to filter out the valid first names. Step by step: 1st step: `c <- read.csv("Test_Data.csv",TRUE,",")`. 2nd step: `First_Name <- pull(c,firstname)` (pulling the First_Name column from the raw data). 3rd step: `df[] <- lapply(df[], as.character)`. 4th step: `df$new <- ifelse(grepl("[^A-z]", df$First_Name), "NA", df$First_Name)`. – Prabhat Passi Mar 23 '20 at 19:49
  • Can you please edit your question to include this and potentially other relevant information? I'd love to help you solve your issue but I need clarity for that. – Chris Ruehlemann Mar 23 '20 at 20:38
  • Hi Chris, I have made the necessary changes to my question. – Prabhat Passi Mar 23 '20 at 22:02