I have a huge text dataset containing many kinds of junk. The dummy data below is representative of that variety. The outcome column contains the data to be cleaned, the text_group column labels the type of junk, and the clean_outcome column is the desired result.
library(tibble)

foo_df <- tibble(outcome = c(
"Dwayne Mourinho : PPP : 22/05/16 : WCY011068",
"Sarry Bedford : FamSuppr : 14/06/16 : ZAK0",
"Awanui Moutinho : FamChild : 14/02/14 : BAS007083",
"Allyson Bosere : Budgetng : 03/08/16 : XAP005407",
"Anneke Peter : PAFT : 17/12/12 : MAI005518",
"Budget for the Math",
"Parenting programs",
"WINZ",
"IRD",
"Baby First Aid Cert. - 00BC61FB",
"Stop taking drugs - 81868C49",
"Pamela Riri - LBM2925 - Breast Feeding",
"Afunbiowo Ige - GMY3480 - AOD",
"Paora Fowler - Literacy & Numeracy",
"Yang Wilson - Literacy & Numeracy",
"Samuel Bell - Literacy & Numeracy",
"COVID-19 Outcome - Johnson Buhari",
"Positive Parenting Programme (Triple P)-Kanu Babayaro",
"Goals : Wx000371",
"Mentoring : WO000372",
NA
),
text_group = c(
1,1,1,1,1,
2,2,2,2,
3,3,
4,4,
5,5,5,
6,6,
7,7,
8
),
clean_outcome = c(
"PPP",
"FamSuppr",
"FamChild",
"Budgetng",
"PAFT",
"Budget for the Math",
"Parenting programs",
"WINZ",
"IRD",
"Baby First Aid Cert.",
"Stop taking drugs",
"Breast Feeding",
"AOD",
"Literacy & Numeracy",
"Literacy & Numeracy",
"Literacy & Numeracy",
"COVID-19 Outcome",
"Positive Parenting Programme (Triple P)",
"Goals",
"Mentoring",
NA)
)
I found the contributions on stackoverflowpage1 and stackoverflowpage2 partly useful, but they still do not completely clean the data (i.e. the outcome column).
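After a thorough study of outcome, I observe the following patterns:
If the text has a date, the delimiter is a colon and the data is between the 1st and 2nd delimiters (text_group 1).
If the text has no delimiter, it is already clean (text_group 2).
If the text has one dash delimiter and no name, the data is before the delimiter (text_group 3).
If the text has one dash delimiter and a name, the data is in most cases after the delimiter (text_group 5); when the name comes last, the data is before the delimiter (text_group 6).
If the text has two dash delimiters, the data is after the second delimiter (text_group 4).
If the text has one colon delimiter, the data is before the delimiter, as in clean_outcome (text_group 7).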
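The patterns above can be sketched as a single rule-based function. This is only a minimal sketch under assumptions that are not in the data itself: the names clean_one, clean_outcome_sketch, looks_like_name and dash_delim are hypothetical, a "name" is guessed to be exactly two capitalised alphabetic words, and a dash is treated as a delimiter only when it is not sandwiched between two alphanumeric characters (so the dash in "COVID-19" is ignored). Following clean_outcome, the one-colon rows keep the text before the colon.

```r
library(stringr)

# Assumption: a person's name is exactly two capitalised alphabetic words.
looks_like_name <- function(s) str_detect(s, "^[A-Z][a-z]+ [A-Z][a-z]+$")

# Assumption: a dash is a delimiter unless it sits between two
# alphanumeric characters (e.g. the dash inside "COVID-19").
dash_delim <- "(?<![A-Za-z0-9])-|-(?![A-Za-z0-9])"

# Clean a single string according to the observed patterns.
clean_one <- function(x) {
  if (is.na(x)) return(NA_character_)
  colon_parts <- str_trim(str_split(x, ":")[[1]])
  dash_parts  <- str_trim(str_split(x, dash_delim)[[1]])
  n_colon <- length(colon_parts) - 1
  n_dash  <- length(dash_parts) - 1
  if (n_colon == 3) return(colon_parts[2])  # date rows: between 1st and 2nd colon
  if (n_colon == 1) return(colon_parts[1])  # one colon: text before it
  if (n_dash == 2)  return(dash_parts[3])   # two dashes: text after the second
  if (n_dash == 1) {
    # Name first: data follows the dash. Otherwise the data precedes it.
    if (looks_like_name(dash_parts[1])) return(dash_parts[2])
    return(dash_parts[1])
  }
  x                                         # no delimiter: keep as is
}

# Apply the scalar helper to every element of a character vector.
clean_outcome_sketch <- function(v) {
  vapply(v, clean_one, character(1), USE.NAMES = FALSE)
}

clean_outcome_sketch(c(
  "Paora Fowler - Literacy & Numeracy",
  "COVID-19 Outcome - Johnson Buhari",
  "Goals : Wx000371"
))
```

On the dummy data, the result can be compared with the target via identical(clean_outcome_sketch(foo_df$outcome), foo_df$clean_outcome). The name heuristic is the fragile part: a two-word programme title such as "Parenting Programs" in front of a dash would be mistaken for a name, so real data may need a lookup of known names or programmes instead.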
This function:
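library(stringr)
# For a single non-NA string with three colons, return the field between the 1st and 2nd colon
colon_function <- function(x) {
  if (str_count(x, ":") == 3) {
    trimws(str_split(x, ":")[[1]][[2]])
  } else x
}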
cleans the data when the text has a date, the delimiter is a colon, and the data is between the 1st and 2nd delimiters (i.e. foo_df$text_group == 1).
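One caveat when trying this on the whole column: colon_function tests its input with if(), so it expects one non-missing string at a time. An NA makes the condition NA, which if() rejects, and recent versions of R also reject a condition of length greater than one. A minimal sketch of an element-wise, NA-safe application follows; the wrapper name colon_each is hypothetical, and the function from the question is repeated only so the snippet runs on its own.

```r
library(stringr)

# Function from the question, repeated so the snippet is self-contained.
colon_function <- function(x) {
  if (str_count(x, ":") == 3) {
    trimws(str_split(x, ":")[[1]][[2]])
  } else x
}

# Hypothetical wrapper: call colon_function on each element and
# pass missing values through untouched.
colon_each <- function(v) {
  vapply(
    v,
    function(s) if (is.na(s)) NA_character_ else colon_function(s),
    character(1),
    USE.NAMES = FALSE
  )
}

colon_each(c("Dwayne Mourinho : PPP : 22/05/16 : WCY011068", "WINZ", NA))
```

With this wrapper the call foo_df$outcome |> colon_each() would leave every row outside text_group 1 unchanged, which makes it easy to see which rows still need the other rules.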
I need help adding the other conditions in R code, so that I get a result like the one in clean_outcome.
Thanks.