R: Excluding diversity of junks from text data

Question

I have a large text dataset that contains several kinds of junk. The dummy data below is representative of that variety. The outcome column contains the text to be cleaned, the text_group column labels the type of junk, and clean_outcome is the desired result.

library(tibble)

foo_df <- tibble(outcome = c(
  "Dwayne Mourinho : PPP : 22/05/16 : WCY011068",
  "Sarry Bedford : FamSuppr : 14/06/16 : ZAK0",
  "Awanui Moutinho : FamChild : 14/02/14 : BAS007083",
  "Allyson Bosere : Budgetng : 03/08/16 : XAP005407",
  "Anneke Peter : PAFT : 17/12/12 : MAI005518",
  "Budget for the Math",
  "Parenting programs",
  "WINZ",
  "IRD",
  "Baby First Aid Cert. - 00BC61FB",
  "Stop taking drugs - 81868C49",
  "Pamela Riri - LBM2925 - Breast Feeding",
  "Afunbiowo Ige - GMY3480 - AOD",
  "Paora Fowler - Literacy & Numeracy",
  "Yang Wilson - Literacy & Numeracy",
  "Samuel Bell - Literacy & Numeracy",
  "COVID-19 Outcome - Johnson Buhari",
  "Positive Parenting Programme (Triple P)-Kanu Babayaro",
  "Goals : Wx000371",
  "Mentoring : WO000372",
  NA
),
text_group = c(
  1,1,1,1,1,
  2,2,2,2,
  3,3,
  4,4,
  5,5,5,
  6,6,
  7,7,
  8
),
clean_outcome = c(
  "PPP",
  "FamSuppr",
  "FamChild",
  "Budgetng",
  "PAFT",
  "Budget for the Math",
  "Parenting programs",
  "WINZ",
  "IRD",
  "Baby First Aid Cert.",
  "Stop taking drugs",
  "Breast Feeding",
  "AOD",
  "Literacy & Numeracy",
  "Literacy & Numeracy",
  "Literacy & Numeracy",
  "COVID-19 Outcome",
  "Positive Parenting Programme (Triple P)",
  "Goals",
  "Mentoring",
  NA)
)

I found the answers on stackoverflowpage1 and stackoverflowpage2 partly useful, but they still do not completely clean the data (i.e. outcome).

After studying outcome closely, I observed the following patterns:

If the text contains a date, the delimiter is a colon and the data is between the 1st and 2nd delimiter.

If the text has one dash delimiter and no name, the data is before the delimiter.

In most cases, if the text has one dash delimiter and a name, the data is after the delimiter.

If the text has two dash delimiters, the data is after the second delimiter.

If the text has one colon delimiter, the data is before the delimiter (e.g. "Goals : Wx000371" becomes "Goals").
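The patterns listed above can be sketched as a single rule-based cleaner. This is only a minimal sketch, assuming stringr is available; the names dash_delim, name_like, clean_one and clean_outcome_sketch are hypothetical, and two heuristics are assumptions that do not come from the post: a dash counts as a delimiter only when it is surrounded by whitespace or directly follows ")", and a "name" is exactly two capitalised alphabetic words. For the single-colon case it keeps the text before the colon, which is what clean_outcome shows for "Goals : Wx000371".

```r
library(stringr)

# Assumption: a dash is a delimiter only when it has whitespace on both sides
# or directly follows ")", so the hyphen in "COVID-19" is left alone.
dash_delim <- "\\s+-\\s+|(?<=\\))-"
# Assumption: a person name is exactly two capitalised alphabetic words.
name_like <- "^[A-Z][a-z]+ [A-Z][a-z]+$"

# Clean a single string according to the delimiter rules.
clean_one <- function(x) {
  if (is.na(x)) return(NA_character_)
  colon_parts <- str_trim(str_split(x, ":")[[1]])
  dash_parts  <- str_trim(str_split(x, dash_delim)[[1]])
  n_colon <- length(colon_parts) - 1
  n_dash  <- length(dash_parts) - 1
  if (n_colon == 3) return(colon_parts[2])  # name : DATA : date : code
  if (n_colon == 1) return(colon_parts[1])  # DATA : code
  if (n_dash == 2)  return(dash_parts[3])   # name - code - DATA
  if (n_dash == 1) {
    # Keep the side that is not a name; default to the text before the dash.
    if (str_detect(dash_parts[1], name_like)) return(dash_parts[2])
    return(dash_parts[1])
  }
  x  # no delimiter: leave the text unchanged
}

# Apply the scalar cleaner to every element of a character vector.
clean_outcome_sketch <- function(x) {
  vapply(x, clean_one, character(1), USE.NAMES = FALSE)
}
```

With the dummy data, `clean_outcome_sketch(foo_df$outcome)` could be compared against `foo_df$clean_outcome`. The name heuristic is fragile: a real outcome made of two capitalised words in front of a single dash would be mistaken for a name, so a list of known names or known outcomes would be safer on the full dataset.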

This function:

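library(stringr)
# Not vectorized: x must be a single string; NA is returned unchanged.
colon_function <- function(x) {
  if (!is.na(x) && str_count(x, ":") == 3) {
    trimws(str_split(x, ":")[[1]][[2]])
  } else x
}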

cleans the data when the text contains a date, the delimiter is a colon, and the data sits between the 1st and 2nd delimiter (i.e. foo_df$text_group == 1).
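Because colon_function relies on a scalar `if`, it handles one string at a time. A vectorized variant might look like the sketch below, assuming stringr is available; the name colon_vec is hypothetical and not part of the original code.

```r
library(stringr)

# Hypothetical vectorized counterpart of colon_function.
colon_vec <- function(x) {
  # TRUE only for non-missing strings with exactly three colons.
  has_three <- !is.na(x) & str_count(x, ":") == 3
  # str_split_fixed returns a character matrix; column 2 holds the text
  # between the 1st and 2nd colon.
  x[has_three] <- str_trim(str_split_fixed(x[has_three], ":", 4)[, 2])
  x
}

colon_vec(c("Dwayne Mourinho : PPP : 22/05/16 : WCY011068", "WINZ", NA))
# "PPP" "WINZ" NA
```

Strings without exactly three colons, including missing values, pass through unchanged, so the function can be applied to the whole outcome column in one call.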

I need help including the other conditions in R code, so that I get a result like the one in clean_outcome.

Thanks.

Thanks @Elin, the expected clean outcome is already in the `clean_outcome` column of the dummy data frame `foo_df` provided in the question. The text to be cleaned is in the `outcome` column. — William, Jun 16 '21 at 00:06
So you will never have any other values? In that case it will be simpler to search the strings for the fixed lists. Also, if you are 100% sure that the patterns you have shown will always be followed, you can work out logically the order in which to do the cleaning. — Elin, Jun 16 '21 at 11:16
