1

I have data in excel which I am trying to analyze in R. How do I split the column in my dataframe which contains information in key-value pair into multiple columns.

for example:

Id        col1
1         Name: Joe Description: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua UID: 23456 Location:NY CTN: ************************************
2         CustName: Kim Putok Ctn: ************************************ Location: CA
3         Customer Name: GLORIA ;      CTN: ************************************      Affected CTN: ************************************ Location: Egypt
4         Customer Name: ; ;Leahi ;  Reason for the Call: ; ; ; Location: CA discription: xyz
5         auto pop / suspended line -cx speaking spanish -transfered call


Please note here data in col1 is not in specific format and it contains lot of delimiters, data in the next line in same cell, data not in key-value pair (such type of data needs to be removed).

How do I tackle or clean such kind of data? How do I split such data in key-value pair across multiple columns? Any suggestion will be great help.

--EDIT-- My original data

> dput(head(asummy, 10))
structure(list(`arrange_df[, 12]` = c("Customer Name: JOE\r\nReason for the Call: BP Set Up\r\nResolution: \r\nvisual audit on the account\r\nset expectations in BP\r\noffered call back\r\nCenter Location: Clark\r\nCTN (Number Calling About): ************************************\r\nAVAYA (Number Calling From): ************************************\r\nAgent UID: cm***************************u\r\n", 
"Name: Kim Putok\r\nCtn: ************************************\r\nReason for calling: lost phone/ ************************************/ follow up on insurance claim/ routed to asurion\r\nResolution:\r\nLocation: Clark\r\nattuid: gb***************************r\r\n", 
"Customer Name: Heather \r\nReason for the Call:    got suspended / card issue / supposedly paid / complains she just updated online her card then got susp\r\nResolution:  \r\nCenter Location: Clark\r\nCTN:  ************************************\r\nAlt No:  ************************************\r\nCredit Reason:  ********* courtesy \r\nAgent UID: rc***************************w/I-EQR******************G\r\n", 
"Customer Name: GLORIA ;      CTN: ************************************      Affected CTN: ************************************      Alternate Number: none      Reason for the Call: change rate plan      Resolution: set exp on changing rate plan      Agent UID: gt***************************g", 
"Customer Name: BRANDY \r\nReason for the Call: payment\r\nRecommendations/Troubleshooting Steps:\r\nResolution: explained card errors\r\nasked for alt # but no good\r\noffered to use different card or refill card\r\nsent qp link\r\nshe will go to bank to check as per cx\r\nCenter Location:\r\nCTN: ************************************\r\nCredit Reason (If credit was applied to account):\r\nAgent UID: jq***************************n\r\n", 
"Customer Name: ; ;Leahi ;  Reason for the Call: ; ; ;billing issue / ******************.****************** dollars only orig. bill / due date shld be *********th / why ****************** now? / ctn change questions ;  Resolution: ; ;explained bill / ctn change info provided ;  Center Location: ;Clark  CTN: ; ;************************************  Alt No:  Credit Reason:  Agent UID: rc***************************w/I-K*********GMDR", 
"Customer Name: MICHAEL\r\nReason for the Call: ACCOUNT BAL INQ\r\n\r\nRecommendations/Troubleshooting Steps:\r\n*Account verified and provided information\r\n\r\nResolution: \r\n*calling about the account status\r\n*suspended due to BBP\r\n*adv about BP policy\r\n*Provided payment options\r\n\r\nCenter Location:Clark\r\nCTN************************************:\r\nCredit Reason (If credit was applied to account):\r\nAgent UID:rc*********\r\n", 
"Customer Name: Kimberly\r\nCTN: ************************************\r\nAffected CTN:************************************\r\nAlternate Number: none\r\nReason for the Call: add mhs otc $******************\r\nResolution: added mhs otc  $******************\r\nAgent UID: gt***************************g\r\n", 
"auto pop / suspended line\r\n-cx speaking spanish\r\n-transfered call\r\n", 
"Customer Name: TERRELL ;  Reason for the Call: payment  Recommendations/Troubleshooting Steps:  Resolution: payment success  test and validated  CTN: ************************************  Credit Reason (If credit was applied to account):  Agent UID: jq***************************n"
)), row.names = c(NA, 10L), class = "data.frame")
Julien
  • 1,613
  • 1
  • 10
  • 26
Rich
  • 101
  • 7
  • Please could you produce a minimal, [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) (probably using `dput()`)? This will help others to solve your problem. Which data do you want to separate? Are the delimiters all the same, or do they differ? – Captain Hat Jul 26 '22 at 08:48
  • 2
    With such messy data it‘s usually not possible to come up with a straightforward splitting scheme. You best bet is probably to split up the column by colon and semicolon or based on a regex pattern and then check if you still need to juggle around some values. – deschen Jul 26 '22 at 08:50
  • @CaptainHat - Delimiters are not same. I have edited the question and added `dput()`. I need all the data from the column which are in the key-value pair in such a way that each key-value pair is splitted across multiple column. If any of the element is not in key-value pair then that needs to be ignored or removed. – Rich Jul 26 '22 at 08:57
  • @deschen - Thanks for your input. I tried to split the column based on delimiter and regex but nothing works for me. I have edited the question. please check, any suggestion will be great help. – Rich Jul 26 '22 at 08:58
  • There are two separate issues here. First is getting your data into some sort of structure where a column contains a key-value pair and only that pair. Second is splitting those pairs into separate columns. You can do the latter easily using `tidyr::separate()`. The former is going to be a lot of manual idiosyncratic cleaning. – socialscientist Jul 26 '22 at 09:29
  • @socialscientist - My dataset is very large more than 4000 record, it is impossible to clean data manually. – Rich Jul 26 '22 at 11:43
  • Manual = writing ad-hoc code unlikely to apply to any other data set. Unless, of course, there is more structure to your data than you have communicated - in which case it should be straightforward! – socialscientist Jul 26 '22 at 11:56

2 Answers2

1

The first step is to find a regex that matches the "separators" of your text. For example ': ' seems to be one of them. Here is a possible regex, but it does not work fully

library(tidyverse)
df = tribble(~ Id, ~ col1,
        "1", "Name: Joe Description: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua UID: 23456 Location:NY CTN: ************************************",
        "2", "CustName: Kim Putok Ctn: ************************************ Location: CA",
        "3", "Customer Name: GLORIA ;      CTN: ************************************      Affected CTN: ************************************ Location: Egypt",
        "4", "Customer Name: ; ;Leahi ;  Reason for the Call: ; ; ; Location: CA discription: xyz",
        "5", "auto pop / suspended line -cx speaking spanish -transfered call")

regex <- ': |;' # TO BE MODIFIED TO SATISFY THE PROBLEM
lapply(1:nrow(df), \(i) strsplit(df$col1[i], regex))

To find the right regex, use this tool : https://www.autoregex.xyz/

Once the good regex has been found, one of these functions may be useful : strsplit, extract, separate (tidyr package)

It's not over yet, but it's the beginning of a solution

Julien
  • 1,613
  • 1
  • 10
  • 26
1

My advice: rather than treat this dataset as a given, find out more about how it was generated. For instance, the sample you provided seems to have two types of row, one where the name:value pairs are separated by \r\n (these are easy to parse), and one where they are not (these are a lot harder to parse). I suspect there is another column in your dataset that distinguishes these, but you did not provide that. For instance, I notice you provided column 12 of arrange_df. What's in the other columns?

For the first type, in rows 1,2,3,5,7, and 8,

library(data.table)
library(stringi)
#
df <- structure(list(`x` = c("Customer Name: JOE\r\nReason for the Call: BP Set Up\r\nResolution: \r\nvisual audit on the account\r\nset expectations in BP\r\noffered call back\r\nCenter Location: Clark\r\nCTN (Number Calling About): ************************************\r\nAVAYA (Number Calling From): ************************************\r\nAgent UID: cm***************************u\r\n", 
                                            "Name: Kim Putok\r\nCtn: ************************************\r\nReason for calling: lost phone/ ************************************/ follow up on insurance claim/ routed to asurion\r\nResolution:\r\nLocation: Clark\r\nattuid: gb***************************r\r\n", 
                                            "Customer Name: Heather \r\nReason for the Call:    got suspended / card issue / supposedly paid / complains she just updated online her card then got susp\r\nResolution:  \r\nCenter Location: Clark\r\nCTN:  ************************************\r\nAlt No:  ************************************\r\nCredit Reason:  ********* courtesy \r\nAgent UID: rc***************************w/I-EQR******************G\r\n", 
                                            "Customer Name: GLORIA ;      CTN: ************************************      Affected CTN: ************************************      Alternate Number: none      Reason for the Call: change rate plan      Resolution: set exp on changing rate plan      Agent UID: gt***************************g", 
                                            "Customer Name: BRANDY \r\nReason for the Call: payment\r\nRecommendations/Troubleshooting Steps:\r\nResolution: explained card errors\r\nasked for alt # but no good\r\noffered to use different card or refill card\r\nsent qp link\r\nshe will go to bank to check as per cx\r\nCenter Location:\r\nCTN: ************************************\r\nCredit Reason (If credit was applied to account):\r\nAgent UID: jq***************************n\r\n", 
                                            "Customer Name: ; ;Leahi ;  Reason for the Call: ; ; ;billing issue / ******************.****************** dollars only orig. bill / due date shld be *********th / why ****************** now? / ctn change questions ;  Resolution: ; ;explained bill / ctn change info provided ;  Center Location: ;Clark  CTN: ; ;************************************  Alt No:  Credit Reason:  Agent UID: rc***************************w/I-K*********GMDR", 
                                            "Customer Name: MICHAEL\r\nReason for the Call: ACCOUNT BAL INQ\r\n\r\nRecommendations/Troubleshooting Steps:\r\n*Account verified and provided information\r\n\r\nResolution: \r\n*calling about the account status\r\n*suspended due to BBP\r\n*adv about BP policy\r\n*Provided payment options\r\n\r\nCenter Location:Clark\r\nCTN************************************:\r\nCredit Reason (If credit was applied to account):\r\nAgent UID:rc*********\r\n", 
                                            "Customer Name: Kimberly\r\nCTN: ************************************\r\nAffected CTN:************************************\r\nAlternate Number: none\r\nReason for the Call: add mhs otc $******************\r\nResolution: added mhs otc  $******************\r\nAgent UID: gt***************************g\r\n", 
                                            "auto pop / suspended line\r\n-cx speaking spanish\r\n-transfered call\r\n", 
                                            "Customer Name: TERRELL ;  Reason for the Call: payment  Recommendations/Troubleshooting Steps:  Resolution: payment success  test and validated  CTN: ************************************  Credit Reason (If credit was applied to account):  Agent UID: jq***************************n"
)), row.names = c(NA, 10L), class = "data.frame")

setDT(df)[, ID:=seq(.N)]
f <- \(str) {stri_match_all_regex(str, '(.+)\\:(.+)')[[1]]}

parsed <- df[c(1,3,5,7,8), as.data.table(f(x)[, 2:3]), by=.(ID)]
jlhoward
  • 58,004
  • 7
  • 97
  • 140