Remove text between comma and dash in R with regular expressions

Question

I would like to remove text between commas and dashes in a long string of variable labels saved as a comma-separated. Here's a minimal example of my string:

myvarlabels <- ("participant number, How much do you like the following products-green tea, How much do you like the following products-beer,\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\"")

Importantly, the variable labels appear in two different forms and should be shortened in the following way:

How much do you like the following products-green tea
should be reduced to: green tea
\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\"
should be reduced to: \"Japanese, Chinese, and Indian green tea\"

I tried to use gsub and regular expressions to identify and then delete the text between the commas and the dash (i.e., replacing the text with "").

Has anyone a suggestion as to how I could use gsub to remove the text between the commas that indicate the start of a new column and the dashes that are followed by the text that I want to keep while preserving the double quotes?

EDIT 1

To be more precise, the data include three types of comma-separated chunks of text. They all specify what information the corresponding variables contain:

short descriptions including one or more words (e.g., participant number)
longer descriptions where the relevant information only appears after a dash (e.g., How much do you like the following products-green tea)
same as above but with commas present somewhere before the dash (e.g., How much, if anything at all, would you ...); this is why this type of chunk of text is preceded and followed by \" (otherwise they are not correctly read)
same as above but with NO commas before the dash (e.g., How much experience do you have with the following products)

The four types of text sequences are all preceded and followed by commas and can appear in any order.

Here's a new minimal example that more accurately reflects the real data than my first example:

(myvarlabels3 <- ("participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email,telephone number"))

Cath's code (Edit 2) works up to a certain point. When I add more of the "simple" type 1 sequences of text at the beginning of the string or when I add a text sequence specified under 4. in the above list, the code doesn't work properly anymore.

However, when Cath's code from Edit 2 is run in two steps, then it works perfectly:

myvarlabels3 <- gsub("((?<=,\")[^-]*[^-]+-)|((?<=,\")[^-],*[^-]+-)", "", myvarlabels3, perl=TRUE) # step 1: shorten the text sequences specified under 3. and 4. in the list above

[1] "participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"Indian spices\",\"Japanese, Chinese, and Indian beer\",email,telephone number"

gsub("((?<=,)[^-\",]+-)", "", myvarlabels3, perl=TRUE) # step 2: shorten the text sequences specified as 2. in the above list

[1] "participant number,age,gender,body mass index,green tea,beer,outdoor temperature,season,\"Indian spices\",\"Japanese, Chinese, and Indian beer\",email,telephone number"

I think it would probably be possible to only use one line of code but I couldn't figure out how. Anyway, this will greatly facilitate my workflow when I import messy csv files from Qualtrics.

A solution will require the beginning of each question to be regular as in the below answer. Or else the problem will need conditionals that are not regex expressions. One comma and dash separate your desired text **sometimes** other times multiple commas separate the text. That is a big difference. — Pierre L, May 09 '16 at 13:51
Thanks a lot for your comment Pierre. Unfortunately, the questions do not all begin with the same words. You suggest that I try to solve it with conditionals. I would greatly appreciate more information on how I could do that. — iunda, May 09 '16 at 13:57
You need a clear beginning and end to the text you want to extract. Do you have a list of the questions alone? Can you isolate the first word of all the questions? Are they all "How" "What" "When" questions? You already have the beginning down. But you need a better way to locate the end of each text extraction. — Pierre L, May 09 '16 at 14:04
I would like to have a function that is applicable as generally as possible. There are two patterns that I can describe quite clearly (see my answer to Cath). It would be great to have a function that can work with this general description of the pattern. — iunda, May 09 '16 at 14:17
It is clear to __you__ how to separate the text, but it is not algorithmic. With `"qrs-tuv,abc,def-ghi,mno,bcd,lmn,hij-wxy,uvx,stu,"`, can you see how there is no pattern to extracting `'tuv" "ghi,mno,bcd" "wxy,uvx"`? — Pierre L, May 09 '16 at 14:26
In your sequence, I'd be interested to delete the following elements: qrs- def- hij- . I think there is a pattern that I would describe as **whenever there is something between a comma and a dash, remove it**. — iunda, May 09 '16 at 14:51
However, this gets complicated by the fact that there are sometimes sequences (actually, they represent coluns in a csv file) that start and end with \" and have several commas (e.g., \"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\"). In this case, the English translation would be: **when there is a dash between two \", then delete everything after the first \" and the dash**. — iunda, May 09 '16 at 14:53

Cath · Accepted Answer · 2016-05-10T07:00:53.963

1

I'm not sure I understand what your desired output is, but you can try spotting the "start of a new column" based on "How much" and then go until you "meet" a dash:

gsub("(^[^,]+, )|(How much[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "green tea, beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""

EDIT

Considering your patterns, you can try the following:

gsub("((?<=, )[^-\"]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "participant number, green tea, beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""

I use 2 possible patterns, according to the 2 possible ones you described, with look behinds to specify what should be there but needs to be kept

EDIT2

If you don't have a space between the comma and the question that doesn't begin with quote, you can do:

myvarlabels_2 <- ("participant number,How much do you like the following products-green tea, How much do you like the following products-beer,\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\"")
gsub("((?<=,)[^-\",]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels_2, perl=TRUE)
[1] "participant number,green tea,beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""

edited May 10 '16 at 07:00

answered May 09 '16 at 13:50

Cath

23,906
5
52
86

Thanks a lot for your comment Cath. Unfortunately, the questions do not all begin with "How much". Do you have any idea as to how the code could be adapted so it works for different questions? – iunda May 09 '16 at 13:59
Pattern 1: ***,xyz-keep,*** should become ,keep, – iunda May 09 '16 at 14:14
Pattern 2: ***,\"xyz(,)with at least one comma somewhere...-ke(,)ep,*** should become ,\"ke(,)ep\", (Pattern 2 can have commas either in the part that should be deleted or in the part that I'd like to keep). – iunda May 09 '16 at 14:14
That is not clear as mentioned by Cath and I both. You have to specify how the expression is supposed to know the difference each time. The pattern is not regular. – Pierre L May 09 '16 at 14:15
@Cath that's excellent. The coe works exactly as I desire with the example I provided. Unfortunately, the code doesn't work with my actual data. I'll try to figure out what the difference might be. – iunda May 09 '16 at 15:11
@Cath Thanks again for your help so far! I found out why the code doesn't work on my actual data: My data string does not have a space before the question starting with *How do you like*. I tried to adapt the code you suggested and removed the space in the regular expression. Unfortunately, then it deletes everything between the first comma and the dash. Do you think it would be possible to change the first part of the regular expression so **only the text between the *last comma before* the dash and the dash** is deleted? – iunda May 09 '16 at 15:58
@Cath I'm sorry I didn't provide a more accurate code the first time. I just started coding and don't have much experience with these things. I tested your suggestion, made some changes, and presented a solutation that seems to work with my actual dataset (although this needs some more checking). I hope this will be useful for other users. Thanks so much for your patience and help. It is very much appreciated! – iunda May 10 '16 at 14:38

Remove text between comma and dash in R with regular expressions

1 Answers1