Separating columns using regex used to split the columns into two variables

Question

I have a dataset in which my unit of analysis and variable are together in the same column. I would like to split this up into two variables, but not sure how to because it involves regex.

Say I have the following data:

de <- data.frame(year = c(2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
                          2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012),
                 company = c("Company A - Variable 1", "Company A - Variable 2", "Company A - Variable 3",
                             "CompanyB - Variable 1", "CompanyB - Variable 2", "CompanyB - Variable 3", 
                             "Compan- C - Variable 1", "Compan- C - Variable 2", "Compan- C - Variable 3",
                             "Company A - Variable 1", "Company A - Variable 2", "Company A - Variable 3",
                             "CompanyB - Variable 1", "CompanyB - Variable 2", "CompanyB - Variable 3", 
                             "Compan- C - Variable 1", "Compan- C - Variable 2", "Compan- C - Variable 3"),
                 score = c(1,3,5,7,9,11,13,15,17,2,4,6,8,10,12,14,16,18))

Which looks like:

year company                score
<dbl> <chr>                  <dbl>
2010 Company A - Variable 1     1
2010 Company A - Variable 2     3
2010 Company A - Variable 3     5
2010 CompanyB - Variable 1      7
2010 CompanyB - Variable 2      9
2010 CompanyB - Variable 3     11
2010 Compan- C - Variable 1    13
2010 Compan- C - Variable 2    15
2010 Compan- C - Variable 3    17
2012 Company A - Variable 1     2
2012 Company A - Variable 2     4
2012 Company A - Variable 3     6
2012 CompanyB - Variable 1      8
2012 CompanyB - Variable 2     10
2012 CompanyB - Variable 3     12
2012 Compan- C - Variable 1    14
2012 Compan- C - Variable 2    16
2012 Compan- C - Variable 3    18

The company variable is a combination of company names (in all different forms and shapes), always followed by " - " and then the variable name. I want to wrangle the data so that I arrive at the following structure, splitting up the company and the variable into new columns:


year   company   variable score
2010 Company A Variable 1     1
2010 Company A Variable 2     3
2010 Company A Variable 3     5
2010  CompanyB Variable 1     7
2010  CompanyB Variable 2     9
2010  CompanyB Variable 3    11
2010 Compan- C Variable 1    13
2010 Compan- C Variable 2    15
2010 Compan- C Variable 3    17
2012 Company A Variable 1     2
2012 Company A Variable 2     4
2012 Company A Variable 3     6
2012  CompanyB Variable 1     8
2012  CompanyB Variable 2    10
2012  CompanyB Variable 3    12
2012 Compan- C Variable 1    14
2012 Compan- C Variable 2    16
2012 Compan- C Variable 3    18

I guess then use something to extract the text before " - " (the company name) and after " - " (the variable name). Then sort of split this up into two different variables: company and variable. Any elegant solution would be much appreciated. Thanks

@Sotos I don't think this is really a duplicate. The difficulty is not in separating the columns, but rather the regex used to split the columns. — Allan Cameron, Dec 16 '22 at 14:48
Wouldn't separating at the delimiter ' - ' work? Either way I ll reopen — Sotos, Dec 16 '22 at 15:07

score 1 · Accepted Answer · answered Dec 16 '22 at 14:44

1

You can use tidyr::separate with a lookahead regex to split at the blank space before the string "Variable".

library(tidyr)

separate(de, company, sep = "(?=Variable)", into = c('Company', 'Variable'))
#>    year      Company   Variable score
#> 1  2010 Company A -  Variable 1     1
#> 2  2010 Company A -  Variable 2     3
#> 3  2010 Company A -  Variable 3     5
#> 4  2010  CompanyB -  Variable 1     7
#> 5  2010  CompanyB -  Variable 2     9
#> 6  2010  CompanyB -  Variable 3    11
#> 7  2010 Compan- C -  Variable 1    13
#> 8  2010 Compan- C -  Variable 2    15
#> 9  2010 Compan- C -  Variable 3    17
#> 10 2012 Company A -  Variable 1     2
#> 11 2012 Company A -  Variable 2     4
#> 12 2012 Company A -  Variable 3     6
#> 13 2012  CompanyB -  Variable 1     8
#> 14 2012  CompanyB -  Variable 2    10
#> 15 2012  CompanyB -  Variable 3    12
#> 16 2012 Compan- C -  Variable 1    14
#> 17 2012 Compan- C -  Variable 2    16
#> 18 2012 Compan- C -  Variable 3    18

answered Dec 16 '22 at 14:44

Allan Cameron

147,086
7
49
87

Thanks Allan Cameron. This works on this exemplary dataset. However, I realize now that the variable names in the actual data are in all different forms and shapes as well (e.g., " Ethical Trading Initiative" or "Human Rights Score"). So the crux is basically that I need to separate by anything before " - " and anything after " - ". – Joost Maxen Dec 16 '22 at 15:10
@JoostMaxen but that won't work on your example data set, because the `-` appears in some of the company names. If you can ensure that there are no `-` symbols in your company names, then `sep = "-"` should do the trick. – Allan Cameron Dec 16 '22 at 15:16
Right, I see. Thanks. Actually, in the subset I am currently using there are two: "X-FAB SILICON FOUNDRIES - AB SILICON FOUNDRIES - Fundamental Human Rights ILO or UN" and "CASINO GUICHAR - - Fundamental Human Rights ILO or UN". I assume there's not so much to do but manually check and hope that it's just a handful. If indeed there is nothing to be done, I guess the topic can actually be closed cause it's similar to the other one (?). – Joost Maxen Dec 16 '22 at 15:25
@JoostMaxen yes, if you can find the names of companies with hyphens, it may be best to replace the hyphens in the company names temporarily, then separate the columns at the hyphens, then if required swap the hyphens back in. If there are only a few companies with hyphens then this is feasible either manually or using `gsub` – Allan Cameron Dec 16 '22 at 15:33
But I think your question should remain open, as there are a couple of subtleties that make it different from the linked question. – Allan Cameron Dec 16 '22 at 15:34

Separating columns using regex used to split the columns into two variables

1 Answers1