Regex solution for dataframe rownames

Question

I have a dataframe returned from a function that looks like this:

df <- data.frame(data = c(1,2,3,4,5,6,7,8))
rownames(df) <- c('firsta','firstb','firstc','firstd','seconda','secondb','secondc','secondd')

firsta  1
seconda 5
firstb  2
secondb 6

my goal is to turn it into this:

df_goal <- data.frame(first = c(1,2,3,4), second = c(5,6,7,8))
rownames(df_goal) <- c('a','b','c','d')

    first    second
a   1        5
b   2        6

Basically the problem is that there is information in the row names that I can't discard because there isn't otherwise a way to distinguish between the column values.

divibisan · Accepted Answer · 2019-04-19T15:45:35.350

This is a simple long-to-wide conversion; the twist is that we need to generate the key variable from the rownames by splitting the string appropriately.

In the data you present, the rowname consists of the concatination of a "position" (ie. 'first', 'second') and an id (ie. 'a', 'b'), which is stuck at the end. The structure of this makes splitting it complicated: ideally, you'd use a separator (ie. first_a, first_b) to make the separation unambiguous. Without a separator, our only option is to split on position, but that requires the splitting position to be a fixed distance from the start or end of the string.

In your example, the id is always the last single character, so we can pass -1 to the sep argument of separate to split off the last character as the ID column. If that wasn't always true, you would need to some up with a more complex solution to resolve the rownames.

Once you have converted the rownames into a "position" and "id" column, it's a simple matter to use spread to spread the position column into the wide format:

library(tidyverse)
df %>%
    rownames_to_column('row') %>%
    separate(row, into = c('num', 'id'), sep = -1) %>%
    spread(num, data)

  id first second
1  a     1      5
2  b     2      6
3  c     3      7
4  d     4      8

If row ids could be of variable length, the above solution wouldn't work. If you have a known and limited number of "position" values, you could use a regex solution to split the rowname:

Here, we extract the position value by matching to a regex containing all possible values (| is the OR operator).

We match the "id" value by putting that same regex in a positive lookahead operator. This regex will match 1 or more lowercase letters that come immediately after a match to the position value. The downside of this approach is that you need to specify all possible values of "position" in the regex -- if there are many options, this could quickly become too long and difficult to maintain:

df2
         data
firsta      1
firstb      2
firstc      3
firstd      4
seconda     5
secondb     6
secondc     7
secondd     8
secondee    9

df2 %>%
    rownames_to_column('row') %>%
    mutate(num = str_extract(row, 'first|second'),
           id = str_match(row, '(?<=first|second)[a-z]+')) %>%
    select(-row) %>%
    spread(num, data)

  id first second
1  a     1      5
2  b     2      6
3  c     3      7
4  d     4      8
5 ee    NA      9

thanks, yeah luckily both column names in the real data are the same length though the row id's are variable length so should work. — Hugh_Kelley, Apr 19 '19 at 15:35
what should I look at to understand what's going on in `?<=first|second)[a-z]+'`? Is that just standard regex stuff? — Hugh_Kelley, Apr 19 '19 at 17:12
Yeah, look at [this question](https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups) which it explains in more detail. Basically: `(?<=foo)bar` finds the 1st `bar` which has a `foo` before it. `[a-z]` matches a lowercase letter (from `a` to `z`), and the `+` is a quantifier which means that the thing before can match 1 or more times — divibisan, Apr 19 '19 at 17:14

Regex solution for dataframe rownames

1 Answers1