4

Good morning everyone, I have read several things about columns splitting with R but I could not find how to fix my case.

I would like to use the 'separate' function from tidyr R package to split columns of a dataframe in two columns each, according to a separator.

I have this data frame:

dat1 AIN5997 AIN7452 AIN8674 AIN9655 001 01/02 02/02 02/02 01/02 002 01/02 01/01 02/02 02/02 003 01/02 01/02 01/01 02/02 004 01/02 01/01 02/02 01/02 005 01/01 01/01 02/02 02/02 006 01/02 01/02 01/01 02/02 ...

And I would like to separate each column into two according to "/", and if possible while keeping column names (e.g. : AIN5997 would become AIN5997.1 and AIN5997.2)

I think it is possible with 'separate' but I could not extend the procedure to each column of my frame when trying with an 'apply' (probably because separate works with dataframes and a single column at a time). This must be actually very easy but my R skills are quite bad!

There are many threads explaining how to split one column into two, like this one: Split a column of a data frame to multiple columns

but I cannot find how to expand the procedure for several columns simultaneously.

Thank you very much for your help,

All the best :)

Chrys
  • 313
  • 3
  • 10

2 Answers2

3

The trick is to create the new names in the right order, so make sure the columns you want to separate are ordered in advance.

The problem with NA values is that the process cannot split them. So, the trick is to replace them with something you can split. Check this:

library(dplyr)
library(tidyr)

# example dataset
dt = data.frame(id = 1:2,
                AIN5997  = c("01/02", "01/02"),
                AIN7452  = c("02/02", NA),
                AIN8674 = c("02/02","02/02"), stringsAsFactors = F)

# specify columns you want to separate (specify column positions)
input_names = names(dt)[2:4]

# create new names (you want each name twice)
new_names = expand.grid(input_names, 1:2) %>% 
  unite(v, Var1, Var2, sep=".") %>% 
  pull(v) %>% 
  sort()

dt %>%
  unite_("v", input_names) %>%                  # unite columns of interest
  mutate(v = gsub("NA", "NA/NA", v)) %>%        # replace NAs with something that can be separated
  separate(v, new_names, convert = F)           # separate elements and give new names

#   id AIN5997.1 AIN5997.2 AIN7452.1 AIN7452.2 AIN8674.1 AIN8674.2
# 1  1        01        02        02        02        02        02
# 2  2        01        02        NA        NA        02        02

I'm also adding a better solution. It automatically treats NA values and you don't have to worry about column names and their order.

library(dplyr)
library(tidyr)
library(purrr)

# example dataset
dt = data.frame(id = 1:2,
                AIN5997  = c("01/02", "01/02"),
                AIN7452  = c("02/02", NA),
                AIN8674 = c("02/02","02/02"), stringsAsFactors = F)

# separate a given column of your initial dataset
f = function(x) { dt %>% select_("id", x) %>% separate_(x, paste0(x, c(".1",".2"))) }


names(dt)[2:4] %>%             # get names of columns you want to separate
  map(f) %>%                   # apply the function above to each name (will create a list of dataframes)
  reduce(left_join, by="id")   # join dataframes iteratively

#   id AIN5997.1 AIN5997.2 AIN7452.1 AIN7452.2 AIN8674.1 AIN8674.2
# 1  1        01        02        02        02        02        02
# 2  2        01        02      <NA>      <NA>        02        02
AntoniosK
  • 15,991
  • 2
  • 19
  • 32
  • Thanks a lot Antonio, this is working perfectly well, apart from one little issue: I sometimes have missing values in my genotypes, coded as (sorry, I should have mentioned that before), and it seems that the command could not split them into two new values. Also I am curious: how does the command know it has to split main values according to "/" :)? Thank you so much again! – Chrys Nov 01 '17 at 13:41
  • You can see that the command knows to split using any non-alphanumeric value. Check here: https://rdrr.io/cran/tidyr/man/separate.html , or type `?separate`. – AntoniosK Nov 01 '17 at 13:43
  • 1
    Thanks Antonio, I'll replace my NA values with "00/00" and have a look at your link! Best regards, Chrys – Chrys Nov 01 '17 at 13:50
  • Yes, that will work. I'm following the same approach (i.e. replacing with "NA/NA"). I've made a small but useful change. I've added `convert = .` within the `separate` command, which will make the new columns numeric, by understanding you have numerical values if you set it to `T`. In that case you'll lose the zero in front of your values. – AntoniosK Nov 01 '17 at 13:55
  • 1
    Great :-) This is exactly what I needed, many thanks for your kind help! Have a very nice day, Chrys – Chrys Nov 01 '17 at 13:58
  • Cool. I'll leave it as `convert = F` in the answer in order to match exactly my output. Feel free to use `convert = T` if you prefer numerical values and you don't care about losing the zeros (i.e. having 2 instead of 02). – AntoniosK Nov 01 '17 at 14:01
  • @Chrys please check the last version (using `map` and `reduce`) as I think is much better and easier to generalise in future cases. If this works well I'd like to remove the previous versions in order to keep the answer short and to the point. – AntoniosK Nov 01 '17 at 14:31
  • this does work as well! Indeed this factorized version might be of wide use :), although it implies to have no duplicated "id" names (?) while the previous version enabled to split whatever "id" was. – Chrys Nov 01 '17 at 15:29
  • Indeed, but it depends how you define your "id" column. Code could be adjusted to use multiple rows as a unique identifier of a row. Or, you can create an "id" column that has the number of each row. – AntoniosK Nov 01 '17 at 16:35
1

You can also use tstrsplit().

# example dataset
df <- data.frame(AIN5997  = c("01/02", "01/02"),
                 AIN7452  = c("02/02","01/01"),
                 AIN8674 = c("02/02","02/02"), stringsAsFactors = F)
df
df2 <- as.data.frame(unlist(lapply(df, data.table::tstrsplit, "/"),
                            recursive = FALSE))
df2
colnames(df2) # change colnames
colnames(df2) <- paste(substr(colnames(df2), 1, nchar(colnames(df2))-1),
                       substr(colnames(df2), nchar(colnames(df2)), nchar(colnames(df2))),
                       sep = ".")
df2
nghauran
  • 6,648
  • 2
  • 20
  • 29