
I am trying to extract the unique values within each row of a data frame in R without using a for loop.

df <- data.frame(customer = c('joe','jane','john','mary'), fruit = c('orange, apple, orange', NA, 'apple', 'orange, orange'))

df

  customer                 fruit
1      joe orange, apple, orange
2     jane                  <NA>
3     john                 apple
4     mary        orange, orange

What I want for the fruit column is: 'orange, apple', NA, 'apple', 'orange'

  customer                 fruit
1      joe         orange, apple
2     jane                  <NA>
3     john                 apple
4     mary                orange

I tried something along the lines of

apply(df, 1, function(x) unique(unlist(str_split(x[, "fruit"], ", "))))

and it is not working.

How can I get unique values within each row in the dataframe?

zx8754
ybcha204

4 Answers


Base R option:

Split each string on commas, keep the unique values, and paste them back into a comma-separated string.

df$fruit <- sapply(strsplit(df$fruit, ',\\s+'), function(x) toString(unique(x)))
df

#  customer         fruit
#1      joe orange, apple
#2     jane            NA
#3     john         apple
#4     mary        orange
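
One caveat worth checking: `toString(NA)` returns the string "NA", so jane's missing value becomes a literal "NA" (which would explain why the printout above shows NA rather than <NA>). Below is a minimal sketch that keeps it as a true missing value, assuming `fruit` is a character column. The helper name `dedupe_fruit` is made up for illustration and is not part of the original answer.

```r
df <- data.frame(
  customer = c("joe", "jane", "john", "mary"),
  fruit = c("orange, apple, orange", NA, "apple", "orange, orange"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: de-duplicate one comma-separated string,
# passing missing values through unchanged.
dedupe_fruit <- function(x) {
  if (is.na(x)) return(NA_character_)
  toString(unique(strsplit(x, ",\\s*")[[1]]))
}

# vapply() guarantees a character vector with one string per row.
df$fruit <- vapply(df$fruit, dedupe_fruit, character(1), USE.NAMES = FALSE)
is.na(df$fruit)
```

With this variant, `is.na(df$fruit)` should flag only jane's row.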
Ronak Shah

A simple pipe-based approach using dplyr, stringr::str_split and purrr::map:

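library(dplyr)
library(stringr)
library(purrr)

# fruit becomes a list column holding the unique values for each row
df %>% mutate(fruit = str_split(fruit, ", "),
              fruit = map(fruit, ~ unique(.x)))
  customer         fruit
1      joe orange, apple
2     jane            NA
3     john         apple
4     mary        orange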

Or base R only:

df$fruit <- Map(unique, strsplit(df$fruit, ", "))
df

  customer         fruit
1      joe orange, apple
2     jane            NA
3     john         apple
4     mary        orange

Note: this assumes that the values in every string are separated by a comma and a space, as in the sample data.

AnilGoyal
  • I have a quick question, I gave your first suggestion a try and it worked for my case but the outputs that contain commas return the values in the following form c("Case1","Case2"). Can I add another argument so that the output is just a string separated by commas? – Raul Nov 13 '22 at 23:35
  • As an update I tried adding %>% unnest(cols = c(fruit)) but end up losing everything after the first case – Raul Nov 14 '22 at 02:24
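
On the question in the comments: both snippets in this answer leave `fruit` as a list column, which is probably why rows with several values show up as c("Case1", "Case2"). One way to get a plain string is to collapse each element with `toString()`. The sketch below uses base R to stay dependency-free and starts from the list that the answer's base R code produces. The names `fruit_list` and `fruit_chr` are made up for illustration.

```r
# List of unique values per row, built the same way as in the answer.
fruit_list <- Map(
  unique,
  strsplit(c("orange, apple, orange", NA, "apple", "orange, orange"), ", ")
)

# Collapse each element to a single string; keep missing values as NA
# instead of letting toString() turn them into the text "NA".
fruit_chr <- vapply(
  fruit_list,
  function(x) if (all(is.na(x))) NA_character_ else toString(x),
  character(1)
)
fruit_chr
```

In the dplyr pipeline, something like `map_chr(fruit, ~ toString(unique(.x)))` in place of `map()` should give the same kind of character column, though a missing value would then come out as the text "NA".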

Updated solution: I modified my code to match the output you are after.

library(dplyr)
library(tidyr)

df %>%
  separate_rows(fruit) %>%
  distinct(customer, fruit) %>%
  group_by(customer) %>%
  summarise(fruit = paste(sort(fruit, na.last = FALSE), collapse = ", "))

# A tibble: 4 x 2
  customer fruit        
  <chr>    <chr>        
1 jane     NA           
2 joe      apple, orange
3 john     apple        
4 mary     orange

Anoushiravan R
  • I am so new to this site that it took me a while to edit the OP.. Thanks so much for your answer. It looks really promising! Would there be a way to keep the unique values per customer? I edited to show what I want. From your answer, I guess I can group_by(customer) and then go from there? – ybcha204 Apr 09 '21 at 21:22
  • I modified it, so that we have delimited unique values in each row by each customer. – Anoushiravan R Apr 09 '21 at 21:45

Here is a potential solution using base R, no libraries. Lots of ugly brackets, but I think it works.

df$fruit <- lapply(seq_len(nrow(df)), function(n) unique(trimws(unlist(strsplit(df$fruit[n], ",")))))

Output as follows:

> df
  customer         fruit
1      joe orange, apple
2     jane            NA
3     john         apple
4     mary        orange
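
For comparison, the `apply()` attempt from the question can be made to work row by row. Inside `apply(df, 1, ...)` each row arrives as a named character vector, so `x[, "fruit"]` fails with an "incorrect number of dimensions" error, and a single field has to be picked with `x[["fruit"]]`. (`str_split()` would also need stringr loaded; the sketch uses base `strsplit()` instead.) This is a minimal sketch, assuming both columns are character and the separator is always a comma plus a space.

```r
df <- data.frame(
  customer = c("joe", "jane", "john", "mary"),
  fruit = c("orange, apple, orange", NA, "apple", "orange, orange"),
  stringsAsFactors = FALSE
)

# Each row x is a named character vector, so index it by name.
res <- apply(df, 1, function(x) {
  if (is.na(x[["fruit"]])) return(NA_character_)
  toString(unique(strsplit(x[["fruit"]], ", ", fixed = TRUE)[[1]]))
})

# Drop any names apply() may attach before storing the result.
df$fruit <- unname(res)
```

Unlike the `lapply` version above, this stores plain strings rather than a list column.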
  • The `apply` function is designed to apply functions across the rows or columns of an object, while `lapply` is for lists. You've just worked around that. – Daniel O Apr 09 '21 at 21:30