R - Finding duplicates of words that are in a reversed order

Question

I have a data.table with a column of occupational title names. I want to find occupations that are repeated but written in reverse word order (e.g. "writer advertising" and "advertising writer"). Here is a simplified version of my data and the result that I would like to get:

library(data.table)

# Sample data: some titles repeat with their words in reverse order
data = data.table(
  ID = as.character(c("advertisings writer","writer advertisings","setter","drill setter","setter drill","agent claims","claims agent","engineer"))
)
# Desired result: one row per distinct set of words
data_result = data.table(
  ID = as.character(c("advertisings writer","setter","drill setter","agent claims","engineer"))
)

Here is the code that I've been using.

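# Split each title into its words (list column)
data[,b:= strsplit(ID," ")]

# One row per (ID, word), then order the words alphabetically within each ID
data <- data[,.(b=unlist(b)),by = setdiff(names(data),'b')]
setorderv(data,cols=c("ID","b"))
# Collect the sorted, unique words per ID and paste them back into one string
data <- data[,bb:=list(list(unique(b))),by="ID"][,.SD[1],by=c("ID"),.SDcols=c("bb")]
data[,b:=lapply(bb,paste,collapse=' ')]
data[,b:=unlist(b)]

# Keep one row per sorted-word string
unique(data,by="b")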

Since I am working with quite a large dataset, this approach is very time-consuming.

Thanks

Waldi · Accepted Answer · 2021-04-05T21:52:48.780


A possible solution with data.table:

1. Split each string into words.
2. Sort the words.
3. Paste the sorted words back together.
4. Keep the unique values.

library(data.table)

# lapply keeps the sorted words as a list even when every title has the same number of words
data[,ID:=sapply(lapply(stringr::str_split(ID,' '),sort),function(x) paste(x,collapse=' '))]
unique(data)

                    ID
1: advertisings writer
2:              setter
3:        drill setter
4:        agent claims
5:            engineer
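
The same four steps can also be sketched in base R without overwriting the titles. This is only an illustration: `word_key` is a hypothetical helper name, not part of data.table or of the answer above. It builds a sorted-word key per title and uses `duplicated()` to keep the first spelling seen for each key, so titles with any number of words are handled the same way.

```r
# Hypothetical helper (an assumption, not from the answer):
# returns an order-independent key for each title.
word_key <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)   # split each title into words
  # vapply always returns a character vector, one key per title
  vapply(words, function(w) paste(sort(w), collapse = " "), character(1))
}

titles <- c("advertisings writer", "writer advertisings", "setter",
            "drill setter", "setter drill", "agent claims",
            "claims agent", "engineer")

# Keep the first spelling seen for each set of words
kept <- titles[!duplicated(word_key(titles))]
print(kept)
# [1] "advertisings writer" "setter"  "drill setter"  "agent claims"  "engineer"
```

With the data.table from the question, `data[!duplicated(word_key(ID))]` should select the same rows while leaving `ID` as originally written.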

edited Apr 05 '21 at 21:52

answered Apr 05 '21 at 20:46

Waldi


This works perfectly Waldi, thanks! Quite efficient and very succinct – Miguel Apr 06 '21 at 20:45

score 0 · Answer 2 · answered Apr 05 '21 at 23:03


Here is an igraph option

library(dplyr)
library(igraph)

data[, TO := gsub("(\\w+)\\s(\\w+)", "\\2 \\1", ID)] %>%
  graph_from_data_frame(directed = FALSE) %>%
  get.data.frame() %>%
  unique() %>%
  subset(select = from)

which gives

                 from
1 advertisings writer
3              setter
4        drill setter
6        agent claims
8            engineer
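
One thing to keep in mind is that the `gsub` pattern swaps words pairwise, so it only produces the reversed title when there are exactly two words. A small base R check illustrates this; `swap_pairs` and `reverse_words` are hypothetical names introduced here for the comparison, and "senior claims agent" is a made-up title.

```r
# The substitution used above, wrapped in a function (hypothetical name)
swap_pairs <- function(x) gsub("(\\w+)\\s(\\w+)", "\\2 \\1", x)

swap_pairs("drill setter")         # "setter drill"
swap_pairs("senior claims agent")  # "claims senior agent" (only the first pair is swapped)

# Reversing the full word order instead (hypothetical helper)
reverse_words <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE),
         function(w) paste(rev(w), collapse = " "),
         character(1))
}

reverse_words("senior claims agent")  # "agent claims senior"
```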


ThomasIsCoding


Thanks for the answer Thomas! I thought about this but the problem is that some titles might have more than two words – Miguel Apr 06 '21 at 20:37
