Find string in table A and replace it with number from table B

Question

I have a table (data frame 1) with tokenized strings. These words need to be replaced with a numerical value from a CSV that I read into R. I used the following commands:

library(dplyr)
df1 <- data.frame(tweetsContent, stringsAsFactors = FALSE)
names(df1) <- c('word')
cct <- read.csv('concNorm.csv')  
names(cct) <- c('word','concreteness')
cct <- scan_tokenizer(cct[1])
df2 <- data.frame(cct)
result <- semi_join(df1, df2, by='word')

The error message I get is the following:

Error in UseMethod("semi_join"): no applicable method for 'semi_join' applied to an object of class "character".

I have no idea why class character should be a problem, as the dplyr package doesn't specify any data type for the join functions. When loading dplyr I don't get an error message. I also looked at gsub, but all the examples seemed to replace a certain A with a corresponding B. In my case, A takes on different values, i.e. different words, and therefore has different corresponding values.

The updated file can be found here

A reproducible example would go a long way. Have you tried something like `plyr::mapvalues`? — Roman Luštrik, Feb 14 '16 at 10:46
[How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269), [How to join (merge) data frames (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871) — zx8754, Feb 14 '16 at 14:02
@zx8754 I tried these functions but they don't seem to like the data type I have or the data.frame I created. — Simone, Feb 14 '16 at 19:56
@RomanLuštrik great idea with plyr::mapvalues but the output file must be an atomic vector. I tried the following without success:
results <- c()
plyr::mapvalues(results, cct, cct, warn_missing = FALSE)
It returns an empty vector. The following
results <- as.vector(df, mode = 'any')
plyr::mapvalues(results, cct, cct, warn_missing = FALSE)
gives:
Error in plyr::mapvalues(results, cct, cct, warn_missing = FALSE) : `x` must be an atomic vector — Simone, Feb 14 '16 at 22:08
@RomanLuštrik the plyr::mapvalues only works for atomic vectors. R keeps on crashing. I think this simple function is not meant for this large amount of data in an atomic vector? I.e. I am retrieving 500 to 1000 tweets and store them in mydf. — Simone, Feb 19 '16 at 19:49

Joris Meys · Answer 1 · 2016-02-15T16:24:48.650

I make the following assumptions:

mydf contains a variable word that holds the tokenized strings
cct contains the same variable word and, for every tokenized string, a value thenumber
Every tokenized string occurs exactly once in the data frame cct

Then you simply do:

sel.id <- match(mydf$word, cct$word)
mydf$thenumber <- cct$thenumber[sel.id]

This is both easier and quite a lot faster than any merge() or join() solution.

Reproducible dataset:

mydf <- data.frame(word = sample(letters[1:4], 10 , replace = TRUE))
cct <- data.frame(word = letters[1:4],
                  thenumber = 1:4)

If you want to replace them, obviously you can just overwrite the original variable by changing the second line to:

mydf$word <- cct$thenumber[sel.id]
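
As a minimal, self-contained sketch of the approach above: the values in thenumber and the unmatched word "z" are made up for illustration and are not from the original data. It shows that match() keeps the row order of mydf, handles repeated words, and yields NA where a word is missing from the look-up table.

```r
# Toy data with the structure assumed above (all values are illustrative).
mydf <- data.frame(word = c("b", "a", "d", "a", "z"),
                   stringsAsFactors = FALSE)
cct  <- data.frame(word = c("a", "b", "c", "d"),
                   thenumber = c(10, 20, 30, 40),
                   stringsAsFactors = FALSE)

# For each word in mydf, find its row position in cct (NA if it is absent).
sel.id <- match(mydf$word, cct$word)

# Index the look-up column by those positions; mydf keeps its row order.
mydf$thenumber <- cct$thenumber[sel.id]

print(mydf)
```

Words without an entry in cct (here "z") end up as NA rather than being dropped, which differs from what an inner join would do.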

edited Feb 15 '16 at 16:24

answered Feb 14 '16 at 12:34

Joris Meys


yes your assumptions are correct. Just to clarify mydf contains over 1000 words, which all need to be replaced. I tried the code and didn't get an error message. Upon inspection of mydf I saw that the first two lines of code did not replace the words with the numbers from the 'dictionary' cct. – Simone Feb 14 '16 at 19:42
I added the link to the updated R file to my initial post above in case you wanted to have a look. – Simone Feb 14 '16 at 20:06
@Simone obviously they don't replace it, as I saved them in a different variable for illustration purposes. If you want to replace them, just overwrite the original one. See update to my answer. – Joris Meys Feb 15 '16 at 09:30
Thanks Joris for the clarification. I checked the newly created variable mydf$thenumber prior to my post. It shows NULL, so I assume that it is empty. mydf contains the same words several times, whereas each word exists only once in the look-up table cct. Also, the data frames are of unequal length and width. Maybe that causes the problem? – Simone Feb 15 '16 at 15:51
@Simone the example data I gave has the exact same structure: different sized data frames, every word occurs multiple times in mydf. If it comes empty, then you either made a typo somewhere or you have no matches between mydf and cct. Also remove the nomatch = 0. If there are mismatches, that's going to cause unwanted problems. I wasn't thinking straight. – Joris Meys Feb 15 '16 at 16:25
Thanks for the update Joris. I think I have another issue, as your two lines of code run through without any error, but I get only NA values in the output after removing the nomatch = 0. I have conducted manual checks and some words in mydf are definitely in the look-up table cct. I tried to format the word variables as strings with as.character and the numbers as numbers with as.numeric, but the issue still persists. I am a bit at a loss here... – Simone Feb 19 '16 at 19:30

score 0 · Accepted Answer · answered Feb 20 '16 at 17:06

So finally I made it work. It seems that other lines of code that I used to clean the string data caused problems with variable types and encoding. As mentioned above, adding encoding = 'UTF-8' or specifying the variable as string or numeric didn't fix the problem, so I rewrote some of the cleaning code. Below is the code that works.

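# library() attaches one package per call, so load each one separately
library(stringr)  # str_replace_all()
library(tm)       # removeNumbers(), removeWords(), stopwords()
library(dplyr)    # inner_join()

df <- data.frame(tweetsText, stringsAsFactors = FALSE)
names(df) <- c('words')
# Strip punctuation, replace non-printable characters, lower-case, drop digits
df$words <- gsub("[[:punct:]]", "", df$words)
df$words <- str_replace_all(df$words, "[^[:graph:]]", " ")
df$words <- tolower(df$words)
df$words <- removeNumbers(df$words)
my.stopwords <- c("house", stopwords("english"))
df$words <- removeWords(df$words, my.stopwords)
# Split each tweet on spaces and stack all tokens into a single column
words <- strsplit(df$words, split = " ")
df <- data.frame(words = unlist(words), stringsAsFactors = FALSE)
# Read the look-up table and keep only the tokens that appear in it
cct <- read.table('concNorm.csv', sep = ",", stringsAsFactors = FALSE)
names(cct) <- c('words', 'concreteness')
tog <- inner_join(df, cct, by = 'words')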

I haven't been able to fix the sel.id option in my data set, neither with the old 'cleaning code' nor with the new one. I think it will probably work with different data.
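
For anyone hitting the same all-NA result from match(), here is a rough diagnostic sketch. It assumes the cause is stray whitespace or differing case in the keys, which this thread does not confirm. The data and the helper normalise() are hypothetical and only serve to illustrate the check.

```r
# Hypothetical keys that look alike but differ in whitespace or case.
mydf <- data.frame(word = c("Apple ", " tree", "stone"),
                   stringsAsFactors = FALSE)
cct  <- data.frame(word = c("apple", "tree", "stone"),
                   concreteness = c(5.0, 4.9, 4.8),
                   stringsAsFactors = FALSE)

# match() requires exact string equality, so only "stone" is found here.
raw.id <- match(mydf$word, cct$word)

# Hypothetical helper: clean both sides the same way before matching.
normalise <- function(x) tolower(trimws(as.character(x)))
clean.id <- match(normalise(mydf$word), normalise(cct$word))

# Share of words that found a partner, before and after cleaning.
print(mean(!is.na(raw.id)))
print(mean(!is.na(clean.id)))

mydf$concreteness <- cct$concreteness[clean.id]
```

If the share of matches stays low after this kind of cleaning, comparing a few individual strings with identical() or inspecting them with charToRaw() may show whether the remaining differences are in the encoding.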
