R converting dataframe of strings to unique numbers

Question

I have a very large data frame (say, 8 rows by 10,000 columns) filled with strings. I want to replace each unique string with a unique number.

For example, if I had a dataframe:

   X1       X2     X3
1 cat    mouse rabbit
2 dog cat, dog    dog

I'd like to convert it to:

  X1 X2 X3
1  1  2  3
2  4  5  4

Note that the combined label "cat, dog" gets its own unique number. The actual number assigned to each string is irrelevant, as I'm doing this for an inter-rater reliability calculation.

Short of collecting all the unique elements, assigning each a number, and replacing them myself, is there a more elegant way to do this?

Also, if an element is blank, e.g. "", it should be converted to NA in the numeric data frame.

Maël · Accepted Answer · 2022-11-04T13:21:01.187

5

You can match on the unique values:

df[] <- sapply(df, match, unique(unlist(df)))

df
#>   X1 X2 X3
#> 1  1  3  5
#> 2  2  4  2

Or, even simpler:

df[] <- match(unlist(df), unique(unlist(df)))
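For anyone who wants to run this end to end, here is a minimal sketch built on the example data from the question. The names `vals`, `lookup` and `coded` are only illustrative and not part of the answer; keeping `lookup` around is optional, but it makes it possible to map a code back to its string.

```r
# Example data from the question (strings kept as character, not factor)
df <- data.frame(
  X1 = c("cat", "dog"),
  X2 = c("mouse", "cat, dog"),
  X3 = c("rabbit", "dog"),
  stringsAsFactors = FALSE
)

# Flatten column by column, then number each string by first appearance
vals   <- unlist(df, use.names = FALSE)
lookup <- unique(vals)

# Copy the data frame so the original strings stay available,
# then fill the copy with the integer codes
coded <- df
coded[] <- match(vals, lookup)

# lookup[k] recovers the string that was coded as k
lookup[coded$X2[2]]
```

Because `unlist()` walks the data frame column by column, the numbering follows column order rather than the row order shown in the question, which should not matter when the codes are only used as labels.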

edited Nov 04 '22 at 13:21

answered Nov 04 '22 at 13:12

1

I was about to post essentially the same answer. The only difference is that you might want to write the result into `df[]` so that the output is still a data frame. – Allan Cameron Nov 04 '22 at 13:14
Ack, one correction, some of those values can be "" which should convert to an NA. I have made the adjustment. – user1357015 Nov 04 '22 at 13:18
No problem, just convert them beforehand; `df[df == ""] <- NA` – Maël Nov 04 '22 at 13:19
1

Actually, I think the NA replacement has to happen after the conversion to numbers; otherwise `match` finds the NA among the unique values and assigns it a number rather than returning NA. – user1357015 Nov 04 '22 at 13:29
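One possible way to reconcile the two comments above is to turn blanks into NA first and then leave NA out of the lookup table, so that `match()` has nothing to match it against and returns NA. This is only a sketch under that assumption; `vals`, `lookup` and `coded` are illustrative names, and the example data is the question's data with one cell blanked.

```r
# Question's example with one blank cell
df <- data.frame(
  X1 = c("cat", "dog"),
  X2 = c("mouse", ""),
  X3 = c("rabbit", "dog"),
  stringsAsFactors = FALSE
)

vals <- unlist(df, use.names = FALSE)

# Treat empty strings as missing (%in% never returns NA, so this is
# safe even if the data already contains NA)
vals[vals %in% ""] <- NA

# Build the lookup from non-missing values only, so NA gets no code
lookup <- unique(vals[!is.na(vals)])

coded <- df
coded[] <- match(vals, lookup)  # unmatched NA stays NA

coded
```

The blank cell in `X2` ends up as NA, while every other string still gets its own number.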

Robert Hacken · Answer 2 · 2022-11-04T13:27:17.447

1

Using factor:

df[] <- as.numeric(factor(unlist(df)))

df
#   X1 X2 X3
# 1  1  4  5
# 2  3  2  3

This is, however, much slower than Maël's solution.
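To check the speed difference on your own machine, a rough timing sketch along these lines may help. The simulated data (8 rows by 10,000 columns drawn from five made-up labels) is an assumption chosen to mirror the size mentioned in the question, and the timings will vary with hardware, R version and the number of distinct strings.

```r
set.seed(1)

# Simulated data: 8 x 10,000 strings drawn from a few hypothetical labels
labels <- c("cat", "dog", "mouse", "rabbit", "cat, dog")
big <- as.data.frame(
  matrix(sample(labels, 8 * 10000, replace = TRUE), nrow = 8),
  stringsAsFactors = FALSE
)
vals <- unlist(big, use.names = FALSE)

# Time both approaches on the same flattened vector
t_match  <- system.time(by_match  <- match(vals, unique(vals)))
t_factor <- system.time(by_factor <- as.numeric(factor(vals)))

print(t_match)
print(t_factor)

# The numbering differs (first appearance vs. sorted levels), but each
# match code should pair with exactly one factor code
n_pairs <- length(unique(paste(by_match, by_factor)))
n_pairs == length(unique(by_match))
```

For more reliable numbers than a single `system.time()` call, repeating the measurement several times is advisable.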

edited Nov 04 '22 at 13:27

answered Nov 04 '22 at 13:15
