
I have a very large data frame (say 8 rows by 10,000 columns) filled with strings. I want to map each unique string to a number and replace every occurrence of the string with that number.

For example, if I had a dataframe:

   X1    X2        X3
1  cat   mouse     rabbit
2  dog   cat, dog  dog

I'd like to convert it to:

   X1  X2  X3
1   1   2   3
2   4   5   4

Note that the combined label "cat, dog" gets its own unique number. The actual number assigned to each string is irrelevant, as I'm doing this for an inter-rater reliability calculation.

Short of collecting all the unique elements, assigning each a number, and replacing them by hand, is there a more elegant way to do this?

Also, if a value is blank, e.g. "", it should become NA in the numeric data frame.

Maël
user1357015

2 Answers


You can match on the unique values:

df[] <- sapply(df, match, unique(unlist(df)))

df
#   X1 X2 X3
# 1  1  3  5
# 2  2  4  2

Or, even simpler:

df[] <- match(unlist(df), unique(unlist(df)))
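
As a self-contained check, the question's example can be rebuilt and the two forms compared. This is only a sketch: the `data.frame()` construction is assumed from the table in the question, the names `lookup`, `by_column`, and `flat` are illustrative, and `lapply` is used in place of `sapply` as a variation because it always returns a list of columns.

```r
# Rebuild the example (construction assumed from the table in the question)
df <- data.frame(
  X1 = c("cat", "dog"),
  X2 = c("mouse", "cat, dog"),
  X3 = c("rabbit", "dog"),
  stringsAsFactors = FALSE
)

# One shared lookup table of unique strings across all columns
lookup <- unique(unlist(df))

# Column-wise form: match each column against the shared lookup table
by_column <- df
by_column[] <- lapply(df, match, lookup)

# Flattened form: match all values in a single call
flat <- df
flat[] <- match(unlist(df), lookup)

# Both forms should assign the same code to every cell
stopifnot(all(by_column == flat))
```

Building `lookup` once also avoids recomputing `unique(unlist(df))` when the same table is needed again.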
Maël
  • I was about to post essentially the same answer. The only difference is that you might want to write the result into `df[]` so that the output is still a data frame. – Allan Cameron Nov 04 '22 at 13:14
  • Ack, one correction, some of those values can be "" which should convert to an NA. I have made the adjustment. – user1357015 Nov 04 '22 at 13:18
  • No problem, just convert them beforehand; `df[df == ""] <- NA` – Maël Nov 04 '22 at 13:19
  • Actually, I think the NA has to be set after the conversion to numbers; otherwise `match` finds the NA in the unique values and returns its position rather than NA. – user1357015 Nov 04 '22 at 13:29
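
Following up on the comments above, one possible way to keep blanks as NA is sketched here. It assumes blanks are exactly `""`, and it relies on `match()` treating NA as a matchable value, which is why NA is dropped from the lookup table. The small data frame and the name `lookup` are illustrative only.

```r
# Small example with one blank cell (hypothetical data)
df <- data.frame(
  X1 = c("cat", ""),
  X2 = c("mouse", "cat, dog"),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing
df[df == ""] <- NA

# Leave NA out of the lookup table so that NA cells find no match
# and therefore come back as NA
lookup <- unique(na.omit(unlist(df)))
df[] <- match(unlist(df), lookup)

df
```

An alternative with the same effect is to run the original `match()` call first and set the blank positions to NA afterwards.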

Using factor:

df[] <- as.numeric(factor(unlist(df)))

df
#   X1 X2 X3
# 1  1  4  5
# 2  3  2  3

This is, however, much slower than Maël's solution.
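
A rough way to check the speed difference on data shaped like the question's (8 rows by 10,000 columns) is sketched below. The synthetic data, the label pool, and the seed are all assumptions, and timings depend on the machine, so no numbers are shown.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
labels <- c("cat", "dog", "mouse", "rabbit", "cat, dog")  # hypothetical label pool

# Synthetic 8 x 10,000 data frame of strings
big <- as.data.frame(
  matrix(sample(labels, 8 * 10000, replace = TRUE), nrow = 8),
  stringsAsFactors = FALSE
)

# Flatten once so both approaches are timed on the same input
values <- unlist(big, use.names = FALSE)

t_match  <- system.time(codes_match  <- match(values, unique(values)))
t_factor <- system.time(codes_factor <- as.numeric(factor(values)))

# Elapsed seconds for each approach
print(c(match = t_match[["elapsed"]], factor = t_factor[["elapsed"]]))

# The numbering differs, but both encodings should group the cells identically:
# each match code pairs with exactly one factor code
pairs <- unique(cbind(codes_match, codes_factor))
stopifnot(nrow(pairs) == length(unique(codes_match)))
```

The codes differ because `factor()` sorts its levels while `match()` numbers strings in order of first appearance; for the stated purpose only the grouping matters.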

Robert Hacken