
Arbitrary data sets often contain string columns, e.g. the species in the Iris data set. I have to convert those to small integers for ML purposes (matrix operations, so numbers only) and reverse the conversion after the calculations. For example: {"setosa" -> 1, "versicolor" -> 2, "virginica" -> 3}.

I iterate through the columns and check the type of each column (the mode of its first element). If it is character (the only problematic mode), I want to get the set of distinct values in that column (e.g. the 3 species in the Iris set), map them to consecutive integers (so I'll really have a matrix instead of a data frame), and reverse the mapping after the calculations (e.g. show predicted values in the target set as strings, not my arbitrary mapped integers). I think I need a list mapping each column index (I don't know in advance which columns will be mapped) to a map (string -> integer) for that particular column.

pogibas
qalis
  • As iris species is already a factor all you need is `as.numeric(iris$Species)`. If column is not a factor, but a character you need to turn it into factor first. – pogibas Jun 21 '19 at 09:00
  • Well, Iris is just an example, I have to work with arbitrary data, e.g. CSV files from the UCI repository (the function reading CSV reads character columns as... well, characters). I am aware of the `as.numeric(factor(col))` trick, but for some reason it does not always work - and it doesn't provide a map with which I can reverse it AFAIK. – qalis Jun 21 '19 at 09:03
  • Is `data.matrix(iris)` what you want? – jay.sf Jun 21 '19 at 09:12
  • In some way - yes, it does map it. But the problem is reversing the mapping, integers back to string. I can't just create another matrix for calculations (for that this method would be incredibly simple and elegant), since for ML datasets it would be too much memory. – qalis Jun 21 '19 at 09:14
  • Sounds as if the `data.table` package could be an option, it doesn't create copies when data is manipulated, see e.g. https://stackoverflow.com/a/7813913/6574038 – jay.sf Jun 21 '19 at 09:17
  • try `library(data.table);iris2 <- as.data.table(iris);iris2[, Species:=as.numeric(Species)]` – jay.sf Jun 21 '19 at 09:22
  • Well, it works, but: 1) it creates copy, iris2; 2) it does not allow me to go back to characters. – qalis Jun 21 '19 at 09:49
  • @qalis 1) copy is only needed in MCVE because `iris` is locked 2) do `as.character(as.numeric(.))` – jay.sf Jun 21 '19 at 10:57

1 Answer


Do something like this:

fac <- factor(charvar)
num <- as.numeric(fac)
# Do some manipulation of num, producing newnum
newcharvar <- levels(fac)[newnum]

For example,

>     fac <- factor(iris$Species)
>     num <- as.numeric(fac)
>     head(num)
[1] 1 1 1 1 1 1
>     newnum <- num[c(1, 100)]
>     newnum
[1] 1 2
>     levels(fac)[newnum]
[1] "setosa"     "versicolor"
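
The same idea can be extended to a whole data frame. What follows is only a minimal sketch, not a definitive implementation: `encode_columns` and `decode_column` are hypothetical helper names made up for this illustration, and the maps are keyed by column name rather than by column index, which is an assumption about what is convenient. Each map stores only the distinct levels of its column, so it stays small even for large data sets.

```r
# Hypothetical helper: replace every character/factor column with integer
# codes and remember the levels so the mapping can be reversed later.
encode_columns <- function(df) {
  maps <- list()
  for (name in names(df)) {
    col <- df[[name]]
    if (is.character(col) || is.factor(col)) {
      fac <- factor(col)
      maps[[name]] <- levels(fac)   # code i corresponds to levels(fac)[i]
      df[[name]] <- as.integer(fac)
    }
  }
  list(data = df, maps = maps)
}

# Hypothetical helper: turn integer codes for one column back into strings.
decode_column <- function(codes, maps, name) {
  maps[[name]][codes]
}

enc <- encode_columns(iris)
m <- as.matrix(enc$data)            # all columns are numeric now
enc$maps$Species
# [1] "setosa"     "versicolor" "virginica"

predicted <- c(1L, 3L, 2L)          # stand-in for model output, not real predictions
decode_column(predicted, enc$maps, "Species")
# [1] "setosa"     "virginica"  "versicolor"
```

Only `enc$maps` has to be kept around to translate predictions back into strings after the calculations.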
user2554330