
I have a dataframe df that contains a couple of columns, but the only relevant ones are given below.

node    |   precedingWord
-------------------------
A-bom       de
A-bom       die
A-bom       de
A-bom       een
A-bom       n
A-bom       de
acroniem    het
acroniem    t
acroniem    het
acroniem    n
acroniem    een
act         de
act         het
act         die
act         dat
act         t
act         n

I'd like to use these values to count the precedingWord values per node, grouped into subcategories: one column titled neuter, another titled non-neuter, and a last one titled rest. neuter would count all rows for which precedingWord is one of t, het, dat. non-neuter would count de and die, and rest would count everything that doesn't belong in neuter or non-neuter. (It would be nice if this could be dynamic, in other words if rest were derived from the variables used for neuter and non-neuter rather than listed separately, or if it simply subtracted the neuter and non-neuter counts from the number of rows for that node.)

Example output (in a new dataframe, let's say freqDf) would look like this:

node    |   neuter   | nonNeuter   | rest
-----------------------------------------
A-bom       0          4             2
acroniem    3          0             2
act         3          2             1

To create freqDf$node, I can do this:

freqDf<- data.frame(node = unique(df$node), stringsAsFactors = FALSE)

But that's already all I got; I don't know how to continue. I figured I could do something like this, but unfortunately the ++ operator doesn't work as I had hoped.

freqDf$neuter[grep("dat|het|t", df$precedingWord, perl=TRUE)] <- ++
freqDf$nonNeuter[grep("de|die", df$precedingWord, perl=TRUE)] <- ++

e <- table(df$node)
freqDf$rest <- as.numeric(e - freqDf$neuter - freqDf$nonNeuter)

Also, this won't work for each node individually. I need some sort of loop that automatically runs for each different value in freqDf$node.

Bram Vanroy
  • An ugly solution with `data.table`: `dt<-as.data.table(df);dt[,list(neuter=sum(precedingWord %in% c("t","het","dat")),nonNeuter=sum(precedingWord %in% c("de","die")),rest=sum(!precedingWord %in% c("t","het","dat","de","die"))),by=node]`. – nicola Apr 02 '15 at 10:05
  • @nicola I get the response that `as.data.table` isn't a function. – Bram Vanroy Apr 02 '15 at 11:19
  • Sorry, I didn't specify that you need `data.table` package. Install it and then put `require(data.table)` on top of what I wrote. – nicola Apr 02 '15 at 11:34

2 Answers

1

One way is to replace the values by their categories and then use the `table` function to generate the frequencies.

neuter <- c("t", "het", "dat")
non.neuter <- c("de", "die")

df$precedingWord[df$precedingWord %in% neuter] <- "neuter"
df$precedingWord[df$precedingWord %in% non.neuter] <- "non.neuter"
df$precedingWord[!df$precedingWord %in% c(neuter, non.neuter)] <- "rest"

table(df)

      precedingWord
  node       neuter non.neuter rest
  A-bom         0          4    2
  acroniem      3          0    2
  act           3          2    1

But I'm sure there is a better solution, for example with the dplyr package.

EDIT: Maybe something like this (it doesn't overwrite your "precedingWord" column but adds a new "gender" one):

library(dplyr)
df %>%
  mutate(gender = ifelse(!precedingWord %in% c(neuter, non.neuter), "rest", 
                         ifelse(precedingWord %in% neuter, "neuter", "non.neuter"))) %>%
  count(node, gender)


Source: local data frame [7 x 3]
Groups: node

      node     gender n
1    A-bom non.neuter 4
2    A-bom       rest 2
3 acroniem     neuter 3
4 acroniem       rest 2
5      act     neuter 3
6      act non.neuter 2
7      act       rest 1

# And if you want the same output you put in your question, you can use table
df2 <- mutate(df, gender = ifelse(!precedingWord %in% c(neuter, non.neuter), "rest", 
                       ifelse(precedingWord %in% neuter, "neuter", "non.neuter")))

table(df2$node, df2$gender)

           neuter non.neuter rest
  A-bom         0          4    2
  acroniem      3          0    2
  act           3          2    1

Edit: Convert the table to a data frame that is easier to manipulate

myTable <- table(df2$node, df2$gender) %>% 
  as.data.frame.matrix %>%
  mutate(node = row.names(.))

> myTable
  neuter non.neuter rest     node
1      0          4    2    A-bom
2      3          0    2 acroniem
3      3          2    1      act
> str(myTable)
'data.frame':   3 obs. of  4 variables:
 $ neuter    : int  0 3 3
 $ non.neuter: int  4 0 2
 $ rest      : int  2 2 1
 $ node      : chr  "A-bom" "acroniem" "act"

# And here is a more understandable way if you are not familiar with piping
# To learn more about forward piping : https://github.com/smbache/magrittr 
myTable <- table(df2$node, df2$gender)
myTable2 <- as.data.frame.matrix(myTable)
myTable3 <- mutate(myTable2, node = row.names(myTable2))
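A base-R variant of the same conversion, as a minimal sketch. It assumes a data frame with the `node` and `gender` columns built above; a small made-up stand-in for `df2` is recreated here so the snippet runs on its own, and the name `freqDf` is borrowed from the question.

```r
# Stand-in for df2 from above: only the two relevant columns
# (values are made up for illustration).
df2 <- data.frame(
  node   = c("A-bom", "A-bom", "acroniem", "act", "act"),
  gender = c("non.neuter", "rest", "neuter", "neuter", "rest"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then turn the contingency table into a plain data frame.
counts <- as.data.frame.matrix(table(df2$node, df2$gender))

# Move the row names into a leading `node` column so that node, neuter,
# non.neuter and rest are all ordinary columns.
freqDf <- data.frame(
  node = rownames(counts),
  counts,
  row.names = NULL,
  stringsAsFactors = FALSE
)

freqDf
```

Because only `df2$node` and `df2$gender` are passed to `table`, any other columns in the data frame are ignored and nothing needs to be removed beforehand.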
Julien Navarre
  • This looks very promising. However, it'd be better to put neuter, non.neuter and rest in a new column called `gender` I think. I don't want to overwrite the values in `precedingWord`. More importantly, though, I can't table df as a whole because it contains many more columns. Would it be better to clone `df` after finishing all operations, and then removing all unwanted columns and then calling the table function? – Bram Vanroy Apr 02 '15 at 11:21
  • Also, shouldn't the definition of `rest` refer to the neuter and non-neuter variables themselves, rather than the string? I.e. `...c(neuter, non.neuter)] <- "rest"`? – Bram Vanroy Apr 02 '15 at 11:41
  • If you use `c(neuter, non.neuter)` instead of the strings, it will replace everything by `rest`, because by then the values are equal to "neuter" and "non.neuter". So `!df$precedingWord %in% c(neuter, non.neuter)` will return all TRUE, because "neuter" and "non.neuter" aren't values in the `neuter` or `non.neuter` objects – Julien Navarre Apr 02 '15 at 12:29
  • My test case proves you wrong, though. By using `c(neuter, non.neuter)` it will look inside `neuter <- c("t", "het", "dat") non.neuter <- c("de", "die")` and not simply for the strings `"neuter"`, `"non.neuter"`. When I used `c("neuter", "non.neuter")`, rest was always true. When using `c(neuter, non.neuter)`, it isn't. – Bram Vanroy Apr 02 '15 at 12:32
  • I suppose that you don't replace directly the "precedingWord" values by the strings then.. Anyway, I edited with a dplyr solution which maybe is better. – Julien Navarre Apr 02 '15 at 12:46
  • Ah yes, this works great! The only problem I have with this is that the column containing the words (`A-bom`, `acroniem` and so on) doesn't have a column name. I'd like to specify it as `node`, so that `node`, `neuter`, `non.neuter` and `rest` are all on the same level. Also, is it possible to convert that table to a dataframe to more easily work with it? Thanks! (+1) – Bram Vanroy Apr 02 '15 at 13:20
  • Thanks, this is perfect. I'm still learning R. I wonder, though, does `count(node, gender)` do anything else but display the count? In other words, if I only need the table as a data frame, I can safely use the code in your second edit, without the one in your first edit, right? Edit: I mean, without the first command in your first edit. – Bram Vanroy Apr 02 '15 at 14:43
1

R usually doesn't require looping. It's designed to act on all elements of a data structure at once, using vectorised operations and the apply family of functions. In this case you don't even need tapply, because the table function already does what you want.
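To make that concrete, here is a small sketch on made-up data. The vectors `node` and `word` and the category vector `neuter` are illustrative only, not the question's full data. A vectorised comparison plus `table` gives per-node counts without an explicit loop, and `tapply` arrives at the same numbers.

```r
# Illustrative data only.
node   <- c("act", "act", "act", "acroniem", "acroniem")
word   <- c("het", "de", "t", "het", "een")
neuter <- c("t", "het", "dat")

# %in% is vectorised: it returns one TRUE/FALSE per row, no loop needed.
isNeuter <- word %in% neuter

# table() cross-tabulates the two vectors in a single call.
tab <- table(node, isNeuter)

# The equivalent with tapply: sum the TRUEs within each node.
perNode <- tapply(isNeuter, node, sum)

tab
perNode
```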

Julien's answer works for your example, but in the (probably unusual) case that no words of a given type are present, it will fail. For example, if you had no "neuter" words then "neuter" would be missing from the table instead of showing all zeroes as expected. To deal with this, you can treat word type as a factor.

Note that in the code below, I added a fourth type of word ("nonword") to demonstrate the zero-words case.

df<-as.data.frame(matrix(c(
  "A-bom","de","A-bom","die","A-bom","de","A-bom","een","A-bom","n","A-bom","de",
  "acroniem","het","acroniem","t","acroniem","het","acroniem","n","acroniem","een",
  "act","de","act","het","act","die","act","dat","act","t","act","n"),
  byrow=TRUE, ncol=2), stringsAsFactors=FALSE)
names(df)<-c("node", "precedingWord")

# dictionary of word types. 
# I added a fourth type of word to demonstrate what happens 
# if no words of a given type are present.
classes<-c("t"="neuter", "het"="neuter" ,"dat"="neuter", "de"="non-neuter", "die"="non-neuter", "blorble"="nonword")

# create class variable and initialize to "rest"
df$class<-"rest"
df$class<-ifelse(!is.na(classes[df$precedingWord]), classes[df$precedingWord], "rest")

# note fourth category, "nonword", is missing.
table(df$node, df$class)

# make sure any missing categories are still possible levels for class
df$class<-factor(df$class)
levels(df$class)<-c(levels(df$class), unique(classes))

#now non-represented categories are still there. 
table(df$node, df$class)
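An alternative sketch for the same goal: declare every possible category as a level when the factor is created, so that `table` reports empty categories as all-zero columns. The data below is a shortened made-up sample, and the names `wordClass` and `allLevels` are assumptions for illustration; the dictionary (including the extra `nonword` category) is carried over from the example above.

```r
# Dictionary of word types, as above ("nonword" never occurs in the data).
classes <- c("t" = "neuter", "het" = "neuter", "dat" = "neuter",
             "de" = "non-neuter", "die" = "non-neuter", "blorble" = "nonword")

# Shortened illustrative data.
node          <- c("A-bom", "A-bom", "act", "act", "act")
precedingWord <- c("de", "een", "het", "die", "n")

# Look each word up in the dictionary; words not in it come back as NA.
wordClass <- unname(classes[precedingWord])
wordClass[is.na(wordClass)] <- "rest"

# Fix the full set of levels up front, including categories that never occur.
allLevels <- c(unique(classes), "rest")
wordClass <- factor(wordClass, levels = allLevels)

tab <- table(node, wordClass)
tab
```

Setting `levels` in the `factor` call avoids having to extend the levels afterwards, and it also fixes the column order of the resulting table.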
octern
  • My dataset is so enormous that it is improbable that there won't be words of a certain type. However, I do keep your information in mind! – Bram Vanroy Apr 02 '15 at 12:18