I have a dataframe df that contains a couple of columns, but the only relevant ones are given below.
node | precedingWord
-------------------------
A-bom de
A-bom die
A-bom de
A-bom een
A-bom n
A-bom de
acroniem het
acroniem t
acroniem het
acroniem n
acroniem een
act de
act het
act die
act dat
act t
act n
I'd like to use these values to count the precedingWords per node, split into subcategories: one column titled neuter, another titled non-neuter, and a last one titled rest. neuter would contain all rows for which precedingWord is one of these values: t, het, dat. non-neuter would contain de and die, and rest would contain everything that doesn't belong in neuter or non-neuter. (It would be nice if this could be dynamic; in other words, rest would either use some sort of negation of the variables that define neuter and non-neuter, or simply subtract the counts in neuter and non-neuter from the number of rows with that node.)
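The categorisation described above could be sketched roughly as follows. This is only an illustration under assumptions: the names categories, wordToCategory and words are hypothetical and not part of my data. The idea is that rest is never listed explicitly; it is whatever the lookup does not cover.

```r
# Hypothetical lookup: category name -> words that belong to it.
categories <- list(
  neuter    = c("t", "het", "dat"),
  nonNeuter = c("de", "die")
)

# Flatten the lookup into a named vector: word -> category.
wordToCategory <- setNames(
  rep(names(categories), lengths(categories)),
  unlist(categories, use.names = FALSE)
)

# A few example words (made up for this sketch).
words <- c("de", "het", "n", "dat", "een", "die", "t")

# Words missing from the lookup come back as NA; those become "rest".
category <- unname(wordToCategory[words])
category[is.na(category)] <- "rest"

print(category)
```

Because rest is derived from the lookup, adding a word to neuter or nonNeuter would automatically remove it from rest.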
Example output (in a new dataframe, let's say freqDf) would look like this:
node | neuter | nonNeuter | rest
-----------------------------------------
A-bom 0 4 2
acroniem 3 0 2
act 3 2 1
To create freqDf$node, I can do this:
freqDf<- data.frame(node = unique(df$node), stringsAsFactors = FALSE)
But that's all I've got; I don't know how to continue. I figured I could do something like this, but unfortunately the ++ operator doesn't work as I had hoped.
freqDf$neuter[grep("dat|het|t", df$precedingWord, perl=TRUE)] <- ++
freqDf$nonNeuter[grep("de|die", df$precedingWord, perl=TRUE)] <- ++
e <- table(df$node)
freqDf$rest <- as.numeric(e - freqDf$neuter - freqDf$nonNeuter)
Also, this won't work for each node individually. I need some sort of loop that automatically runs for each different value in freqDf$node.
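For what it's worth, an explicit loop may not be needed: a cross-tabulation of node against category counts every node in one step. The following is a minimal sketch, not a definitive solution. It assumes the sample rows shown above, and the helper names neuterWords, nonNeuterWords, category and counts are made up for illustration.

```r
# Sample data copied from the rows listed above.
df <- data.frame(
  node = rep(c("A-bom", "acroniem", "act"), times = c(6, 5, 6)),
  precedingWord = c("de", "die", "de", "een", "n", "de",
                    "het", "t", "het", "n", "een",
                    "de", "het", "die", "dat", "t", "n"),
  stringsAsFactors = FALSE
)

# Assumed category definitions (hypothetical variable names).
neuterWords    <- c("t", "het", "dat")
nonNeuterWords <- c("de", "die")

# Exact matching with %in%; anything unmatched falls into "rest".
category <- ifelse(df$precedingWord %in% neuterWords, "neuter",
            ifelse(df$precedingWord %in% nonNeuterWords, "nonNeuter", "rest"))

# Fixed factor levels keep all three columns even if a category is empty.
category <- factor(category, levels = c("neuter", "nonNeuter", "rest"))

# One row per node, one column per category, no explicit loop.
counts <- table(df$node, category)

# Convert the table to a plain data frame with node as a regular column.
freqDf <- as.data.frame.matrix(counts)
freqDf <- cbind(node = rownames(freqDf), freqDf, stringsAsFactors = FALSE)
rownames(freqDf) <- NULL

print(freqDf)
```

Matching whole words with %in% also avoids a pitfall of a pattern such as "dat|het|t", which matches any word that merely contains a t.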