
I am trying to make a frequency table based on a dataframe with pandas and Python. In fact it's exactly the same as a previous question of mine which used R.

Let's say that I have a dataframe in pandas that looks like this (in fact the dataframe is much larger, but for illustrative purposes I limited the rows):

node    |   precedingWord
-------------------------
A-bom       de
A-bom       die
A-bom       de
A-bom       een
A-bom       n
A-bom       de
acroniem    het
acroniem    t
acroniem    het
acroniem    n
acroniem    een
act         de
act         het
act         die
act         dat
act         t
act         n
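For reproducibility, the sample data above can be rebuilt like this (same column names as in the table):

```python
import pandas as pd

# Reconstruct the sample dataframe shown above
df = pd.DataFrame({
    'node': ['A-bom'] * 6 + ['acroniem'] * 5 + ['act'] * 6,
    'precedingWord': ['de', 'die', 'de', 'een', 'n', 'de',
                      'het', 't', 'het', 'n', 'een',
                      'de', 'het', 'die', 'dat', 't', 'n'],
})

print(df.shape)  # (17, 2)
```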

I'd like to use these values to count the precedingWords per node, but with subcategories. For instance: one column titled neuter, another non-neuter, and a last one rest. neuter would contain all rows whose precedingWord is one of t, het, dat; non-neuter would contain de and die; and rest would contain everything that belongs in neither neuter nor non-neuter. (Ideally this would be dynamic, in other words rest would use some sort of inverse of the sets used for neuter and non-neuter, or simply subtract the neuter and non-neuter counts from the number of rows for that node.)

Example output (in a new dataframe, let's say freqDf) would look like this:

node    |   neuter   | nonNeuter   | rest
-----------------------------------------
A-bom       0          4             2
acroniem    3          0             2
act         3          2             1

I found an answer to a similar question, but the use case isn't exactly the same. It seems to me that in that question all variables are independent. In my case, however, I clearly have multiple rows with the same node, which should all be collapsed into a single frequency row, as shown in the expected output above.

I thought something like this (untested):

def specificFreq(d):
    # d is the sub-dataframe for one node, so no inner loop is needed
    neuter = d['precedingWord'].isin(['t', 'het', 'dat']).sum()
    non_neuter = d['precedingWord'].isin(['de', 'die']).sum()
    return pd.Series({'neuter': neuter,
        'nonNeuter': non_neuter,
        'rest': len(d) - neuter - non_neuter})  # rows for this node minus the neuter and non-neuter counts above

df.groupby('node').apply(specificFreq)

But I highly doubt this is the correct way of doing something like this.

Bram Vanroy

1 Answer


As proposed in the R solution, you can first label each preceding word with its gender and then perform the cross tabulation:

neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]

df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[~df.precedingWord.isin(neuter + non_neuter), "gender"] = "rest"
# neuter + non_neuter is the concatenation of both lists.

pd.crosstab(df.node, df.gender)
gender    neuter  non_neuter  rest
node                              
A-bom          0           4     2
acroniem       3           0     2
act            3           2     1

This one is better because if a word in neuter or non_neuter is not present in precedingWord, it won't raise a KeyError like the former solution below.
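Put together, a self-contained version of this approach (rebuilding the sample data from the question) could look like:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'node': ['A-bom'] * 6 + ['acroniem'] * 5 + ['act'] * 6,
    'precedingWord': ['de', 'die', 'de', 'een', 'n', 'de',
                      'het', 't', 'het', 'n', 'een',
                      'de', 'het', 'die', 'dat', 't', 'n'],
})

neuter = ['t', 'het', 'dat']
non_neuter = ['de', 'die']

# Label each row with a gender category; everything else becomes "rest"
df.loc[df.precedingWord.isin(neuter), 'gender'] = 'neuter'
df.loc[df.precedingWord.isin(non_neuter), 'gender'] = 'non_neuter'
df.loc[~df.precedingWord.isin(neuter + non_neuter), 'gender'] = 'rest'

freqDf = pd.crosstab(df.node, df.gender)
print(freqDf)
```

This prints the same table as above: one row per node, one column per gender category.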


Former solution, less clean.

Given your dataframe, you can make a simple cross tabulation:

ct = pd.crosstab(df.node, df.precedingWord) 

which gives:

pW        dat  de  die  een  het  n  t
node                                  
A-bom       0   3    1    1    0  1  0
acroniem    0   0    0    1    2  1  1
act         1   1    1    0    1  1  1

Then, you just want to sum certain columns together:

neuter = ["t", "het", "dat"]
non_neuter = ["de","die"]
freqDf = pd.DataFrame()

freqDf["neuter"] = ct[neuter].sum(axis=1)
ct.drop(neuter, axis=1, inplace=True)

freqDf["non_neuter"] = ct[non_neuter].sum(axis=1)
ct.drop(non_neuter, axis=1, inplace=True)

freqDf["rest"] = ct.sum(axis=1)

Which gives you freqDf:

          neuter  non_neuter  rest
node                              
A-bom          0           4     2
acroniem       3           0     2
act            3           2     1
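The same sums can also be written without mutating ct, since drop without inplace returns a copy. A self-contained sketch, rebuilding the sample data from the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'node': ['A-bom'] * 6 + ['acroniem'] * 5 + ['act'] * 6,
    'precedingWord': ['de', 'die', 'de', 'een', 'n', 'de',
                      'het', 't', 'het', 'n', 'een',
                      'de', 'het', 'die', 'dat', 't', 'n'],
})

neuter = ['t', 'het', 'dat']
non_neuter = ['de', 'die']

ct = pd.crosstab(df.node, df.precedingWord)

freqDf = pd.DataFrame(index=ct.index)
freqDf['neuter'] = ct[neuter].sum(axis=1)
freqDf['non_neuter'] = ct[non_neuter].sum(axis=1)
# drop returns a copy here, so ct itself stays intact for later reuse
freqDf['rest'] = ct.drop(neuter + non_neuter, axis=1).sum(axis=1)

print(freqDf)
```

Note that this variant still raises a KeyError if a word in neuter or non_neuter never occurs in precedingWord, as discussed in the comments below.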

HTH

jrjc
  • Thanks. I think I understand what's going on: create a frequency table of ALL words. Then, from that table, make sums: first the neuter cases, then non-neuter. After each operation, drop the relevant columns, then finally count all remaining columns. Correct? One question though: what does `axis=1` do? Also, I quickly tried this in Python 3.4.3 and got the error that freqDf isn't defined. Should I first create a new table named freqDf? – Bram Vanroy Aug 19 '15 at 13:51
  • @BramVanroy, yes, that's exactly what it's doing. `axis=1` chooses the dimension (either columns or rows, so 1 is for columns and 0 for rows) along which you'll do something. – jrjc Aug 19 '15 at 13:57
  • @BramVanroy, I edited for `freqDf`; I forgot to copy/paste a line, but yes, you need to declare it. – jrjc Aug 19 '15 at 13:57
  • This is great so far, however there is an issue. When using either solution, when an element in `neuter` isn't found, Python will throw an index error. As an example: change neuter to `neuter = ["het", "banana"]`. Though a fictitious example, it can cause problems down the road. – Bram Vanroy Aug 20 '15 at 09:19
  • @BramVanroy, there is indeed a `KeyError` with the first solution, but not with the second. – jrjc Aug 20 '15 at 09:41
  • Ah yes, you are right. I'll accept your answer. The second approach is even faster, I noticed. Do you know why there is no KeyError in the second option? Because the [docs](http://pandas.pydata.org/pandas-docs/stable/indexing.html) state that *.loc will raise KeyError when the items are not found.* – Bram Vanroy Aug 20 '15 at 09:51
  • There is no KeyError because it is testing whether words in `precedingWord` are in a given list. `df.precedingWord.isin(neuter)` is just a Series of True or False (results of the previous test `isin`), and pandas will just access True indexes with `loc` – jrjc Aug 20 '15 at 09:57
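Two points from the comments above can be illustrated with a small sketch (made-up data): what `axis` selects, and why the `isin`-based mask never raises a KeyError:

```python
import pandas as pd

# axis: 0 aggregates down the rows (one result per column),
#       1 aggregates across the columns (one result per row)
m = pd.DataFrame({'a': [1, 2], 'b': [10, 20]}, index=['x', 'y'])
print(m.sum(axis=0).tolist())  # [3, 30]
print(m.sum(axis=1).tolist())  # [11, 22]

# isin returns a boolean Series; unknown words like 'banana'
# simply come out False, so .loc never looks up a missing key
s = pd.DataFrame({'precedingWord': ['het', 'de', 'n']})
mask = s.precedingWord.isin(['het', 'banana'])
print(mask.tolist())  # [True, False, False]
s.loc[mask, 'gender'] = 'neuter'  # assigns only where the mask is True
```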