Turn dataframe into frequency list with two column variables in Python

Question

I have a dataframe that consists of columns node, component, and preceding word. Node contains many identical values (alphabetically sorted), component contains many identical values as well, but scrambled, and preceding word can be all kind of words - but some identical as well.

What I want to do now is create some sort of cross-section/frequency list that shows the frequency of the component and the preceding word, linked to the node.

Let's say this is my df:

node    precedingWord comp
banana  the           lel
banana  a             lel
banana  a             lal
coconut some          lal
coconut few           lil
coconut the           lel

I am expecting a frequency list that shows each unique node, and the times that some value is found in the other columns given matching criteria, e.g.

det1 = a
det2 = the
comp1 = lel
comp2 = lil
comp 3 = lal

expected output:

node    det1  det2 unspecified comp1 comp2 comp3
banana  2     1    0           2     0     1
coconut 0     1    0           1     1     1

I already did it for one variable, but I don't know how to get the comp columns in place:

det1 = ["a"]
det2 = ["the"]

df.loc[df.preceding_word.isin(det1), "determiner"] = "det1"
df.loc[df.preceding_word.isin(det2), "determiner"] = "det2"
df.loc[df.preceding_word.isin(det1 + det2) == 0, "determiner"] = "unspecified"

# Create crosstab of the node and gender
freqDf = pd.crosstab(df.node, df.determiner)

I got this answer from here. If anyone could explain what the loc does, that would be a lot of help as well.

With Andy's answer in mind, I tried the following. Note that "precedingWord" has been replaced by "gender", which holds only values neuter, non_neuter, gender.

def frequency_list():
    # Define content of gender classes
    neuter = ["het"]
    non_neuter = ["de"]

    # Add `gender` column to df
    df.loc[df.preceding_word.isin(neuter), "gender"] = "neuter"
    df.loc[df.preceding_word.isin(non_neuter), "gender"] = "non_neuter"
    df.loc[df.preceding_word.isin(neuter + non_neuter) == 0, "gender"] = "unspecified"

    g = df.groupby("node")

    # Create crosstab of the node, and gender and component
    freqDf = pd.concat([g["component"].value_counts().unstack(1), g["gender"].value_counts().unstack(1)])

    # Reset indices, starting from 1, not the default 0!
    """ Crosstabs don't come with index, so we first set the index with
    `reset_index` and then alter it. """
    freqDf.reset_index(inplace=True)
    freqDf.index = np.arange(1, len(freqDf) + 1)

    freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")

The output is close to what I want, but not exactly:

The crosstabs aren't "merged", in other words: first the rows are displayed for comp (component), then the same nodes are displayed for gender.
Empty values ought to be 0.
All values ought to be integers, no floats.

What I would want, then, is this:

Do note that I'm looking for the most efficient answer. I'm actually working with loads, and loads of data so each second per loop counts!

Andy Hayden · Accepted Answer · 2015-11-27T03:05:04.200

Update: This is a crosstab:

In [11]: df1 = pd.crosstab(df['node'], df['precedingWord'])

In [12]: df1
Out[12]:
precedingWord  a  few  some  the
node
banana         2    0     0    1
coconut        0    1     1    1

In [13]: df2 = pd.crosstab(df['node'], df['comp'])

which is clearly a cleaner (and more efficient algorithm for large data).

Then glue them up with a concat with axis=1 (i.e. add as more columns rather add as more rows).

In [14]: pd.concat([df1, df2], axis=1, keys=['precedingWord', 'comp'])
Out[14]:
        precedingWord              comp
                    a few some the  lal lel lil
node
banana              2   0    0   1    1   2   0
coconut             0   1    1   1    1   1   1

I would probably leave it like this (as a MultiIndex), if you want it flattened just don't pass the keys (though there may be an issue with duplicated words):

In [15]: pd.concat([df1, df2], axis=1)
Out[15]:
         a  few  some  the  lal  lel  lil
node
banana   2    0     0    1    1    2    0
coconut  0    1     1    1    1    1    1

Aside: it would be nice if concat didn't require that the column name be passed in explicitly (as the key kwarg) when they exist...

Original answer

You can use for value_counts:

In [21]: g = df.groupby("node")

In [22]: g["comp"].value_counts()
Out[22]:
node     comp
banana   lel     2
         lal     1
coconut  lal     1
         lel     1
         lil     1
dtype: int64

In [23]: g["precedingWord"].value_counts()
Out[23]:
node     precedingWord
banana   a                2
         the              1
coconut  few              1
         some             1
         the              1
dtype: int64

Putting this in a single frame is a little trickier:

In [24]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)])
Out[24]:
          a  few  lal  lel  lil  some  the
node
banana  NaN  NaN    1    2  NaN   NaN  NaN
coconut NaN  NaN    1    1    1   NaN  NaN
banana    2  NaN  NaN  NaN  NaN   NaN    1
coconut NaN    1  NaN  NaN  NaN     1    1

In [25]: pd.concat([g["comp"].value_counts().unstack(1), g["precedingWord"].value_counts().unstack(1)]).fillna(0)
Out[25]:
         a  few  lal  lel  lil  some  the
node
banana   0    0    1    2    0     0    0
coconut  0    0    1    1    1     0    0
banana   2    0    0    0    0     0    1
coconut  0    1    0    0    0     1    1

You can map the columns to det1, det2, etc. before doing the concat, for example if you had the mapping as a dictionary.

In [31]: res = g["comp"].value_counts().unstack(1)

In [32]: res
Out[32]:
comp     lal  lel  lil
node
banana     1    2  NaN
coconut    1    1    1

In [33]: res.columns = res.columns.map({"lal": "det1", "lel": "det2", "lil": "det3"}.get)

In [34]: res
Out[34]:
         det1  det2  det3
node
banana      1     2   NaN
coconut     1     1     1

Alternatively you could use a list comprehension (if you didn't have the dict or have particular labels in mind):

In [41]: res = g["comp"].value_counts().unstack(1)

In [42]: res.columns = ['det%s' % i for i, _ in enumerate(df.columns)]

Hi there! Long time no activity for me. But I'm back. Thanks for the suggestions. I tried your first approach, but I can't get it to work. Please see my edit. Considering I'm actually working with an enormously large dataset, I'd like to get the most efficient approach. — Bram Vanroy, Nov 25 '15 at 10:31
@BramVanroy is this solved by using pd.concat with axis=1 i.e. `pd.concat([g["component"].value_counts().unstack(1), g["gender"].value_counts().unstack(1)], axis=1)`. You might also like to checkout `pd.crosstab`. Updating answer with crosstab! — Andy Hayden, Nov 25 '15 at 15:47
@BramVanroy If that doesn't help for the final join/concat aspect you might get more activity via asking a new question about that specific part (otherwise I'll try and answer next time I'm online). — Andy Hayden, Nov 25 '15 at 15:51
Considering I just added a bounty to the question this morning, I'd rather see this answered here. +1 for your efforts so far though! I did already get the crosstab functionality (see my very first post), but I don't know how to add the second factor (component) to it. In other words, I don't know how to solve that concatenation problem. — Bram Vanroy, Nov 25 '15 at 18:25
It did, but apparently I missed your comment! Thanks. I set the values to integers as well with `.astype(int)`, but I'm wondering if there's nothing more efficient/logical we could do. Can't you create a df that has integer values only? To me it seems that now we first create a frame, fill it with values, and then go over each cell again and convert it to int. Seems like overhead? ([current code](http://pastebin.com/96pUbqRn)) — Bram Vanroy, Nov 26 '15 at 21:51
@BramVanroy but you don't have to do that if you use exclusively crosstab and concat. There's no NaNs, that need filling/astypeing if you use crosstab. RIght? — Andy Hayden, Nov 27 '15 at 02:53
You're absolutely right! Everything s now working as I had hoped! — Bram Vanroy, Nov 29 '15 at 15:11

Dima Lituiev · Answer 2 · 2015-11-27T08:34:14.657

One can split your question in at least three:

how to group and pivot the table?
how to merge the table?
what is loc doing?

General note

Pandas provides speed up for some operations, so try library implementation before resorting to loops (see below)

Pivoting

1. With plain pandas:

df = pd.DataFrame({"det":["a","the","a","a","a", "the"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
"you will need a dummy variable:"
df["counts"] = 1
"you probably need to reset the index"
df_counts = df.groupby(["det","word"]).agg("count").reset_index()
#   det word  counts
#0    a  cat       2
#1    a  pet       3
#2  the  pet       1
"and pivot it"
df_counts.pivot( index = "word", columns = "det", values="counts").fillna(0)
#det   a  the
#word        
#cat   2    0
#pet   3    1

Example with two columns:

df = pd.DataFrame([['idee', 'het', 'lel', 1],
   ['idee', 'het', 'lel', 1],
   ['idee', 'de', 'lal', 1],
   ['functie', 'de', 'lal', 1],
   ['functie', 'de', 'lal', 1],
   ['functie', 'en', 'lil', 1],
   ['functie', 'de', 'lel', 1],
   ['functie', 'de', 'lel', 1]],
 columns = ['node', 'precedingWord', 'comp', 'counts'])
df["counts"] = 1
df_counts = df.groupby(["node","precedingWord", "comp"]).agg("count").reset_index()

df_counts
#      node precedingWord comp  counts
#0  functie            de  lal       2
#1  functie            de  lel       1
#2  functie            de  lil       1
#3  functie            en  lil       1
#4     idee            de  lal       1
#5     idee           het  lel       2

2. Using Counter

df = pd.DataFrame({"det":["a","the","a","a","a", "a"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})
acounter = Counter( (tuple(x) for x in df.as_matrix()) )
#Counter({('a', 'cat'): 2, ('a', 'pet'): 2, ('the', 'pet'): 2})
df_counts = pd.DataFrame(list(zip([y[0] for y in acounter.keys()], [y[1] for y in acounter.keys()], acounter.values())), columns=["det", "word", "counts"])
#   det word  counts
#0    a  cat       2
#1  the  pet       1
#2    a  pet       3
df_counts.pivot( index = "word", columns = "det", values="counts").fillna(0)
#det   a  the
#word        
#cat   2    0
#pet   3    1

This one is a bit faster in my case than pure pandas (52.6 µs vs 92.9 µs per loop for grouping; not counting pivoting)

3. As I understand it is a natural language processing problem. You can try combining all your data into a string and using CountVectorizer from sklearn and setting ngram_range=(1, 2). Something like:

df = pd.DataFrame({"det":["a","the","a","a","a", "a"], "word":["cat", "pet", "pet", "cat","pet", "pet"]})

from sklearn.feature_extraction.text import CountVectorizer
listofpairs = []
for _, row in df.iterrows():
    listofpairs.append(" ".join(row))

countvect = CountVectorizer(ngram_range=(2,2), min_df = 0.0, token_pattern='(?u)\\b\\w+\\b')
sparse_counts = countvect.fit_transform(listofpairs)

print("* input list:\n",listofpairs)
print("* array of counts:\n",sparse_counts.toarray())
print("* vocabulary [order of columns in the sparse array]:\n",countvect.vocabulary_)

counter_keys = [x[1:] for x  in sorted([ tuple([v] + k.split(" ")) for k,v in countvect.vocabulary_.items()])]
counter_values = np.sum(sparse_counts.toarray(), 0)

df_counts = pd.DataFrame([(x[0], x[1], y) for x,y in  zip(counter_keys, counter_values)], columns=["det", "word", "counts"])

Merging

Two options: 1. concat df1.set_index("word") df2.set_index("word") dfout = pd.concat([df1, df2], axis = 1)

2.merge

`loc`

It indexes rows (with one argument) or row,column with two arguments. It uses row/column name or boolean indexing (as in your case for rows).

If you have only one article per gender, you can use direct comparison instead of in operation, which might bring some speed up:

df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"

with

indices_neutral = df["precedingWord"]=="de" 
df.loc[indices, "gender"] = "neuter"

or shorter but less readable

df.loc[df["precedingWord"]=="de", "gender"] = "neuter"