I have a dataframe that consists of columns node, component, and preceding word. Node contains many identical values (alphabetically sorted), component contains many identical values as well, but scrambled, and preceding word can be all kind of words - but some identical as well.
What I want to do now is create some sort of cross-section/frequency list that shows the frequency of the component and the preceding word, linked to the node.
Let's say this is my df:
node precedingWord comp
banana the lel
banana a lel
banana a lal
coconut some lal
coconut few lil
coconut the lel
I am expecting a frequency list that shows each unique node, and the times that some value is found in the other columns given matching criteria, e.g.
det1 = a
det2 = the
comp1 = lel
comp2 = lil
comp 3 = lal
expected output:
node det1 det2 unspecified comp1 comp2 comp3
banana 2 1 0 2 0 1
coconut 0 1 0 1 1 1
I already did it for one variable, but I don't know how to get the comp columns in place:
det1 = ["a"]
det2 = ["the"]
df.loc[df.preceding_word.isin(det1), "determiner"] = "det1"
df.loc[df.preceding_word.isin(det2), "determiner"] = "det2"
df.loc[df.preceding_word.isin(det1 + det2) == 0, "determiner"] = "unspecified"
# Create crosstab of the node and gender
freqDf = pd.crosstab(df.node, df.determiner)
I got this answer from here. If anyone could explain what the loc
does, that would be a lot of help as well.
With Andy's answer in mind, I tried the following. Note that "precedingWord" has been replaced by "gender", which holds only values neuter, non_neuter, gender.
def frequency_list():
# Define content of gender classes
neuter = ["het"]
non_neuter = ["de"]
# Add `gender` column to df
df.loc[df.preceding_word.isin(neuter), "gender"] = "neuter"
df.loc[df.preceding_word.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.preceding_word.isin(neuter + non_neuter) == 0, "gender"] = "unspecified"
g = df.groupby("node")
# Create crosstab of the node, and gender and component
freqDf = pd.concat([g["component"].value_counts().unstack(1), g["gender"].value_counts().unstack(1)])
# Reset indices, starting from 1, not the default 0!
""" Crosstabs don't come with index, so we first set the index with
`reset_index` and then alter it. """
freqDf.reset_index(inplace=True)
freqDf.index = np.arange(1, len(freqDf) + 1)
freqDf.to_csv("dataset/py-frequencies.csv", sep="\t", encoding="utf-8")
The output is close to what I want, but not exactly:
- The crosstabs aren't "merged", in other words: first the rows are displayed for comp (
component
), then the same nodes are displayed forgender
. - Empty values ought to be
0
. - All values ought to be integers, no floats.
What I would want, then, is this:
Do note that I'm looking for the most efficient answer. I'm actually working with loads, and loads of data so each second per loop counts!