I have a data frame that looks something like this:

Category Topic
Category1 Topic1
Category2 Topic2
Category1 Topic2
Category3 Topic3
Category2 Topic3
Category3 Topic3

And I want an output like this:

Category   Topic   Frequency
Category1  Topic1
           Topic2
           Topic3
Category2  Topic1
           Topic2
           Topic3
Category3  Topic1
           Topic2
           Topic3

I am just starting out with Python and I'd really appreciate it if someone could help me out with this.

  • You can check out `groupby`, but I'm guessing you may actually be looking for a pivot: https://stackoverflow.com/questions/47152691/how-can-i-pivot-a-dataframe – Chris Apr 06 '22 at 12:07
  • What should go in the `frequency` column? The relative frequency of `topic1`, `topic2`, etc. within a `category` (e.g. so that the sum for the first three rows of your output example would be 1)? – Pierre D Apr 06 '22 at 12:07
  • Welcome to Stack Overflow. Take a look at the [guide](https://stackoverflow.com/help/how-to-ask) on how to ask a quality question. In particular, it's good to give a sense of what you already tried including things that you searched for on SO. – PeterK Apr 06 '22 at 13:05

1 Answer

If the frequency is meant to capture how often each topic occurs within each category, then a basic approach is:

df.groupby('Category')['Topic'].value_counts(normalize=True)

This returns a Series. For example, on your input data, we get:

Category   Topic 
Category1  Topic1    0.5
           Topic2    0.5
Category2  Topic2    0.5
           Topic3    0.5
Category3  Topic3    1.0
Name: Topic, dtype: float64

Your example output appears to be a DataFrame with three columns. To get that layout, convert the Series to a frame and reset the index:

out = (
    df
    .groupby('Category')['Topic']
    .value_counts(normalize=True)
    .to_frame('frequency')
    .reset_index()
)

Again, on your input sample:

>>> out
    Category   Topic  frequency
0  Category1  Topic1        0.5
1  Category1  Topic2        0.5
2  Category2  Topic2        0.5
3  Category2  Topic3        0.5
4  Category3  Topic3        1.0
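
The desired output in the question lists every topic under every category, including pairs that never occur in the data. If that is what is wanted (an assumption on my part, since the question does not say so explicitly), one possible sketch is to reindex the grouped result over the full Category × Topic product and fill the missing pairs with 0. The names `freq`, `full_index` and `out_full` are only illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'Category': ['Category1', 'Category2', 'Category1',
                 'Category3', 'Category2', 'Category3'],
    'Topic': ['Topic1', 'Topic2', 'Topic2', 'Topic3', 'Topic3', 'Topic3'],
})

# Relative frequency of each topic within its category (same as above).
freq = df.groupby('Category')['Topic'].value_counts(normalize=True)

# Assumption: every Category/Topic pair should appear in the result,
# even when it was never observed in the data.
full_index = pd.MultiIndex.from_product(
    [sorted(df['Category'].unique()), sorted(df['Topic'].unique())],
    names=['Category', 'Topic'],
)

out_full = (
    freq
    .reindex(full_index, fill_value=0.0)  # unobserved pairs get 0.0
    .rename('frequency')                  # name the value column explicitly
    .reset_index()
)
print(out_full)
```

On the sample data this gives nine rows (three topics for each of the three categories); for instance, the Category1/Topic3 row has a frequency of 0.0, and the frequencies within each category still sum to 1.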
Pierre D