-2

I have a DataFrame with 3 columns (A,B,C) and a large number of rows. There are different types of elements in each of these columns: A1,A2... B1,B2... and C1, C2... respectively.

I want to find the number of times a particular combination (say (A1,B2,C2)) occurs in a row. Then I want to generate a (summary) list of all combinations whose frequency-counts exceed a fixed threshold. Example: combinations number of counts (A1,B2,C2) 5 (A2,B2,C2) 7 .... if the fixed value is 6.

I am new to pandas and numpy. Can this be done efficiently using pandas, and if so how?

mihir shanvir
  • 193
  • 1
  • 1
  • 13
  • 1
    You need to add example code and data. Also, just call your columns A,B,C and their values A1,A2,..., B1,B2,... ,C1,C2,... for clarity. – smci Jun 06 '17 at 08:39
  • 1
    Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Jun 06 '17 at 08:39
  • Yes it can be done in pandas. When you say *"I want to find the no. of times a particular combination (a_i,b_j,c_k) occurs in a row."*, do you want your output to be a table, or what? Show us example output. Then, write a better title, because *"iterating over a dataframe using pandas"* tells us nothing. – smci Jun 06 '17 at 08:40
  • @smci I want to get a list of all those combinations whose counts are greater than a fixed value. for ex: combinations No. of counts (A1,B2,C2) 5 (A2,B2,C2) 7 .... if the fixed value is 6. then i want a list of combinations for which count is greater than 6. – mihir shanvir Jun 06 '17 at 10:34
  • @MihirShanvir - Is order important in output list? – jezrael Jun 06 '17 at 10:36
  • @jezrael no it's not important. – mihir shanvir Jun 06 '17 at 10:44
  • @MihirShanvir - check my answer, if some problem let me know. – jezrael Jun 06 '17 at 10:47
  • @MihirShanvir: then please edit those new requirements into your question (click the 'Edit' button above). The original statement was very unclear ("iterate over dataframe"), that's why it got downvoted and closed. – smci Jun 07 '17 at 17:38
  • @MihirShanvir: useful terms to use here: *"frequency-count"* and *"generate table of frequency-counts for variable combinations"*, then *"apply a threshold min-count to that table"*. – smci Jun 07 '17 at 17:42
  • @MihirShanvir so in your example where the frequency-counts are (A1,B2,C2) 5 (A2,B2,C2) 7 and your threshold is 6, should the list exclude (A1,B2,C2) or not? – smci Jun 07 '17 at 17:44

2 Answers2

1
df = pd.DataFrame({'A':['A1','A1','A2','A3'],
                   'B':[4,4,6,4],
                   'C':[7,7,9,7]})

print (df)
    A   B  C
0   4   7  C1
1   4   7  C1
2   6   9  C2
3   4   7  C3

For count of all combination use groupby + size:

s = df.groupby(["A", "B","C"]).size()
print (s)
A   B   C
A1  4   7     2
A2  6   9     1
A3  4   7     1
dtype: int64

For filtered list by value add boolean indexing:

L = s.index[s > 1].tolist()
print (L)
[('A1', 4, 7)]
smci
  • 32,567
  • 20
  • 113
  • 146
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

Using only pandas, one way could be to use DataFrame.groupby():

counts = dict()

for group in df.groupby(['It', 'Cc', 'Ct']):
    print(group)
    counts[group[0]] = len(group[1])

A faster way is probably to transform your dataframe into a list and use Counter from collections:

from collections import Counter

listed_df = [tuple(line) for line in list(df.values)]
counts = Counter(listed_df)
Pietro Tortella
  • 1,084
  • 1
  • 6
  • 13