I have a pandas DataFrame df of values from different systems:

 System Value1 Value2 Value3...
 S1     x      x      x...
 S2     x      x      x...
 S3     x      x      x...

I want to know which Value1 entries occur in all systems and collect them in a list.

This is what I have tried so far. First, I created a list (identvalue) of the Value1 entries that occur exactly as often as the number of systems n:

    identvalue = []
    from collections import defaultdict
    dic = defaultdict(int)
    Input = df['Value1']
    for i in Input:
        dic[i]+= 1
    n = len(systemno) # number of systems in list
    for element in Input:
        if element in dic.keys() and dic[element] == n:
            identvalue.append(element)
    identvalue=list(set(identvalue)) # remove multiple entries

Next, I have to remove from the identvalue list those entries that occur n times, but not once per system. I tried several things:

    idv = identvalue
    i=0
    while i < len(identvalue):
        tmp1= df.loc[df['Value1'] == identvalue[i]]
        no_ids = len(set(tmp1['System']))
        if no_ids != n:
            idv.remove(identvalue[i])
        i += 1

But here, I get an IndexError: list index out of range.

Then I tried:

    idv = identvalue
    for element in identvalue:
        tmp1= df.loc[df['Value1'] == element]
        no_ids = len(set(tmp1['System']))
        if no_ids != n:
            idv.remove(element)

But here, it does not run through the full identvalue list; it finishes (without an error message) after half of the list. The same happens when using the enumerate function. What am I doing wrong? And I guess there's a much easier way to achieve my goal anyway?

Charnel

1 Answer

The main problem with your code is that you remove elements from a list while iterating over it. You probably anticipated this, since you wrote idv = identvalue, but that assignment does not copy the list: it gives two names to the same object in memory. So idv.remove(identvalue[i]) is the same as identvalue.remove(identvalue[i]), and you are still modifying the list you iterate over.
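To make the aliasing concrete, here is a minimal sketch. The list contents are made up for illustration and are not your data; the loop removes every element it visits, which is the worst case for your second attempt.

```python
# Hypothetical sample values, only to illustrate the effect
identvalue = [1, 2, 3, 4, 5, 6]
idv = identvalue                  # no copy: both names refer to one list
same_object = idv is identvalue
print(same_object)                # True

visited = []
for element in identvalue:
    visited.append(element)
    idv.remove(element)           # shrinks the list being iterated as well

print(visited)                    # [1, 3, 5]
print(identvalue)                 # [2, 4, 6]
```

Each removal shifts the remaining elements one position to the left, so the loop skips the element that moves into the slot it just visited. That is why the for loop stops after half of the list, and why the while loop with an index computed beforehand can run past the end.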

If you want to keep your approach and avoid these errors, make an actual copy of the list, for example with idv = identvalue.copy().
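A minimal sketch of that fix, under some assumptions: the `rows` list and the `n_systems` helper are hypothetical stand-ins for your DataFrame lookup, so the example runs without pandas. The point is only that the loop iterates over the original list and removes from an independent copy.

```python
# Hypothetical (System, Value1) rows standing in for the DataFrame
rows = [("s1", 1), ("s2", 1), ("s3", 1),
        ("s1", 2), ("s2", 2), ("s2", 2)]
n = 3  # number of systems

def n_systems(value):
    """Count the distinct systems in which `value` occurs (hypothetical helper)."""
    return len({system for system, v in rows if v == value})

identvalue = [1, 2]          # both values occur n times in total
idv = identvalue.copy()      # independent copy; list(identvalue) or identvalue[:] also work

for element in identvalue:   # iterate over the original list
    if n_systems(element) != n:
        idv.remove(element)  # mutate only the copy

print(idv)                   # [1]

# Same result without removing anything: build the filtered list directly
idv_filtered = [v for v in identvalue if n_systems(v) == n]
print(idv_filtered)          # [1]
```

The list comprehension is usually the simpler choice, since it never modifies a list that is being traversed.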

That said, your problem can be solved with groupby and nunique.

    import pandas as pd

    # data sample
    df = pd.DataFrame({'System':['s1','s2','s3']*4,
                       'Value1':[1,1,1,2,2,3,2,3,4,3,4,3]})

    # get the number of unique systems in the full dataframe
    nb_unique = df['System'].nunique()
    print(nb_unique)
    # 3

    # get, for each Value1, the number of unique systems
    s = df.groupby('Value1')['System'].nunique()
    print(s)
    # Value1
    # 1    3 # means Value1=1 has 3 unique System
    # 2    2 # means Value1=2 has 2 unique System
    # 3    3
    # 4    2
    # Name: System, dtype: int64

    # keep only the Value1 entries (the index of s)
    # that have the same number of unique systems as the dataframe
    res = s[s==nb_unique].index.tolist()
    print(res)
    # [1, 3]

Note that Value1=3 occurs twice for System=s3, but at least once in each system, so I assume you would keep this value as well.
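If that assumption is wrong and you need values that occur exactly once per system, one possible variant is to also compare the row count per value. This is a sketch on the same sample data, not something the question confirms you need; the names `stats` and `strict` are introduced here for illustration.

```python
import pandas as pd

# Same sample data as in the answer
df = pd.DataFrame({'System': ['s1', 's2', 's3'] * 4,
                   'Value1': [1, 1, 1, 2, 2, 3, 2, 3, 4, 3, 4, 3]})

nb_unique = df['System'].nunique()

# Per Value1: number of distinct systems and total number of rows
stats = df.groupby('Value1')['System'].agg(['nunique', 'size'])

# Exactly once per system: every system present and no extra rows
mask = (stats['nunique'] == nb_unique) & (stats['size'] == nb_unique)
strict = stats[mask].index.tolist()
print(strict)
# [1]
```

Here Value1=3 is dropped because it has four rows for three systems, while Value1=1 has exactly one row per system.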

Ben.T