
I have an Excel file that I'm taking data from. The data is basically a table of a day and the people playing together:

day pl1 pl2
Mon 1000 1011
Tue 1100 0101
Tue 1000 0121
Wed 0101 1101
Mon 0210 1212
Wed 1000 0101

I want to check that no id is playing more than once (regardless of whether it is in column pl1 or pl2) on one day. For example, on Wed we have "0101" two times, once as pl1 and once as pl2, and I want to catch this.

And I'm looking for the fastest and most Pythonic way.

  1. I have thought of checking all elements of the pl1 list and the pl2 list, and if I find the same value anywhere, checking whether the value in the day column is the same or not. However, not only do I think this would be extremely slow to process, I also think it is more complicated to check.

  2. The other thought is to move them into lists of lists [[Mon,1000,1012],[Tue,1110,0101]...], group them by day, and then check the rest of the elements. That still seems to take too much time.

  3. Should I create a list of tuples instead? (I already use tuples to check that there are no identical pairs, i.e. the same id in the same position pl1 and pl2.)

Is there a faster and more compact way?

Thanks

ToErotimatiko

1 Answer


I like this question, it's a nice little brain teaser!

There are probably module tools out there in pandas that can do something sql-like with a self join or the like, but I wanted to see if it could easily be done with just core Python.

The first way that occurred to me was to create a list of tuples, where each tuple is a pair of values: the first being the day and the second being one of the players. Each line of values then generates 2 tuples, one for player 1 and one for player 2. I.e., the first line of data:

Mon,1000,1011

would be put into 2 tuples:

('Mon',1000),('Mon',1011)

Add all of those tuples to a list. Then search the list for duplicates, which is a common enough task. Assuming the player data was in a csv file called "players.csv" like this:

day,pl1,pl2
Mon,1000,1011
Tue,1100,0101
Tue,1000,0121
Wed,0101,1101
Mon,0210,1212
Wed,1000,0101

the code below should return the information you are looking for:

# import data, skipping the header line
with open('players.csv', 'r') as playerFile:
    playerData = playerFile.readlines()[1:]

# create list of tuples, 1 tuple for each player in line of data
plays = []
for line in playerData:
    day, p1, p2 = line.strip().split(',')
    plays.extend([(day, p1), (day, p2)])

# now check the list for duplicate tuples
# since each tuple is a day that a player played, 
# if a player played twice on 1 day, it'll be a duplicate
seen = set()
dupes = set()

for play in plays:
    if play in seen:
        dupes.add(play)
    else:
        seen.add(play)
        
# print duplicates
for dupe in dupes:
    print(dupe)

Output:

('Wed', '0101')

(Note that I had to use tuples instead of lists for each "play", since I wanted to identify each duplicate only once; i.e., if player 0101 played 3 times on one day, I only wanted one item. A set does that nicely, but you can't add lists to sets since lists are mutable.)
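The same seen/dupes logic can be written more compactly with `collections.Counter`, which counts each (day, player) pair in one pass (a sketch, using the sample data inlined as a list of tuples rather than read from the file):

```python
from collections import Counter

rows = [
    ("Mon", "1000", "1011"),
    ("Tue", "1100", "0101"),
    ("Tue", "1000", "0121"),
    ("Wed", "0101", "1101"),
    ("Mon", "0210", "1212"),
    ("Wed", "1000", "0101"),
]

# Count each (day, player) pair; a count > 1 means that player
# appeared more than once on that day, in either column.
counts = Counter((day, p) for day, p1, p2 in rows for p in (p1, p2))
dupes = [play for play, n in counts.items() if n > 1]
print(dupes)  # → [('Wed', '0101')]
```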

RNGHatesMe
  • Wow, nice idea! For my beginner, self-taught mind, this is a very understandable concept! :) It makes sense! And I should have thought of this, because I already use tuples to check that no pair plays together more than once in the same position! Thank you!!! – ToErotimatiko Dec 18 '21 at 13:05
  • I like questions like these because they are very open-ended, and there are many perfectly fine ways to approach them. I challenged my son to look at it (without seeing my approach) to see what he would do, and he suggested creating a list of players for each day, then checking each day for duplicates. However, he agreed that this approach was better, as there was only a single list to check for duplicates. I'm originally self-taught myself ;-). In some ways I think it's better, as it leaves us freer to find original approaches! – RNGHatesMe Dec 18 '21 at 13:46
  • Also, this question is very "database-y", and would lend itself well to database-type solutions. If you are going to be dealing with this type of data a lot, you should look into getting familiar with SQL. – RNGHatesMe Dec 18 '21 at 13:51