I have a file, say file1.txt, which has multiple rows and columns. I want to read it and store it as a list of lists, and then pair the rows so that no row is ever paired with an identical row. The second-to-last column represents the class. Below is my file:

27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92

Here all 6 rows are class 1. I am using the logic below to do the pairing.

from operator import itemgetter

rule_file_name = 'file1.txt'
rule_fp = open(rule_file_name)

list1 = []
for line in rule_fp.readlines():
    list1.append(line.replace("\n","").split(","))

list1=sorted(list1,key=itemgetter(-1),reverse=True)

length = len(list1)
middle_index = length // 2
first_half = list1[:middle_index]
second_half = list1[middle_index:]
result=[]
result=list(zip(first_half,second_half))

for a,b in result:
    if a==b:
        result.remove((a, b))

print(result)
print("-------------------")

It works fine when I have only one class. But if my file has multiple classes, I want the pairing to be done within the same class only. For example, if my file looks like the one below (say file2.txt):

27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97

Then I want to make 3 pairs from class 1, 2 from class 2 and 1 from class 3. I am using the logic below to build a dictionary whose keys are the classes.

d = {}
sorted_grouped = []
for row in list1:
    # Create an empty list for this class the first time it is seen
    if row[-2] not in d:
        d[row[-2]] = []
    # Append the whole row to the list for its class
    d[row[-2]].append(row)
#print(d.items())

# Collect the per-class row lists
for k,v in d.items():
    sorted_grouped.append(v)
#print(sorted_grouped)

# Map each class label back to its list of rows
gp_vals = {}
for i in sorted_grouped:
    gp_vals[i[0][-2]] = i
print(gp_vals)

How can I do the pairing per class from here? Please help!

My desired output for file2 is:

[([43,44,45,46,1,0.92], [39,40,41,42,1,0.82]), ([43,44,45,46,1,0.92], [27,28,29,30,1,0.67]), ([31,32,33,34,1,0.84], [35,36,37,38,1,0.45])]
[([55,56,57,58,2,0.77], [59,60,61,62,2,0.39]), ([63,64,65,66,2,0.41], [51,52,53,54,2,0.28])]
[([90,91,92,93,3,0.97], [75,76,77,78,3,0.51])]

Edit1:

  1. Every file will have an even number of rows, and every class within it will have an even number of rows as well.

  2. For a particular class (say class 2) with n rows, at most n/2 of those rows can be identical.

  3. My primary intention was to get random pairing while making sure no self-pairing is allowed. For that, I thought of taking the row with the highest fitness value (the last column) inside a class, picking any other row from that class at random, and making a pair only if the two rows are not exactly the same. The same thing is repeated for every class separately.

Dev
  • Your first snippet which "is working absolutely fine" has a couple of potential problems. The `zip()` will leave out the last row of the file if it contains an odd number of them. You're modifying the list you're iterating over when trying to remove pairs, which won't work right if it ever removes one of them (it doesn't with the sample data in your question). I think you should fix these things before trying to make the code handle multiple classes. – martineau Apr 08 '22 at 17:35
  • About the first case, actually my file will always have even number of rows. And about the second problem can you please explain a bit more please! – Dev Apr 08 '22 at 17:44
  • You should have mentioned that there would always be an even number of rows. For second problem see [How to modify list entries during for loop?](https://stackoverflow.com/questions/4081217/how-to-modify-list-entries-during-for-loop) – martineau Apr 08 '22 at 17:51
  • Yes, I should have mentioned that. Sorry for that. I have read the answer and understood that. But can you please guide me how to do the multi class thing! Because then I will be able to check with my larger datasets and will come to know if things will work fine or not and accordingly I'll try to fix that. – Dev Apr 08 '22 at 17:56
  • @Dev are you expecting a *random* pairing within each class, or would the first two in class 1, second two in class 1, etc. pairings be acceptable? – Lucas Roberts Apr 08 '22 at 18:03
  • @LucasRoberts Actually my intention was to get random pairing but making sure no self pairing is allowed. For that I thought of taking the row with the highest fitness value(The last column) inside any class and take any other row from that class randomly and make a pair just by making sure both the rows are not exactly the same. And this same thing is repeated for every class separately. – Dev Apr 08 '22 at 18:08
  • @Dev in file2.txt is the row `43,44,45,46,1,0.92` that gets duplicated valid? e.g. could that row be paired with the duplicate? e.g. in the edit `make a pair just by making sure both the rows are not exactly the same` but two rows are already exactly the same... – Lucas Roberts Apr 08 '22 at 18:47
  • @LucasRoberts As there are 6 rows for class 1, so every row has 5 option to chose apart from `43,44,45,46,1,0.92` as it can't pair with itself, so it has to pick from the remaining 4 rows from the same class! – Dev Apr 08 '22 at 18:50
  • ([43,44,45,46,1,0.92],[43,44,45,46,1,0.92]) This can't be one pair. – Dev Apr 08 '22 at 18:51
  • The file can contain duplicated values but those duplicate rows will never make pair with themselves. – Dev Apr 08 '22 at 18:53
  • also the first entry in the second tuple of the expected output `[31, 32, 33, 34, 1, 0.92]` is not in the example file... and the duplicate `[43,44,45,46,1,0.92]` only appears once – Lucas Roberts Apr 08 '22 at 18:56
  • Sorry I messed that up a little. I am correcting that. – Dev Apr 08 '22 at 18:58
  • @LucasRoberts I am extremely sorry, Now I have edited my desired output part. – Dev Apr 08 '22 at 19:04
  • @Dev are the duplicates always from the highest 'fitness' value? – Lucas Roberts Apr 08 '22 at 19:26
  • @LucasRoberts yes. – Dev Apr 08 '22 at 19:27
  • Hey! @LucasRoberts now I am facing a situation where the duplicates are not always from the highest fitness. Example `[1,*,*,3,2,0.95] [*,*,3,2,2,0.66] [*,*,3,4,2,0.67] [3,*,*,*,2,0.33] [3,*,*,*,2,0.33] [3,*,*,*,2,0.33]` if these are my 6 rows , then the first 2 selected pairs are `(['1', '*', '*', '3', '2', '0.95'],['*', '*', '3', '2', '2', '0.66']) (['*', '*', '3', '4', '2', '0.67'],['3', '*', '*', '*', '2', '0.33'])` now the remaining are : `[3,*,*,*,2,0.33] [3,*,*,*,2,0.33]` which can't make a pair. The program is stopping there, I just simply want those two, not to get selected. – Dev Apr 09 '22 at 16:35
  • In the above question, while editing(Edit:1) I actually mentioned this in point 2. – Dev Apr 09 '22 at 16:44

1 Answer

First, read in the data from the file. I'd use assert here to communicate your assumptions to people who read the code (including future you) and to confirm that each assumption actually holds for the file. If it does not, an AssertionError is raised.

rule_file_name = 'file2.txt'
list1 = []
with open(rule_file_name) as rule_fp:
    for line in rule_fp:
        if line.strip(): # skip blank lines, e.g. a trailing empty line
            list1.append(line.rstrip("\n").split(","))

assert len(list1) & 1 == 0 # confirm length is even

Then use a defaultdict to store the lists for each class.

from collections import defaultdict

classes = defaultdict(list)
for _list in list1:
    classes[_list[4]].append(_list)
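The two data assumptions stated in the question's Edit1 (an even number of rows per class, and at most n/2 identical rows per class) could also be checked up front instead of relying on the pairing loop to expose them. Below is a minimal sketch, assuming the class label sits at index 4 as in the sample files. `check_assumptions` is a hypothetical helper name used only for illustration, not something from the answer above.

```python
from collections import Counter, defaultdict


def check_assumptions(rows, class_col=4):
    """Return a list of problem descriptions; empty if both assumptions hold."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[class_col]].append(row)

    problems = []
    for label, members in by_class.items():
        n = len(members)
        if n % 2:
            problems.append(f"class {label}: odd number of rows ({n})")
        # Rows are lists, so convert them to tuples to make them hashable.
        top_count = Counter(map(tuple, members)).most_common(1)[0][1]
        if top_count > n // 2:
            problems.append(
                f"class {label}: one row occurs {top_count} times, "
                f"more than n/2 = {n // 2}"
            )
    return problems
```

Calling `check_assumptions(list1)` before pairing returns an empty list when the data is as described, and otherwise names the classes that violate either assumption.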

Then use sample to draw pairs and confirm they aren't the same. Here I'm including a seed to make the results reproducible but you can take that out for randomness.

from random import sample, seed

seed(1) # remove this line when you want actual randomness
for key, _list in classes.items():
    assert len(_list) & 1 == 0 # each class must also have an even number of rows
    _list.sort(key=lambda x: float(x[5])) # sort numerically by fitness, ascending
    pairs = []
    while _list:
        first = _list[-1] # row with the highest remaining fitness
        candidate = sample(_list, 1)[0] # random row; if it equals first, retry
        if first != candidate:
            print(f'first {first}, candidate {candidate}')
            pairs.append((first, candidate))
            _list.remove(first)
            _list.remove(candidate)
    classes[key] = pairs

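Note that the sampling approach described in the question's edit implicitly assumes that any duplicates are among the highest-fitness rows. If that is not true, the loop above can run forever: once every remaining row in a class is identical, `first != candidate` is never true and `_list` never becomes empty.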
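One way to avoid that situation is to draw the partner only from rows that actually differ from the highest-fitness row, and to stop when no such row is left. The sketch below assumes that leaving unpairable rows out is acceptable, which is what the asker requests in the comments. `pair_rows` and its `rng` parameter are hypothetical names introduced for illustration, and this is a variation on the loop above rather than the same algorithm, so it will not reproduce the seeded output shown below.

```python
import random


def pair_rows(rows, rng=random):
    """Pair the highest-fitness row with a random non-identical row.

    Returns (pairs, leftovers). Leftovers are rows that could not be
    paired because every remaining row was identical.
    """
    # Sort ascending by fitness (last column) so pop() yields the highest.
    remaining = sorted(rows, key=lambda r: float(r[-1]))
    pairs = []
    while len(remaining) >= 2:
        first = remaining.pop()
        # Indices of rows that are not identical to `first`.
        choices = [i for i, r in enumerate(remaining) if r != first]
        if not choices:
            # Everything left is a copy of `first`: give up on this class.
            remaining.append(first)
            break
        partner = remaining.pop(rng.choice(choices))
        pairs.append((first, partner))
    return pairs, remaining
```

It could be applied per class with something like `pairs, leftovers = pair_rows(classes[key])`. Because the partner is chosen from a filtered list, each iteration either forms a pair or ends the loop, so it cannot spin forever.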

If you want to print them then iterate over the dictionary again:

for key, pairs in classes.items():
    print(key, pairs)

which for me gives:

1 [(['43', '44', '45', '46', '1', '0.92'], ['27', '28', '29', '30', '1', '0.67']), (['43', '44', '45', '46', '1', '0.92'], ['31', '32', '33', '34', '1', '0.84']), (['39', '40', '41', '42', '1', '0.82'], ['35', '36', '37', '38', '1', '0.45'])]
2 [(['55', '56', '57', '58', '2', '0.77'], ['51', '52', '53', '54', '2', '0.28']), (['63', '64', '65', '66', '2', '0.41'], ['59', '60', '61', '62', '2', '0.39'])]
3 [(['90', '91', '92', '93', '3', '0.97'], ['75', '76', '77', '78', '3', '0.51'])]

Using these values for file2.txt:

27,28,29,30,1,0.67
31,32,33,34,1,0.84
35,36,37,38,1,0.45
39,40,41,42,1,0.82
43,44,45,46,1,0.92
43,44,45,46,1,0.92
51,52,53,54,2,0.28
55,56,57,58,2,0.77
59,60,61,62,2,0.39
63,64,65,66,2,0.41
75,76,77,78,3,0.51
90,91,92,93,3,0.97
Lucas Roberts
  • `NameError: name 'defaultdict' is not defined` – Dev Apr 08 '22 at 19:36
  • you need to import it first, I added a line for that – Lucas Roberts Apr 08 '22 at 19:38
  • @martineau I'm not sure I follow, I'm not taking modulus anywhere, is it possible you've misread my answer? Plus `assert len(list1) & 2 == 0` will not confirm the list's length is even, only that the 2 bit is off. Consider `4` which is `bin(4)` or `'0b100'` in python, here the two bit is not set, yet 4 ***is*** divisible by 2. – Lucas Roberts Apr 08 '22 at 20:35
  • Hey! @LucasRoberts now I am facing a situation where the duplicates are not always from the highest fitness. Example `[1,*,*,3,2,0.95] [*,*,3,2,2,0.66] [*,*,3,4,2,0.67] [3,*,*,*,2,0.33] [3,*,*,*,2,0.33] [3,*,*,*,2,0.33]` if these are my 6 rows , then the first 2 selected pairs are `(['1', '*', '*', '3', '2', '0.95'],['*', '*', '3', '2', '2', '0.66']) (['*', '*', '3', '4', '2', '0.67'],['3', '*', '*', '*', '2', '0.33'])` now the remaining are : `[3,*,*,*,2,0.33] [3,*,*,*,2,0.33]` which can't make a pair. The program is stopping there, I just simply want those two, not to get selected. – Dev Apr 09 '22 at 16:48
  • My observation is only the last remaining pair will fall in this trap. For an example if there are 14 rows then the 6 pairs will be formed correctly, only the last remaining 2 will fall in this trap. I was trying with `while(len(_list)>2):` but in this case the last 2 will be left out always irrespective of they are same or different. – Dev Apr 09 '22 at 17:01
  • https://stackoverflow.com/questions/71810312/reading-a-file-in-list-of-list-form-and-pairing-the-rows-of-that-in-python this is the detailed one. Please tell me if I can use a timer? – Dev Apr 09 '22 at 18:33