2

I've got two lists of tuples of the form:

playerinfo = [('ansonca01', 4, 1871, 1, 'RC1'), ('forceda01', 44, 1871, 1, 'WS3'), ('mathebo01', 68, 1871, 1, 'FW1')]

idmatch = [('ansonca01', 'Anson', 'Cap', '05/06/1871'), ('aaroh101', 'Aaron', 'Hank', '04/13/1954'), ('aarot101', 'Aaron', 'Tommie', '04/10/1962')]

How can I iterate through both lists and, whenever the first element of a tuple in "playerinfo" matches the first element of a tuple in "idmatch", merge the two matching tuples into a new list of tuples? The result should have the form:

merged_data = [('ansonca01', 4, 1871, 1, 'RC1', 'Anson', 'Cap', '05/06/1871'), (...), (...), etc.]

The new list of tuples would have the ID number matched to the first and last names of the correct player.

Background info: I'm trying to merge two CSV documents of baseball statistics. The one with all of the relevant stats doesn't contain player names, only a reference number (e.g. 'ansoc101'), while the second document contains the reference number in one column and the first and last names of the corresponding player in the other columns.

The CSV is too large to merge manually (about 20,000 players), so I'm trying to automate the process.

martineau
re_ed1138
  • Tuples are immutable; you can't change which items they contain after construction. I'd just use a list of lists, or even better, a list of objects, or a single object keyed by the ID number. – taesu Mar 02 '15 at 16:07

3 Answers

6

Use a list comprehension to iterate over your lists:

[x + y[1:] for x in list1 for y in list2 if x[0] == y[0]]

I tried this on the lists:

list1 = [("this", 1, 2, 3), ("that", 1, 2, 3), ("other", 1, 2, 3)]
list2 = [("this", 5, 6, 7), ("that", 10, 11, 12), ("notother", 1, 2, 3)]

and got:

[('this', 1, 2, 3, 5, 6, 7), ('that', 1, 2, 3, 10, 11, 12)]

Is that what you wanted?
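
For anyone less used to comprehensions, the one-liner above should be equivalent to the nested loops below. This is only a sketch; the function name `merge_nested` is made up for illustration.

```python
def merge_nested(list1, list2):
    """Spelled-out form of the comprehension (merge_nested is an illustrative name)."""
    merged = []
    for x in list1:          # every tuple in the first list...
        for y in list2:      # ...is compared against every tuple in the second
            if x[0] == y[0]:
                # Keep all of x, then everything in y except the shared first element.
                merged.append(x + y[1:])
    return merged


list1 = [("this", 1, 2, 3), ("that", 1, 2, 3), ("other", 1, 2, 3)]
list2 = [("this", 5, 6, 7), ("that", 10, 11, 12), ("notother", 1, 2, 3)]

result = merge_nested(list1, list2)
print(result)
```

Written this way, it is easier to see that the inner loop runs once for every element of `list1`, so the work grows with the product of the two list lengths.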

Sam
  • 2
    Actually this solution, while succinct, is quite inefficient. I forgot that you said you were working with around 20,000 items. This comprehension would do 20,000 x 20,000 comparisons, i.e. too many. The other solutions using dictionaries are *much* better for large datasets. – Sam Mar 02 '15 at 17:13
4

You could first create a dictionary to enable fast ID number look-ups, and then merge the data from the two lists together very efficiently with a list comprehension:

import operator

playerinfo = [('ansonca01', 4, 1871, 1, 'RC1'),
              ('forceda01', 44, 1871, 1, 'WS3'),
              ('mathebo01', 68, 1871, 1, 'FW1')]

idmatch = [('ansonca01', 'Anson', 'Cap', '05/06/1871'),
           ('aaroh101', 'Aaron', 'Hank', '04/13/1954'),
           ('aarot101', 'Aaron', 'Tommie', '04/10/1962')]

get_id = operator.itemgetter(0)  # Extracts the ID field (named get_id to avoid shadowing the built-in id()).

idinfo = {get_id(rec): rec[1:] for rec in idmatch}  # Dict for fast look-ups.

merged = [info + idinfo[get_id(info)] for info in playerinfo if get_id(info) in idinfo]

print(merged) # -> [('ansonca01', 4, 1871, 1, 'RC1', 'Anson', 'Cap', '05/06/1871')]
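
Since the data starts out as two CSV documents, the same dictionary look-up can probably be applied straight to the rows from the `csv` module. The sketch below is only an illustration: it uses `io.StringIO` stand-ins instead of real files, assumes neither file has a header row, and assumes the ID is in the first column of both. Note that `csv.reader` yields every field as a string.

```python
import csv
import io

# Stand-ins for the two CSV documents (assumption: no header rows, ID in column 0).
# With real files, open(path, newline="") would take the place of io.StringIO.
stats_csv = io.StringIO("ansonca01,4,1871,1,RC1\nforceda01,44,1871,1,WS3\n")
names_csv = io.StringIO("ansonca01,Anson,Cap,05/06/1871\naaroh101,Aaron,Hank,04/13/1954\n")

# Map each ID to the remaining name columns for fast look-ups.
names_by_id = {row[0]: row[1:] for row in csv.reader(names_csv) if row}

# Append the name columns to every stats row whose ID is known.
merged_rows = [row + names_by_id[row[0]]
               for row in csv.reader(stats_csv)
               if row and row[0] in names_by_id]

# Write the merged rows back out in CSV form.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(merged_rows)
print(out.getvalue())
```

Each stats row costs one dictionary look-up, so this should scale to the roughly 20,000 players mentioned in the question.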
martineau
0

Dictionary

  1. Iterate over the playerinfo list and build a dictionary whose key is the first item of each tuple and whose value is a list of all the tuple's items.
  2. Print the result of the first step.
  3. Iterate over the idmatch list and check whether the first item of each tuple is a key in the result dictionary. If it is, extend that key's value with the remaining items using the list extend method.
  4. Print the result of the second step.
  5. Build the output list of tuples from the generated dictionary.

Demo:

import pprint

playerinfo = [("ansonca01", 4, 1871, 1, "RC1"),
              ("forceda01", 44, 1871, 1, "WS3"),
              ("mathebo01", 68, 1871, 1, "FW1")]

idmatch = [("ansonca01", "Anson", "Cap", "05/06/1871"),
           ("aaroh101", "Aaron", "Hank", "04/13/1954"),
           ("aarot101", "Aaron", "Tommie", "04/10/1962")]

# Step 1: key each player record by its ID (first item of the tuple).
result = {}
for i in playerinfo:
    result[i[0]] = list(i)

print("Debug Result1:")
pprint.pprint(result)

# Step 3: append the name fields to every record whose ID is already known.
for i in idmatch:
    if i[0] in result:
        result[i[0]].extend(i[1:])

print("\nDebug Result2:")
pprint.pprint(result)

# Step 5: convert the lists back to tuples.
final_rs = []
for j in result.values():
    final_rs.append(tuple(j))

print("\nFinal result:")

pprint.pprint(final_rs)

Output:

infogrid@infogrid-vivek:~/workspace/vtestproject$ python task4.py 
Debug Result1:
{'ansonca01': ['ansonca01', 4, 1871, 1, 'RC1'],
 'forceda01': ['forceda01', 44, 1871, 1, 'WS3'],
 'mathebo01': ['mathebo01', 68, 1871, 1, 'FW1']}

Debug Result2:
{'ansonca01': ['ansonca01', 4, 1871, 1, 'RC1', 'Anson', 'Cap', '05/06/1871'],
 'forceda01': ['forceda01', 44, 1871, 1, 'WS3'],
 'mathebo01': ['mathebo01', 68, 1871, 1, 'FW1']}

Final result:
[('ansonca01', 4, 1871, 1, 'RC1', 'Anson', 'Cap', '05/06/1871'),
 ('forceda01', 44, 1871, 1, 'WS3'),
 ('mathebo01', 68, 1871, 1, 'FW1')]
infogrid@infogrid-vivek:~/workspace/vtestproject$ 
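
Note that the final result above also keeps players that have no entry in idmatch. If only matched players are wanted, as in the question's merged_data, one option might be to record which IDs were extended. This is a rough sketch: `matched_ids` and `matched_only` are illustrative names, and it assumes each ID appears only once in playerinfo (later rows with the same ID would overwrite earlier ones in the dictionary).

```python
playerinfo = [("ansonca01", 4, 1871, 1, "RC1"),
              ("forceda01", 44, 1871, 1, "WS3"),
              ("mathebo01", 68, 1871, 1, "FW1")]

idmatch = [("ansonca01", "Anson", "Cap", "05/06/1871"),
           ("aaroh101", "Aaron", "Hank", "04/13/1954"),
           ("aarot101", "Aaron", "Tommie", "04/10/1962")]

# Key each player record by its ID (assumption: IDs are unique in playerinfo).
result = {rec[0]: list(rec) for rec in playerinfo}

# Extend matching records and remember which IDs actually matched.
matched_ids = set()
for rec in idmatch:
    if rec[0] in result:
        result[rec[0]].extend(rec[1:])
        matched_ids.add(rec[0])

# Keep only the records that received name fields.
matched_only = [tuple(fields) for key, fields in result.items() if key in matched_ids]
print(matched_only)
```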
Vivek Sable