
My question is related to the one at this link here.

The explanation in the above link is excellent, but my situation is a little different.

user     meetings
178787    287750
178787    151515
178787    158478
576585    896352
576585    985639
576585    456988

The expected result is:

user       meetings
178787   "[287750,151515,158478]"
576585   "[896352,985639,456988]"

How can I do this in Python using the above code? Thanks in advance.

Arun Oid

4 Answers

You could read in the file line by line, split each line, and add the meeting to a dictionary where the key is the user. This can be done very neatly with dict.setdefault, using the method seen here.

We can then write this dictionary back to the same file using tabs to make everything line up.

So, assuming your file is called f.csv, the code would look something like:

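d = {}
with open('f.csv') as f:
    next(f)  # skip the header line
    for l in f:
        if not l.strip():
            continue  # skip blank lines
        u, m = l.split()  # split on any whitespace
        d.setdefault(u, []).append(m)

# overwrite the same file with one tab-separated row per user
with open('f.csv', 'w') as f:
    f.write('user\tmeetings\n')
    for u, m in d.items():
        f.write(u + '\t' + str(m) + '\n')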

This produces the grouped output below (note that the meetings are written as a Python list of strings, so each ID is quoted):

user    meetings
178787  ['287750', '151515', '158478']
576585  ['896352', '985639', '456988']
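
If the exact format from the question is needed (IDs without quotes, the whole list wrapped in double quotes), the same dictionary approach can be adapted. The following is only a sketch: `group_meetings` and `format_row` are made-up helper names, the input is assumed to be whitespace-separated with a single header line, and an in-memory `io.StringIO` stands in for the real file.

```python
import io


def group_meetings(lines):
    """Group meeting IDs by user, keeping users in first-seen order."""
    grouped = {}
    rows = iter(lines)
    next(rows, None)  # skip the header line ("user meetings")
    for line in rows:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        user, meeting = parts
        grouped.setdefault(user, []).append(meeting)
    return grouped


def format_row(user, meetings):
    """Render one output row as: user<TAB>"[m1,m2,m3]"."""
    return '{}\t"[{}]"'.format(user, ','.join(meetings))


# Stand-in for the real file; in practice pass an open file object instead.
sample = io.StringIO(
    "user     meetings\n"
    "178787    287750\n"
    "178787    151515\n"
    "178787    158478\n"
    "576585    896352\n"
    "576585    985639\n"
    "576585    456988\n"
)

grouped = group_meetings(sample)
output_lines = ['user\tmeetings'] + [format_row(u, m) for u, m in grouped.items()]
print('\n'.join(output_lines))
```

Joining the IDs with ',' instead of calling str() on the list is what removes the quotes and spaces around each meeting.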
Joe Iddon
from collections import defaultdict
import csv

inpath = ''  # Path to input CSV file
outpath = ''  # Path to output CSV file

output = defaultdict(list)  # Dictionary like {user_id: [meetings]}

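with open(inpath, newline='') as f:
    # DictReader's default delimiter is a comma, so this assumes
    # a comma-separated file whose header row is 'user,meetings'
    for row in csv.DictReader(f):
        output[row['user']].append(row['meetings'])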

with open(outpath, 'w') as f:
    for user, meetings in output.items():
        row = user + ',' + str(meetings) + '\n'
        f.write(row)

Since user is going to be the key, let's fill a dictionary. Note: this will ultimately load the entire file into memory, but it does not require the file to be sorted by user first. Also note that the output is not sorted either: dict.items() returns items in insertion order on Python 3.7+, and in no guaranteed order on earlier versions.

output = {}
with open('input.csv') as f:
    for line in f:
        # we strip newlines before splitting on whitespace
        fields = line.strip('\r\n').split()
        if len(fields) != 2 or fields[0] == 'user':
            continue # skip blank lines and the header

        user, meeting = fields
        if user not in output:
            # the user was not found in the dict
            output[user] = [meeting] # add the user, with the first meeting
        else: # user already exists in dict
            output[user].append(meeting) # add meeting to user entry

# print output header
print("user meetings") # I used a single space, feel free to use '\t' etc.
# let's retrieve all meetings per user
for user, meetings in output.items(): # in python2, use .iteritems() instead
    meetings = ','.join(meetings) # format ["1","2","3"] to "1,2,3"
    print('{} "[{}]"'.format(user, meetings))

Fancier: sort output. I do this by sorting the keys first. Note that this will use even more memory since I am creating a list of the keys too.

# same as before
output = {}
with open('input.csv') as f:
    for line in f:
        # we strip newlines before splitting on whitespace
        fields = line.strip('\r\n').split()
        if len(fields) != 2 or fields[0] == 'user':
            continue # skip blank lines and the header

        user, meeting = fields
        if user not in output:
            # the user was not found in the dict
            output[user] = [meeting] # add the user, with the first meeting
        else: # user already exists in dict
            output[user].append(meeting) # add meeting to user entry

# print output header
print("user meetings") # I used a single space, feel free to use '\t' etc.

# sort my dict keys before printing them:
for user in sorted(output.keys()):
    meetings = ','.join(output[user])
    print('{} "[{}]"'.format(user, meetings))
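
The memory concern mentioned above can be avoided when the input is already grouped by user, as it happens to be in the question's sample. A rough sketch using `itertools.groupby`, which holds only one user's meetings at a time; `iter_grouped` is a hypothetical name, an in-memory `io.StringIO` stands in for the real file, and the approach would split a user into several groups if that user's rows were not adjacent.

```python
import io
from itertools import groupby


def iter_grouped(lines):
    """Yield (user, [meetings]) pairs from whitespace-separated lines.

    Assumes the first line is a header and that all rows for a user
    are adjacent; groupby only merges consecutive equal keys.
    """
    rows = iter(lines)
    next(rows, None)  # skip the header line
    pairs = (line.split() for line in rows if line.strip())
    for user, group in groupby(pairs, key=lambda pair: pair[0]):
        yield user, [meeting for _, meeting in group]


# Stand-in for the real file; in practice pass an open file object instead.
sample = io.StringIO(
    "user     meetings\n"
    "178787    287750\n"
    "178787    151515\n"
    "178787    158478\n"
    "576585    896352\n"
    "576585    985639\n"
    "576585    456988\n"
)

results = list(iter_grouped(sample))
for user, meetings in results:
    print('{} "[{}]"'.format(user, ','.join(meetings)))
```

If the rows are not guaranteed to be adjacent, the dictionary version is the safer choice.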
cowbert

Pandas groupby provides a nice solution:

import pandas as pd

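# read_csv has no 'columns' argument; usecols selects columns by header name.
# sep=r'\s+' assumes whitespace-separated columns, as in the question's sample.
df = pd.read_csv('myfile.csv', sep=r'\s+', usecols=['user', 'meetings'])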
df_grouped = df.groupby('user')['meetings'].apply(list).astype(str).reset_index()
jpp