
I want to group users by their IPs (or something else). Given an IP(data)/user_id table such as the example below, start with the users of the first data point (data = 1, users = (a,b,c)). Then gather the other data values used by those users (users = (a,b,c), used_data = (2,4,5)). This continues until all users and data linked in this way have been discovered.

Example data (CSV, I substituted IP with random data to make it easier to read)

data,user_id
1,a
1,b
1,c
2,a
2,e
3,d
3,h
4,a
5,b
5,f
5,g
6,h
6,i

In short, I want to gather users who share at least one data value.

Expected output in CSV,

group,data,user_id
1,[1,2,4,5],[a,b,c,e,f,g]
2,[3,6],[d,h,i]
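The grouping described above can be sketched with a union-find (disjoint-set) structure: every (data, user) pair links its data value and its user into one component, and the components are exactly the groups. This is a minimal illustrative sketch; the names (`rows`, `find`, `union`) are not from the question.

```python
from collections import defaultdict

# The example data from the question as (data, user) pairs.
rows = [
    (1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'e'),
    (3, 'd'), (3, 'h'), (4, 'a'), (5, 'b'), (5, 'f'),
    (5, 'g'), (6, 'h'), (6, 'i'),
]

parent = {}

def find(x):
    # Follow parent links to the root, compressing the path as we go.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Link each data value with each of its users. The ('data', ...) /
# ('user', ...) prefixes keep the two namespaces from colliding.
for data, user in rows:
    union(('data', data), ('user', user))

# Collect the members of each component.
groups = defaultdict(lambda: (set(), set()))
for data, user in rows:
    root = find(('data', data))
    groups[root][0].add(data)
    groups[root][1].add(user)

for datas, users in groups.values():
    print(sorted(datas), sorted(users))
```

This prints the two groups from the expected output: `[1, 2, 4, 5] ['a', 'b', 'c', 'e', 'f', 'g']` and `[3, 6] ['d', 'h', 'i']`.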
  • I don't understand how exactly you are getting the output; can you please elaborate? – Sundeep Pidugu Aug 20 '19 at 07:00
  • I meant this: 1. find users using a specific IP; 2. find IPs used by the users in 1; 3. find users using the IPs in 2; 4. repeat. Finally I want to get groups of [(ips), (users)] – Youngseok Park Aug 21 '19 at 01:43

2 Answers


I think it will be easier to do this outside of pandas, with a little helper function. I've pasted your CSV into a file to make it clear what the data variable contains. I've manually skipped the header with indexing; it's up to you to automate this and remove the indices, as well as to format the final result as CSV if needed. This solution is not explicitly recursive, but it's not (very) wasteful either.

with open('test.csv') as infile:
    data = [line.strip().split(',') for line in infile]

def extract_group(data):
    # Start with one entry
    ip, user = data[0]
    ips = {ip}
    users = {user}
    ungrouped = data[1:]
    old_ungrouped_count = 0
    # Go through the remaining data until no related entries are found
    while len(ungrouped) != old_ungrouped_count:
        old_ungrouped_count = len(ungrouped)
        for ip, user in ungrouped[:]:
            if ip in ips or user in users:
                ips.add(ip)
                users.add(user)
                ungrouped.remove([ip, user])
    return (users, ips, ungrouped)

users, ips, ungrouped = extract_group(data[1:])
groups = [(sorted(ips), sorted(users))]
while ungrouped:
    users, ips, ungrouped = extract_group(ungrouped)
    groups.append((sorted(ips), sorted(users)))

for group in groups:
    print(group)

(['1', '2', '4', '5'], ['a', 'b', 'c', 'e', 'f', 'g'])
(['3', '6'], ['d', 'h', 'i'])
Ivan Popov

You can use the pandas.DataFrame.groupby method; you can find the documentation here. You can also do it with an SQL query, using group_concat() or a substitute for it. For further reading, look at this post.

Code

from io import StringIO
import pandas as pd

raw_data = """data,user_id
1,a
1,b
1,c
2,a
2,e
3,d
3,h
4,a
5,b
5,f
5,g
6,h
6,i"""


df = pd.read_csv(StringIO(raw_data), sep=',')
grouped_df = df.groupby(['data']).apply(lambda x: ','.join(x.user_id))

Output

data
1    a,b,c
2      a,e
3      d,h
4        a
5    b,f,g
6      h,i
dtype: object
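Note that this yields one user list per data value, not yet the connected groups from the question. As a follow-up sketch (not part of the original answer), the per-data user sets can be merged whenever they share at least one user:

```python
from io import StringIO
import pandas as pd

raw_data = """data,user_id
1,a
1,b
1,c
2,a
2,e
3,d
3,h
4,a
5,b
5,f
5,g
6,h
6,i"""

df = pd.read_csv(StringIO(raw_data))
# One user set per data value, as in the groupby above.
per_data = df.groupby('data')['user_id'].apply(set)

# Start with one (data-set, user-set) pair per data value and
# repeatedly merge pairs whose user sets overlap.
groups = [({d}, users) for d, users in per_data.items()]
merged = True
while merged:
    merged = False
    out = []
    for datas, users in groups:
        for datas2, users2 in out:
            if users & users2:          # shared user -> same group
                datas2 |= datas
                users2 |= users
                merged = True
                break
        else:
            out.append((datas, users))
    groups = out

for datas, users in groups:
    print(sorted(datas), sorted(users))
```

On the example data this merges the six per-data rows into the two groups from the question's expected output.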
null