I have a DataFrame with two columns, IP and URL. I want to group the URLs by the IP column and get a list of lists containing the grouped items.

d3 = df2.groupby('IP')['URL'].apply(lambda x:','.join(x.dropna().unique())).reset_index()

dataset =[]
for index, rows in d3.iterrows():
     my_list =[rows.URL]
     dataset.append(my_list)

This gives me a list of lists that looks like this:

[['item1, item2, item3'],['item4'],['item5,item6']] and so on.

I would like to have it like this:

[['item1','item2','item3'],['item4'],['item5','item6']]

How can I achieve this?

cat1234
  • All you need to do is call `split(',')` on rows.URL: `my_list = rows.URL.split(',')` – sxddhxrthx Aug 23 '21 at 16:54
  • Hi thanks for replying! `d3.URL.str.split(',')` just returns the same row but within [] . Any idea why items are not within single quotes? – cat1234 Aug 23 '21 at 17:14

3 Answers

You can use .split(), like this:

d3 = df2.groupby('IP')['URL'].apply(lambda x:','.join(x.dropna().unique())).reset_index()

dataset = []
for index, rows in d3.iterrows():
    # rows.URL is one comma-joined string; split it and strip stray spaces
    grouped_items = [i.strip() for i in rows.URL.split(',')]
    dataset.append(grouped_items)
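
As a quick check of the split-and-strip step on its own, here is a minimal sketch that uses plain strings in place of rows.URL. The sample values and the name joined_rows are made up for illustration and are not from the question's data.

```python
# Hypothetical comma-joined strings, standing in for the values of d3['URL'].
joined_rows = ['item1, item2, item3', 'item4', 'item5,item6']

dataset = []
for joined in joined_rows:
    # Split on commas, then strip stray whitespace around each item.
    grouped_items = [item.strip() for item in joined.split(',')]
    dataset.append(grouped_items)

print(dataset)
# [['item1', 'item2', 'item3'], ['item4'], ['item5', 'item6']]
```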
Suneesh Jacob

Since your sub-lists only contain 1 element, you can change my_list = [rows.URL] to

my_list = rows.URL.split(',')

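If the items may have spaces around them, split with a regular expression that also consumes that whitespace. Use a raw string for the pattern, and note that a pattern such as \W+ would also split on the dots, colons and slashes inside real URLs:

import re

my_list = re.split(r'\s*,\s*', rows.URL.strip())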

Both ways return a list, so append it to your dataset as-is (extend would flatten everything into a single list):

dataset.append(my_list)
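
The choice between append and extend decides the shape of the result. The following small sketch uses made-up strings; the names joined_rows, nested and flat are illustrative only. It shows what each call produces:

```python
# Hypothetical comma-joined strings, one per IP group.
joined_rows = ['item1,item2,item3', 'item4', 'item5,item6']

nested = []  # built with append: one sub-list per row
flat = []    # built with extend: every item in a single list
for joined in joined_rows:
    my_list = joined.split(',')
    nested.append(my_list)
    flat.extend(my_list)

print(nested)  # [['item1', 'item2', 'item3'], ['item4'], ['item5', 'item6']]
print(flat)    # ['item1', 'item2', 'item3', 'item4', 'item5', 'item6']
```

Only append keeps the list-of-lists shape asked for in the question.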
AcaNg

To get the multiple items with individual strings, you should amend your first line of code to aggregate the multiple items into a list of strings instead of joining the items into a single string.

You got the items concatenated into a single string, e.g. 'item1, item2, item3' (one pair of quotes around the whole string rather than around each item), instead of distinct strings, e.g. 'item1','item2','item3', because ','.join(...) inside .apply() joins the individual strings into one string for each IP group.
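
The effect of the join can be seen without pandas. In this small sketch, the list items is a made-up stand-in for the unique URLs of one IP group:

```python
# Hypothetical unique URLs for a single IP group.
items = ['item1', 'item2', 'item3']

joined = ','.join(items)   # collapses the group into one string
as_joined = [joined]       # what wrapping the joined string in a list gives
as_list = list(items)      # what keeping the items separate gives

print(as_joined)  # ['item1,item2,item3']
print(as_list)    # ['item1', 'item2', 'item3']
```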

Amend your first line of code as follows:

d3 = df2.groupby('IP')['URL'].apply(lambda x: x.dropna().unique().tolist()).reset_index()

You can also replace the loop that extracts the lists of strings with a single line:

dataset = d3['URL'].tolist()

Demo

Input data:

import numpy as np
import pandas as pd

data = {'IP': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
 'URL': ['item1', 'item2', 'item3', 'item3', 'item4', np.nan, 'item4', 'item5', 'item5', np.nan, 'item6']}
df2 = pd.DataFrame(data)

print(df2)

    IP    URL
0    1  item1
1    1  item2
2    1  item3
3    1  item3
4    2  item4
5    2    NaN
6    2  item4
7    3  item5
8    3  item5
9    3    NaN
10   3  item6

Output

print(dataset)

[['item1', 'item2', 'item3'], ['item4'], ['item5', 'item6']]
SeaBean