I'm trying to group a DataFrame that consists of a DocID column and a string column, using this SO answer as a guide. Instead of a DataFrame with one row per DocID and all the string values joined by a space, I end up with a column containing the column names.

Can someone point out my error?

Sample Data

StringDF.head()

    DocID                                   LessStopWords
0   dd9ae7c8-7e98-4539-ab81-24c4780a6756    judgment of the court chamber 
1   dd9ae7c8-7e98-4539-ab81-24c4780a6756    the request proceedings
2   dd9ae7c8-7e98-4539-ab81-24c4780a6756    legal context law
3   dd9ae7c8-7e98-4539-ab81-24c4780a6756    article 1 directive
4   dd9ae7c8-7e98-4539-ab81-24c4780a6756    the status taken

My Code

DocsForTopicModel = StringDF.groupby(['DocID'], as_index=False).agg(lambda x: ' '.join(x))

My Output

     DocID                                  LessStopWords
 0  010b158d-8c0b-49ad-9340-774893e4f62f    DocID LessStopWords
 1  02874037-416d-4b91-8e2d-1a288b8c3a7b    DocID LessStopWords
 2  05b9ea7b-b5f0-4757-854c-b303a295f606    DocID LessStopWords
 3  06f87756-4dbe-4199-a8e2-b504451e823a    DocID LessStopWords
 4  070bd4d1-6830-447e-9042-12c6def18822    DocID LessStopWords

My Hoped For Output

     DocID                                      LessStopWords
     0  010b158d-8c0b-49ad-9340-774893e4f62f    judgment of the court chamber the request proceedings legal context law article 1 directive
     1  02874037-416d-4b91-8e2d-1a288b8c3a7b    ...
Brad Solomon
mobcdi
  • Your code seems to be working fine; so would `df.groupby(['DocID'],as_index=False).LessStopWords.apply(' '.join)` – Vaishali Oct 20 '18 at 17:07
  • I get a type error: `TypeError: sequence item 159: expected str instance, float found` – mobcdi Oct 20 '18 at 17:21
  • @Vaishali Yep, but with a blank space. Something like this: `df.groupby('DocID')['LessStopWords'].apply(' '.join).to_frame('LessStopWords').reset_index()` – Anton vBR Oct 20 '18 at 17:28
  • @AntonvBR, yes, the code is with a space; maybe it's not very visible in the comment – Vaishali Oct 20 '18 at 17:29
  • @Vaishali You can enclose your code with **back-tick (` `)** – Anton vBR Oct 20 '18 at 17:29
  • @mobcdi, maybe you have a NaN or some other float value at item 159. You can try `df.groupby(['DocID'],as_index=False).LessStopWords.apply(lambda x: ' '.join(x.astype(str)))` – Vaishali Oct 20 '18 at 17:32

1 Answer

You can also use `.str.cat(sep=' ')` to concatenate the strings within each group:

>>> df.groupby('DocID')['LessStopWords'].apply(lambda ser: ser.str.cat(sep=' '))
DocID
dd9ae7c8-7e98-4539-ab81-24c4780a6756    judgment of the court chamber the request proc...
Name: LessStopWords, dtype: object

More examples are in the pandas user guide section Working with Text Data.
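The `TypeError` reported in the comments ("expected str instance, float found") suggests that some `LessStopWords` values are missing, since NaN is a float. A minimal sketch of one way to handle that, assuming the missing rows can simply be dropped before joining (the sample frame and the `joined` name below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN stands in for a missing LessStopWords value
df = pd.DataFrame({
    "DocID": ["a", "a", "a", "b", "b"],
    "LessStopWords": [
        "judgment of the court", np.nan, "legal context law",
        "article 1 directive", "the status taken",
    ],
})

joined = (
    df.dropna(subset=["LessStopWords"])   # drop rows that have no text
      .groupby("DocID")["LessStopWords"]
      .agg(" ".join)                      # one space-separated string per DocID
      .reset_index()                      # turn DocID back into a regular column
)
print(joined)
```

If the missing rows should be kept as text instead, the `x.astype(str)` approach from the comments is an alternative, though it would put the literal string `nan` into the joined result.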


Larger example:

>>> import string
>>> import uuid
>>> 
>>> import numpy as np
>>> import pandas as pd
>>> 
>>> uids = np.random.choice([uuid.uuid4() for _ in range(3)], size=10)
>>> words = np.random.choice(list(string.ascii_letters), size=10)
>>> 
>>> df = pd.DataFrame({'DocID': uids, 'LessStopWords': words})
>>> df
                                  DocID LessStopWords
0  8ec3faf7-a771-4e50-87d7-127a69d4d738             p
1  0befc0aa-9311-4154-bced-00a280c99cdd             q
2  8ec3faf7-a771-4e50-87d7-127a69d4d738             t
3  de1021d3-ce47-4f56-8e4d-47d389473dd6             j
4  0befc0aa-9311-4154-bced-00a280c99cdd             L
5  8ec3faf7-a771-4e50-87d7-127a69d4d738             t
6  de1021d3-ce47-4f56-8e4d-47d389473dd6             g
7  0befc0aa-9311-4154-bced-00a280c99cdd             D
8  0befc0aa-9311-4154-bced-00a280c99cdd             d
9  8ec3faf7-a771-4e50-87d7-127a69d4d738             J
>>> df.groupby('DocID')['LessStopWords'].apply(lambda ser: ser.str.cat(sep=' '))
DocID
0befc0aa-9311-4154-bced-00a280c99cdd    q L D d
8ec3faf7-a771-4e50-87d7-127a69d4d738    p t t J
de1021d3-ce47-4f56-8e4d-47d389473dd6        j g
Name: LessStopWords, dtype: object
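
As for why the output in the question shows `DocID LessStopWords` on every row, one possible explanation (an assumption, since it depends on the pandas version and the data) is that the lambda ended up receiving each group as a whole DataFrame rather than a single column. Iterating over a DataFrame yields its column labels, so `' '.join(x)` would then join the labels. A small sketch with made-up data shows the difference:

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as in the question
df = pd.DataFrame({
    "DocID": ["a", "a"],
    "LessStopWords": ["judgment of the court", "the request proceedings"],
})

# Iterating over a DataFrame yields its column labels, not its rows
from_frame = " ".join(df)

# Iterating over a single column (a Series) yields the cell values
from_series = " ".join(df["LessStopWords"])

print(from_frame)   # DocID LessStopWords
print(from_series)  # judgment of the court the request proceedings
```

Selecting the column explicitly, as in `df.groupby('DocID')['LessStopWords']`, avoids this because the function then always receives a Series.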
Brad Solomon