Data Preprocessing in Python using Pandas

Question

I am trying to preprocess one of the columns in my DataFrame. The issue is that the values in the relations column look like [[content1], [content2], [content3]]. I want to remove the brackets.

I have tried the following:

df['value'] = df['value'].str[0]

The output I get is [content 1]

print(df)

id     value                 
1      [[str1],[str2],[str3]]        
2      [[str4],[str5]]       
3      [[str1]]        
4      [[str8]]       
5      [[str9]]      
6      [[str4]]

The expected output should look like this:

id     value                 
1      str1,str2,str3        
2      str4,str5       
3      str1        
4      str8       
5      str9      
6      str4

Is the column all strings? Maybe `df['value'] = df['value'].str.replace(r'\[|\]', '', regex=True)` — MDR, Aug 07 '21 at 19:11
It would be interesting to know the type of the values and the expected output — mozway, Aug 07 '21 at 19:21
@xsrg45 [don't use images for data](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question). — MDR, Aug 07 '21 at 19:27
Try `df['relations'].apply(lambda x: ", ".join(i[0] for i in x))` — coffeinjunky, Aug 07 '21 at 19:47
@xsrg45, for future questions, please have a look at https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples to see how you can produce reproducible questions, including the structure of your data. — coffeinjunky, Aug 07 '21 at 19:53

mozway · Answer 1 · 2021-08-07T19:53:53.007

1

It looks like you have lists of lists. You can try to unnest and join:

df['value'] = df['value'].apply(lambda x: ','.join([e for l in x for e in l]))

Or:

from itertools import chain
df['value'] = df['value'].apply(lambda x: ','.join(chain.from_iterable(x)))

NB. If you get an error, please provide it and the type of the column (df.dtypes)
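As a minimal sketch of the approach above, assuming each cell really holds a Python list of lists (the sample frame below is hypothetical, built to mirror the question): `df.dtypes` reports `object` for both strings and lists, so it may be more informative to look at the type of the cells themselves before flattening.

```python
import pandas as pd
from itertools import chain

# Hypothetical sample: each cell is a real Python list of lists, not a string.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": [[["str1"], ["str2"], ["str3"]], [["str4"], ["str5"]], [["str1"]]],
})

# The column dtype is "object" either way, so inspect the cell types directly.
cell_types = df["value"].map(type).unique().tolist()
print(cell_types)

# Flatten one level of nesting and join the inner strings with commas.
df["value"] = df["value"].apply(lambda x: ",".join(chain.from_iterable(x)))
print(df)
```

If the cell type turns out to be `str` instead of `list`, the same flattening would iterate over single characters, which is what produces output like `[,[,s,t,r,1,...`.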

edited Aug 07 '21 at 19:53

answered Aug 07 '21 at 19:45

mozway


mozway, your first solution will yield something like ` [,[,s,t,r,1,],,,[,s,t,r,2,],,,[,s,t,r,3,],]` – Karn Kumar Aug 07 '21 at 19:55
@KarnKumar My answer assumes that the OP has lists of lists, not a string representation of lists (thus the AttributeError). That said, I could not test my code; I just wrote it down directly. Have you tested it? – mozway Aug 07 '21 at 19:58
Yes, based on the sample DataFrame provided by the OP, I've tested it and that's the output it generated. You can see that I reproduced it in my answer, i.e. list of lists only. – Karn Kumar Aug 07 '21 at 20:00
OK, I had not seen this edit. Well, things are ambiguous then. Let's wait and see. If this really is a string representation, your answer should work ;) – mozway Aug 07 '21 at 20:03

Karn Kumar · Answer 2 · 2021-08-07T20:14:31.573

0

As far as I can see, your data looks like this sample:

Sample Data:

df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':['[[str1],[str2],[str3]]', '[[str4],[str5]]', '[[str1]]',  '[[str8]]', '[[str9]]', '[[str4]]']})
print(df)
   id                   value
0   1  [[str1],[str2],[str3]]
1   2         [[str4],[str5]]
2   3                [[str1]]
3   4                [[str8]]
4   5                [[str9]]
5   6                [[str4]]

Result:

df['value'] = df['value'].astype(str).str.replace('[', '', regex=False).str.replace(']', '', regex=False)
print(df)
   id           value
0   1  str1,str2,str3
1   2       str4,str5
2   3            str1
3   4            str8
4   5            str9
5   6            str4

Note: the error AttributeError: Can only use .str accessor with string values means the column is not being treated as str. You can cast it to str with astype(str) first and then do the replace operation.
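Since the thread leaves open whether the cells are strings or nested lists, here is a hedged sketch that handles both. The helper name `strip_brackets` and the two sample Series are hypothetical, introduced only for illustration. One caveat worth knowing: casting a real list to `str` gives text like `[['str1'], ['str2']]`, so stripping brackets afterwards would leave quotes and spaces behind, which is why the list case is joined directly here.

```python
import pandas as pd

def strip_brackets(cell):
    """Return a comma-separated string for either kind of cell."""
    if isinstance(cell, str):
        # String representation such as '[[str1],[str2]]': drop the bracket characters.
        return cell.replace("[", "").replace("]", "")
    # Otherwise assume a list of lists of strings: flatten one level and join.
    return ",".join(item for inner in cell for item in inner)

# Hypothetical samples covering both interpretations of the question.
as_strings = pd.Series(["[[str1],[str2],[str3]]", "[[str4]]"])
as_lists = pd.Series([[["str1"], ["str2"], ["str3"]], [["str4"]]])

print(as_strings.apply(strip_brackets).tolist())
print(as_lists.apply(strip_brackets).tolist())
```

Both calls should give the same comma-separated values, so the helper can be applied with `df['value'].apply(strip_brackets)` without first knowing which form the column takes.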

edited Aug 07 '21 at 20:14

answered Aug 07 '21 at 19:36

Karn Kumar


Now I get this error for both suggestions: AttributeError: Can only use .str accessor with string values! – xsrg45 Aug 07 '21 at 19:38
@xsrg45, please try , `df['value'].str.replace('[', '').astype(str).str.replace(']', '')` , i have edited my answer. – Karn Kumar Aug 07 '21 at 19:44
I bet that the brackets aren't part of the strings, but simply reflect that the cell contains a list of lists. – coffeinjunky Aug 07 '21 at 19:49
@coffeinjunky, what I gather from the error is that it's not treating the column as `str` during `replace`, so we can cast it to `str` first with `astype(str)` and then do the rest of the operation. That should do the job, and it works on the sample provided. – Karn Kumar Aug 07 '21 at 19:52

score 0 · Answer 3 · edited Aug 08 '21 at 09:24

You can use Python's built-in regex module, re. Here is a solution.

import pandas as pd
import re

Make the test data:

    data = [
        [1, '[[str1],[str2],[str3]]'], 
        [2, '[[str4],[str5]]'], 
        [3, '[[str1]]'], 
        [4, '[[str8]]'], 
        [5, '[[str9]]'], 
        [6, '[[str4]]']
    ]

Convert the data to a DataFrame:

    df = pd.DataFrame(data, columns = ['id', 'value'])
    print(df)


Remove '[' and ']' from the 'value' column:

    # Raw string avoids invalid-escape warnings for \[ and \] in newer Python versions.
    df['value'] = df['value'].apply(lambda s: re.sub(r"[\[\]]", "", s))
    print(df)
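For comparison, a minimal sketch of the same character-class pattern applied through pandas' vectorized string accessor instead of `re.sub` inside `apply`. It assumes the column holds strings, as in the test data above; the two-row frame is a shortened, hypothetical copy of that data. `regex=True` is passed explicitly because the default for this parameter differs between pandas versions.

```python
import pandas as pd

# Shortened copy of the test data, assuming string cells.
df = pd.DataFrame({"id": [1, 2], "value": ["[[str1],[str2],[str3]]", "[[str4],[str5]]"]})

# The character class [\[\]] matches either an opening or a closing bracket.
df["value"] = df["value"].str.replace(r"[\[\]]", "", regex=True)
print(df["value"].tolist())
```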

