I have a Jupyter notebook containing a pandas DataFrame with a column PAR (dtype = object).

+------+------------------+
|      | PAR              |
+------+------------------+
| 0    | [[1.2.3, 2.3.4]] |
+------+------------------+
| 1    | [[3.2, 3.2]]     |
+------+------------------+

I do not understand how to 'clean' the [[list]] in each row into something like [list].

I can print row contents:

print(df['PAR'][1])
print(', '.join(df['PAR'][1][0]))

This outputs:

[['3.2', '3.2']] 
3.2, 3.2

I can also 'strip' each cell into a string:

# df['PAR'] = df['PAR'].astype(str)
df['PAR'].replace(r'\[','', regex=True, inplace=True) 
df['PAR'].replace(r'\]','', regex=True, inplace=True) 
df['PAR'].replace(r'\'','', regex=True, inplace=True)

This gives a clean-ish string, although this is not the format that I need:

3.2, 3.2

But, what I'm looking for is a 1-level list in each row of my df, something like this:

+------+------------------+------------------+
|      | PAR              | PAR list         |
+------+------------------+------------------+
| 0    | [[1.2.3, 2.3.4]] | [1.2.3, 2.3.4]   |
+------+------------------+------------------+
| 1    | [[3.2, 3.2]]     | [3.2, 3.2]       |
+------+------------------+------------------+

(the spaces between comma and nth element are just for a better reading of the table above).

What would be a common approach to do this?

My next step is converting each new list into a list with only unique elements, following this thread: Get unique values from a list in python

mylist = ['nowplaying', 'PBS', 'PBS', 'nowplaying', 'job', 'debate', 'thenandnow']
myset = set(mylist)
mynewlist = list(myset)

So I'd appreciate some help to 'unlist' the lists in each row. A solution with a lambda function (.map or .join?) would be easy for me to handle.

pljvp

2 Answers

Input data:

>>> df
                PAR
0  [[1.2.3, 2.3.4]]
1      [[3.2, 3.2]]

Unlist* and remove duplicates in one step:

import numpy as np
df["PAR"] = df["PAR"].str[0].apply(np.unique)

Output data:

>>> df
              PAR
0  [1.2.3, 2.3.4]
1           [3.2]

* Corrected with help from @SeaBean
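
Note that np.unique returns a sorted NumPy array rather than a plain list. If ordinary Python lists in their original order are preferred, a minimal sketch could look like the following. It assumes every cell wraps exactly one inner list, and the column name `PAR unique` is only illustrative, not something from the question.

```python
import pandas as pd

# Sample data mirroring the question: each cell is a list wrapping one inner list.
df = pd.DataFrame({"PAR": [[["1.2.3", "2.3.4"]], [["3.2", "3.2"]]]})

# .str[0] picks the inner list; dict.fromkeys drops duplicates while keeping
# first-seen order, and list() turns the keys back into a plain list.
df["PAR unique"] = df["PAR"].str[0].apply(lambda xs: list(dict.fromkeys(xs)))

print(df["PAR unique"].tolist())
```

Compared with set(), dict.fromkeys keeps the elements in the order they first appear, which may or may not matter for the use case.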

Corralien
  • It is the most informative & complete answer for me to learn. Thank you. And thank you @SeaBean & Corralien for the heads up! – pljvp May 14 '21 at 21:17

You can simply use .str[0] to access the first and only element of the outer list, effectively removing one level of list, as follows:

df['PAR list'] = df['PAR'].str[0]

Test data preparation:

data = {'PAR': [
[['1.2.3', '2.3.4']],
[['3.2', '3.2']]]
}
df = pd.DataFrame(data)

print(df)

                PAR
0  [[1.2.3, 2.3.4]]
1      [[3.2, 3.2]]

Run new code:

df['PAR list'] = df['PAR'].str[0]

Result:

print(df)

                PAR        PAR list
0  [[1.2.3, 2.3.4]]  [1.2.3, 2.3.4]
1      [[3.2, 3.2]]      [3.2, 3.2]
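
Since the question asks for a lambda-based variant, here is a rough sketch of the same idea with .map. The first column is equivalent to .str[0] when every cell wraps exactly one inner list. The second column, `PAR flat`, is a hypothetical generalisation (not part of the question) that flattens one level even if a cell were to hold several inner lists.

```python
from itertools import chain

import pandas as pd

# Same test data as above.
df = pd.DataFrame({"PAR": [[["1.2.3", "2.3.4"]], [["3.2", "3.2"]]]})

# Take the first (and only) inner list of each cell.
df["PAR list"] = df["PAR"].map(lambda outer: outer[0])

# Assumption for illustration: concatenate all inner lists of a cell into one list.
df["PAR flat"] = df["PAR"].map(lambda outer: list(chain.from_iterable(outer)))

print(df["PAR list"].tolist())
print(df["PAR flat"].tolist())
```

For the data shown here both columns hold the same lists; they would only differ if a cell contained more than one inner list.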
SeaBean
  • I think `PAR` is not list but a string. – Corralien May 14 '21 at 20:29
  • @Corralien I just tested it: if `PAR` were a string, `print(', '.join(df['PAR'][1][0]))` would not give `3.2, 3.2`. I think the OP could use `.replace()` only because he/she applied `.astype(str)` first. Seeing `dtype = obj`, my first impression was also that the column might be of string type, but columns of lists are also reported as `dtype=object` in `df.info()`. – SeaBean May 14 '21 at 20:40
  • You are probably right! I'll fix my answer. – Corralien May 14 '21 at 20:44
  • @Corralien Glad to have friendly discussion here. We are just answering for leisure :-) – SeaBean May 14 '21 at 20:47
  • Thanks again for your help. +1 – Corralien May 14 '21 at 21:12
  • @Corralien My pleasure! :-) – SeaBean May 14 '21 at 21:22