9

Here's some data from another question:

                          positive                 negative          neutral
1   [marvel, moral, bold, destiny]                       []   [view, should]
2                      [beautiful]      [complicated, need]               []
3                      [celebrate]   [crippling, addiction]            [big]

What I would do first is to add quotes across all words, and then:

import ast

df = pd.read_clipboard(sep='\s{2,}')
df = df.applymap(ast.literal_eval)

Is there a smarter way to do this?

Alexander
  • 105,104
  • 32
  • 201
  • 196
cs95
  • 379,657
  • 97
  • 704
  • 746

5 Answers5

12

Lists of strings

For basic structures you can use yaml without having to add quotes:

import yaml
df = pd.read_clipboard(sep='\s{2,}').applymap(yaml.load)

type(df.iloc[0, 0])
Out: list

Lists of numeric data

Under certain conditions, you can read your lists as strings and the convert them using literal_eval (or pd.eval, if they are simple lists).

For example,

           A   B
0  [1, 2, 3]  11
1  [4, 5, 6]  12

First, ensure there are at least two spaces between the columns, then copy your data and run the following:

import ast 

df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
df['A'] = df['A'].map(ast.literal_eval)    
df
    
           A   B
0  [1, 2, 3]  11
1  [4, 5, 6]  12

df.dtypes

A    object
B     int64
dtype: object

Notes

  • for multiple columns, use applymap in the conversion step:

    df[['A', 'B', ...]] = df[['A', 'B', ...]].applymap(ast.literal_eval)
    
  • if your columns can contain NaNs, define a function that can handle them appropriately:

    parser = lambda x: x if pd.isna(x) else ast.literal_eval(x)
    df[['A', 'B', ...]] = df[['A', 'B', ...]].applymap(parser)
    
  • if your columns contain lists of strings, you will need something like yaml.load (requires installation) to parse them instead if you don't want to manually add quotes to the data. See above.

cs95
  • 379,657
  • 97
  • 704
  • 746
ayhan
  • 70,170
  • 20
  • 182
  • 203
  • 2
    I think the import is a small price to pay for having a solution this bullet-proof. As long as it's builtin, it's fine. All the other answers were wonderful, but I accept this one in particular because of its simplicity. – cs95 Aug 22 '17 at 22:03
  • Correction to my previous comment: yaml is not built-in, the library needs to be pip installed. Apologies for the misinfo! – cs95 Jan 18 '21 at 09:50
8

I did it this way:

df = pd.read_clipboard(sep='\s{2,}', engine='python')
df = df.apply(lambda x: x.str.replace(r'[\[\]]*', '').str.split(',\s*', expand=False))

PS i'm sure - there must be a better way to do that...

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • 4
    Better than this way: `df = pd.DataFrame({'positive': [['marvel', 'moral', 'bold', 'destiny'], ['beutiful'], ['celebrate']], 'negative': [[], ['complicated', 'need'], ['crippling', 'addiction']], 'neutral': [['view', 'should'], [], ['big']]})`... – Alexander Aug 22 '17 at 21:44
  • 3
    For small snippets, `df.applymap(lambda x: x[1:-1].split(', '))` might also work (but does not handle spaces as robustly as MaxU's answer.) – unutbu Aug 22 '17 at 21:47
  • @unutbu, it looks good - would you turn your comment into an answer? – MaxU - stand with Ukraine Aug 22 '17 at 21:50
6

Another alternative is

In [43]:  df.applymap(lambda x: x[1:-1].split(', '))
Out[43]: 
                         positive                negative         neutral
1  [marvel, moral, bold, destiny]                      []  [view, should]
2                     [beautiful]     [complicated, need]              []
3                     [celebrate]  [crippling, addiction]           [big]

Note that this assumes the first and last character in each cell is [ and ]. It also assumes there is exactly one space after the commas.

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
5

Another version:

df.applymap(lambda x:
            ast.literal_eval("[" + re.sub(r"[[\]]", "'", 
                                          re.sub("[,\s]+", "','", x)) + "]"))
DYZ
  • 55,249
  • 10
  • 64
  • 93
5

Per help from @MaxU

df = pd.read_clipboard(sep='\s{2,}', engine='python')

Then:

>>> df.apply(lambda col: col.str[1:-1].str.split(', '))
                         positive                negative         neutral
1  [marvel, moral, bold, destiny]                      []  [view, should]
2                     [beautiful]     [complicated, need]              []
3                     [celebrate]  [crippling, addiction]           [big]

>>> df.apply(lambda col: col.str[1:-1].str.split()).loc[3, 'negative']
['crippling', 'addiction']

And per the notes from @unutbu who came up with a similar solution:

assumes the first and last character in each cell is [ and ]. It also assumes there is exactly one space after the commas.

Alexander
  • 105,104
  • 32
  • 201
  • 196