
I have this 'file.csv' file to read with pandas:

Title|Tags
T1|"[Tag1,Tag2]"
T1|"[Tag1,Tag2,Tag3]"
T2|"[Tag3,Tag1]"

using

df = pd.read_csv('file.csv', sep='|')

the output is:

  Title              Tags
0    T1       [Tag1,Tag2]
1    T1  [Tag1,Tag2,Tag3]
2    T2       [Tag3,Tag1]

I know that the column Tags is a full string, since:

In [64]: df['Tags'][0][0]
Out[64]: '['

I need to read it as a list of strings like ["Tag1","Tag2"]. I tried the solution provided in this question but no luck there, since the [ and ] characters mess things up.

The expected output is:

In [64]: df['Tags'][0][0]
Out[64]: 'Tag1'
Fabio Lamanna
    I asked a question similar to this before, you can see the answers here: https://stackoverflow.com/questions/44529483/python-convert-string-list-to-list – Ahmed Aug 18 '17 at 14:15

6 Answers


You can split the string manually:

>>> df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))
>>> df.Tags[0]
['Tag1', 'Tag2']
Mike Müller
    Or apply it on load...`df = pd.read_csv('file.csv', sep='|', converters={'Tags': lambda x: x[1:-1].split(',')})` – Jon Clements Aug 18 '17 at 14:23
  • @JonClements, `converters={'Tags': lambda x: x[1:-1].split(',')}` just saved me so much headache. Thanks for this. – Brendan Apr 15 '18 at 21:05
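The on-load approach from the comments can be sketched as follows; io.StringIO stands in for the original file.csv, whose contents are assumed from the question:

```python
import io
import pandas as pd

# io.StringIO stands in for the original 'file.csv' (assumed contents)
csv = io.StringIO('Title|Tags\nT1|"[Tag1,Tag2]"\nT2|"[Tag3,Tag1]"\n')
# the converter runs on each raw field, so the lists are built during parsing
df = pd.read_csv(csv, sep='|',
                 converters={'Tags': lambda x: x[1:-1].split(',')})
print(df.Tags[0])  # ['Tag1', 'Tag2']
```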

You could use the built-in ast.literal_eval; it works for tuples as well as lists.

import ast
import pandas as pd

df = pd.DataFrame({"mytuples": ["(1,2,3)"]})

print(df.iloc[0,0])
# >> '(1,2,3)'

df["mytuples"] = df["mytuples"].apply(ast.literal_eval)

print(df.iloc[0,0])
# >> (1,2,3)

EDIT: eval should be avoided! If the string being evaluated is os.system('rm -rf /'), it will start deleting all the files on your computer (here). For ast.literal_eval, the string or node provided may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None (here). Thanks @TrentonMcKinney :)
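A quick sketch of that safety property: ast.literal_eval raises ValueError for anything that is not a plain literal, so a malicious string is rejected instead of executed.

```python
import ast

# a function call is not a literal, so literal_eval refuses to evaluate it
try:
    ast.literal_eval("os.system('echo unsafe')")
    result = "evaluated"
except ValueError:
    result = "rejected"
print(result)  # rejected
```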

rdmolony

I think you could use the json module.

import json
import pandas as pd

df = pd.read_csv('file.csv', sep='|')
df['Tags'] = df['Tags'].apply(lambda x: json.loads(x))

So this will load your dataframe as before, then apply a lambda function to each of the items in the Tags column. The lambda function calls json.loads() which converts the string representation of the list to an actual list.
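One caveat worth noting: json.loads only succeeds when the bracketed string is valid JSON, i.e. the tags themselves are double-quoted inside the brackets. A minimal sketch:

```python
import json

s = '["Tag1", "Tag2"]'  # tags must be double-quoted to be valid JSON
tags = json.loads(s)
print(tags[0])  # Tag1
```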

RHSmith159
    I think this is a better solution, less prone to errors! Also, note that you can pass `json.loads` directly as an `apply` parameter: `df['Tags'].apply(json.loads)` – Yevhen Kuzmovych Oct 02 '19 at 09:24

Or

df.Tags=df.Tags.str[1:-1].str.split(',').tolist()
BENY
    @WeNToBen - nice solution. Care to expand on it a little bit? why do we need `str[1:-1]`, why is it not `str[0:-1]`? (for me both yield the same result by the way). Also, if `split()` already creates a list, why do we explicitly call `tolist()`? – zerohedge Jun 23 '19 at 18:12
    @zerohedge cause you want to remove the "[" at the beginning and "]" at the end – BENY Jun 23 '19 at 18:13
    thanks. and why `tolist()` after `split()` (which itself creates a list, no?) – zerohedge Jun 23 '19 at 18:17
    @zerohedge ah , that one I need to remove ,you are right – BENY Jun 23 '19 at 18:17
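The chained .str calls above can be sketched on a toy frame (without the trailing tolist(), per the comment thread):

```python
import pandas as pd

df = pd.DataFrame({"Tags": ["[Tag1,Tag2]", "[Tag3,Tag1]"]})
# str[1:-1] drops the surrounding brackets; str.split turns each row into a list
df["Tags"] = df["Tags"].str[1:-1].str.split(",")
print(df["Tags"][0])  # ['Tag1', 'Tag2']
```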

You can convert the string to a list using strip and split.

df_out = df.assign(Tags=df.Tags.str.strip('[]').str.split(','))

df_out.Tags[0][0]

Output:

'Tag1'
Scott Boston
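A sketch of why str.strip('[]') is convenient here: it removes the brackets only when they are present, so rows without them pass through unchanged (the mixed sample below is an invented illustration):

```python
import pandas as pd

# invented sample: one bracketed row, one bare row
df = pd.DataFrame({"Tags": ["[Tag1,Tag2]", "Tag3,Tag1"]})
out = df.assign(Tags=df.Tags.str.strip('[]').str.split(','))
print(out.Tags[0], out.Tags[1])  # ['Tag1', 'Tag2'] ['Tag3', 'Tag1']
```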

Your df['Tags'] column is a Series of strings. If you print it you should get ["[Tag1,Tag2]","[Tag1,Tag2,Tag3]","[Tag3,Tag1]"], which is why calling the first element of the first element actually gives you the first character of the string, rather than what you want.

One option is to parse the string after loading, with something like

df['Tags'][0] = df['Tags'][0].split(',')

But as you saw in your cited example this will give you a list that looks like

in: df['Tags'][0][0]
out: '[Tag1'

What you need is a way to parse the string while editing out multiple characters. You can use a simple regular expression to do this. Something like:

 import re
 df['Tags'][0] = re.findall(r"[\w']+", df['Tags'][0])
 print(df['Tags'][0][0])

will print:

 'Tag1'

Using the other answer involving Pandas converters you might write a converter like this:

 def clean(seq_string):
      return re.findall(r"[\w']+", seq_string)

If you don't know regular expressions, they can be quite powerful, but also unpredictable if you're not sure of the content of your input strings. The expression used here, r"[\w']+", matches runs of word characters (alphanumerics and underscores, plus apostrophes) and treats everything else as a split point for re.findall.
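Putting the regex converter together with on-load parsing gives a self-contained sketch; io.StringIO stands in for the original file.csv, whose contents are assumed from the question:

```python
import io
import re
import pandas as pd

def clean(seq_string):
    # keep runs of word characters (and apostrophes); everything else splits
    return re.findall(r"[\w']+", seq_string)

csv = io.StringIO('Title|Tags\nT1|"[Tag1,Tag2]"\n')
df = pd.read_csv(csv, sep='|', converters={'Tags': clean})
print(df['Tags'][0])  # ['Tag1', 'Tag2']
```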

Veggiet