Splitting objects of different lengths in a pandas Series

Question

Python/pandas beginner here.

I have a pandas Series (a column of a larger DataFrame) that looks like this:

0                                   ['0344010000122413']
1                                   ['0344010000132886']
2                                   ['0344010000021642']
3      ['0344010000010731', '0344010000010732', '0344...
4                                   ['0344010000025264']
Name: NUMPOINTS, Length: 271, dtype: object

Each NUMPOINT is 16 characters long. The number of NUMPOINTS per row varies from 0 to roughly 100.

As you can see, the dtype of the series is object. I want to convert each row into a real list and the numbers into integers, but the dtype and the [' and '] get in the way. Because the number of items per row varies, some functions cannot be used.

I used df['NUMPOINTS'] = df.NUMPOINTS.apply(lambda x: x[2:-2].split(',')) but that only works for rows with 1 NUMPOINT.

I used the df['NUMPOINTS'].replace(regex=True,inplace=True,to_replace=r'\D',value=r'') function, but this 'sticks' the objects together. For example, index 3 gives:

3      0344010000010731034401000001073203440100000107...

Then converting to integers gives an error.
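
To see why both attempts fall short, here is a minimal sketch, assuming each cell is the string representation of a list. The sample value is made up to match the format shown, and the `re.findall` line is an alternative added only for comparison, not one of the attempts above.

```python
import re

# Hypothetical cell with two NUMPOINTS, stored as a string (not a real list).
cell = "['0344010000010731', '0344010000010732']"

# Attempt 1: cutting off [' and '] and splitting on commas leaves
# quotes and spaces attached to the pieces.
sliced = cell[2:-2].split(",")
print(sliced)

# Attempt 2: deleting every non-digit also deletes the separators,
# so both numbers are fused into one long string.
fused = re.sub(r"\D", "", cell)
print(fused)

# For comparison: findall keeps each run of digits as a separate item.
separate = re.findall(r"\d+", cell)
print(separate)
```

The first piece produced by the slicing attempt, `0344010000010731'`, has the same shape as the value in the ValueError further down, which suggests the column may already have been altered by that attempt before the other fixes were tried.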

I also tried the solutions in the question "pandas - convert string into list of strings", but they did not do the job either. Am I missing something here?

EDIT: Trying the updated answer from Andrej Kesely (https://stackoverflow.com/users/10035985/andrej-kesely) gives me:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-374-5f4f43cc7fc1> in <module>()
      1 from ast import literal_eval
      2 df["NUMPOINTS"] = df["NUMPOINTS"].apply(
----> 3     lambda x: [
      4         int(value) for value in (literal_eval(x) if isinstance(x, str) else x)
      5     ]

2 frames
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-374-5f4f43cc7fc1> in <listcomp>(.0)
      2 df["NUMPOINTS"] = df["NUMPOINTS"].apply(
      3     lambda x: [
----> 4         int(value) for value in (literal_eval(x) if isinstance(x, str) else x)
      5     ]
      6 )

ValueError: invalid literal for int() with base 10: "0344010000010731'"

Andrej Kesely · Answer 1 · 2021-05-27T14:11:25.413

You can apply ast.literal_eval and then int() inside a list comprehension:

from ast import literal_eval

df["NUMPOINTS"] = df["NUMPOINTS"].apply(
    lambda x: [int(value) for value in literal_eval(x)]
)
print(df)

Prints:

                            NUMPOINTS
0                   [344010000122413]
1                   [344010000132886]
2                   [344010000021642]
3  [344010000010731, 344010000010732]
4                   [344010000025264]
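
A self-contained version of the above, as a sketch: the sample DataFrame is hypothetical and assumes every cell is a clean string representation of a list, including an empty one.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical sample data: each cell is a *string* that looks like a list.
df = pd.DataFrame(
    {
        "NUMPOINTS": [
            "['0344010000122413']",
            "['0344010000010731', '0344010000010732']",
            "[]",
        ]
    }
)

# literal_eval parses the string into a real list of strings;
# int() then converts each element.
df["NUMPOINTS"] = df["NUMPOINTS"].apply(
    lambda x: [int(value) for value in literal_eval(x)]
)

result = df["NUMPOINTS"].tolist()
print(result)
```

Note that converting to int drops the leading zero of each NUMPOINT. If that zero is significant, keep the elements as strings by leaving out the int() call.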

EDIT:

If your column contains a mix of strings and lists:

df["NUMPOINTS"] = df["NUMPOINTS"].apply(
    lambda x: [
        int(value.strip("'")) for value in (literal_eval(x) if isinstance(x, str) else x)
    ]
)
print(df)
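
The same idea can be written as a named function, which is easier to extend when the values need more cleaning. This is only a sketch: `to_int_list` is a hypothetical helper name, and the set of characters it strips (spaces and quotes) is an assumption about what the "dirty" values contain.

```python
from ast import literal_eval


def to_int_list(cell):
    """Convert one cell to a list of ints (hypothetical helper)."""
    # A cell may be a string such as "['01', '02']" or already a real list.
    values = literal_eval(cell) if isinstance(cell, str) else cell
    # Strip stray spaces and quote characters before calling int().
    return [int(str(value).strip(" '\"")) for value in values]


print(to_int_list("['0344010000010731', '0344010000010732']"))
print(to_int_list(["0344010000010731'", " '0344010000010732"]))
print(to_int_list("[]"))
```

It would be applied with `df["NUMPOINTS"].apply(to_int_list)`.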

edited May 27 '21 at 14:11

answered May 27 '21 at 13:23

Andrej Kesely


This gives me this error: ```ValueError: malformed node or string: ``` – QB-science May 27 '21 at 13:33
@QB-science Do you have `NaN` values in the column? Do you have some strings other than `[ ... ]` in the column? – Andrej Kesely May 27 '21 at 13:38
No ``` NaN``` but there are some rows with only ```[]``` – QB-science May 27 '21 at 13:40
@QB-science That shouldn't be a problem. Edit your question and put full error traceback there (with correct formatting). – Andrej Kesely May 27 '21 at 13:41
Thanks for your help, but that still does not do the job. I updated the error traceback. – QB-science May 27 '21 at 14:00
@QB-science Try `int(value.strip("'"))` I updated my answer. But really, seems you have "dirty" data in your column... – Andrej Kesely May 27 '21 at 14:12
Can I also use `int(value.strip("'"))` with an extra `.strip` for spaces? – QB-science May 27 '21 at 14:28
@QB-science Yes, clean your values so that each one is a valid integer: without spaces, `'`, etc. – Andrej Kesely May 27 '21 at 14:29
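
When the data is suspected to be "dirty", it can help to list the offending cells before cleaning anything. A minimal sketch, assuming the same string-or-list cells as above; `parse_error` is a hypothetical helper and the sample cells are made up from the error messages quoted in this thread.

```python
from ast import literal_eval


def parse_error(cell):
    """Return None if the cell converts cleanly, else a short error text."""
    try:
        values = literal_eval(cell) if isinstance(cell, str) else cell
        for value in values:
            int(value)
    except (ValueError, SyntaxError, TypeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


# Hypothetical cells: two clean ones, one with a stray quote, one malformed.
cells = [
    "['0344010000122413']",
    ["0344010000010731'"],
    "[]",
    "[0 344010000122413 ]",
]

# Map each failing position to its error description.
bad = {}
for position, cell in enumerate(cells):
    error = parse_error(cell)
    if error is not None:
        bad[position] = error

print(bad)
```

On a DataFrame, something like `df["NUMPOINTS"].apply(parse_error).dropna()` should give the index labels of the rows that need attention.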

Anurag Dabas · Answer 2 · 2021-05-27T14:12:28.803

You can also do this with map() and pd.eval():

import pandas as pd

# int() already accepts leading zeros, so only the stray quote is stripped
df['NUMPOINTS'] = df['NUMPOINTS'].map(
    lambda x: [int(y.rstrip("'")) for y in (pd.eval(x) if isinstance(x, str) else x)]
)

Now if you print df you will get:

                            NUMPOINTS
0                   [344010000122413]
1                   [344010000132886]
2                   [344010000021642]
3  [344010000010731, 344010000010732]
4                   [344010000025264]

edited May 27 '21 at 14:12

answered May 27 '21 at 13:27

Anurag Dabas


This solution gives me this Syntax Error: ```File "", line 1 [0 344010000122413 ] ^ SyntaxError: invalid syntax ``` – QB-science May 27 '21 at 13:32
Thanks for your help, but the updated answer gives me ``` File "", line 1 [0 344010000122413 ] ^ SyntaxError: invalid syntax ``` – QB-science May 27 '21 at 13:38
