substring of an entire column in pandas dataframe

Question

I have a pandas dataframe "df". In this dataframe I have multiple columns, one of which I have to substring. Lets say the column name is "col". I can run a "for" loop like below and substring the column:

for i in range(0,len(df)):
  df.iloc[i].col = df.iloc[i].col[:9]

But I wanted to know, if there is an option where I don't have to use a "for" loop, and do it directly using an attribute.I have huge amount of data, and if I do this, the data will take a very long time process.

ayhan · Accepted Answer · 2019-12-05T08:32:27.103

222

Use the str accessor with square brackets:

df['col'] = df['col'].str[:9]

Or str.slice:

df['col'] = df['col'].str.slice(0, 9)

edited Dec 05 '19 at 08:32

answered Apr 08 '16 at 17:42

ayhan

70,170
20
182
203

5

This gives me the dreaded `SettingWithCopyWarning:` – demongolem Jun 17 '20 at 11:27
great solution! but curious about which one is faster on large dataset... And how about compare to `df['col'] = [x[:9] for x in df['col']]` – Peter Chen Oct 06 '20 at 22:02
You can use the suggested solution with `pd.options.mode.chained_assignment = None # default='warn' ` to get rid of the warning. Alternatively, you can look at these topics and lose a few minutes of your life: [link](https://stackoverflow.com/questions/42379818/correct-way-to-set-new-column-in-pandas-dataframe-to-avoid-settingwithcopywarnin) – Charles Mar 22 '21 at 10:04
Doesn't work for me : `val_data['col2'] = val_data['col1'].str[:4]` >> `ValueError: Length mismatch: Expected axis has 9 elements, new values have 8 elements. ` – GenDemo May 26 '23 at 02:14

score 14 · Answer 2 · edited May 07 '22 at 18:02

14

In case the column isn't a string, use astype to convert it:

df['col'] = df['col'].astype(str).str[:9]

edited May 07 '22 at 18:02

rachwa

1,805
1
14
17

answered Jan 19 '21 at 13:59

Elton da Mata

181
1
6

score 8 · Answer 3 · answered Oct 04 '22 at 13:06

As one doesn't know exactly OP's dataframe, one can create one to be used as test.

df = pd.DataFrame({'col': {0: '2020-12-08', 1: '2020-12-08', 2: '2020-12-08', 3: '2020-12-08', 4: '2020-12-08', 5: '2020-12-08', 6: '2020-12-08', 7: '2020-12-08', 8: '2020-12-08', 9: '2020-12-08'}})

[Out]:
          col
0  2020-12-08
1  2020-12-08
2  2020-12-08
3  2020-12-08
4  2020-12-08
5  2020-12-08
6  2020-12-08
7  2020-12-08
8  2020-12-08
9  2020-12-08

Assuming one wants to store the column in the same dataframe df, and that we want to keep only 4 characters, on a column called col_substring, there are various options one can do.

Option 1

Using pandas.Series.str

df['col_substring'] = df['col'].str[:4]

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Option 2

Using pandas.Series.str.slice as follows

df['col_substring'] = df['col'].str.slice(0, 4)

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

or like this

df['col_substring'] = df['col'].str.slice(stop=4)

Option 3

Using a custom lambda function

df['col_substring'] = df['col'].apply(lambda x: x[:4])

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Option 4

Using a custom lambda function with a regular expression (with re)

import re

df['col_substring'] = df['col'].apply(lambda x: re.findall(r'^.{4}', x)[0])

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Option 5

Using numpy.vectorize

df['col_substring'] = np.vectorize(lambda x: x[:4])(df['col'])

[Out]:

          col col_substring
0  2020-12-08          2020
1  2020-12-08          2020
2  2020-12-08          2020
3  2020-12-08          2020
4  2020-12-08          2020
5  2020-12-08          2020
6  2020-12-08          2020
7  2020-12-08          2020
8  2020-12-08          2020
9  2020-12-08          2020

Note:

The ideal solution would depend on the use case, constraints, and the dataframe.

Option 3 combines nicely with `find`, `rfind` and other string methods that you'd typically combine with slicing. — RubenGeert, Mar 16 '23 at 17:54

substring of an entire column in pandas dataframe

3 Answers3

Linked

Related