Remove ends of string entries in pandas DataFrame column

Question

I have a pandas DataFrame in which one column holds a list of file names:

import pandas as pd
df = pd.read_csv('fname.csv')

df.head()

filename    A    B    C
fn1.txt   2    4    5
fn2.txt   1    2    1
fn3.txt   ....
....

I would like to delete the file extension .txt from each entry in filename. How do I accomplish this?

I tried:

df['filename'] = df['filename'].map(lambda x: str(x)[:-4])

but when I look at the column entries afterwards with df.head(), nothing has changed.

How does one do this?

is your data `'fn1.txt'` or `fn1.txt`? so do you have quoted strings as your data? — EdChum, May 03 '16 at 10:35

jezrael · Accepted Answer · 2016-05-03T18:35:47.333

54

I think you can use str.replace with the regex \.txt$ ($ matches the end of the string; the dot is escaped so that it matches a literal period):

import pandas as pd

df = pd.DataFrame({'A': {0: 2, 1: 1}, 
                   'C': {0: 5, 1: 1}, 
                   'B': {0: 4, 1: 2}, 
                   'filename': {0: "txt.txt", 1: "x.txt"}}, 
                columns=['filename','A','B', 'C'])

print(df)
  filename  A  B  C
0  txt.txt  2  4  5
1    x.txt  1  2  1

# Each alternative below is shown applied to the original df.
df['filename'] = df['filename'].str.replace(r'\.txt$', '', regex=True)
print(df)
  filename  A  B  C
0      txt  2  4  5
1        x  1  2  1

df['filename'] = df['filename'].map(lambda x: str(x)[:-4])
print(df)
  filename  A  B  C
0      txt  2  4  5
1        x  1  2  1

df['filename'] = df['filename'].str[:-4]
print(df)
  filename  A  B  C
0      txt  2  4  5
1        x  1  2  1

EDIT:

rstrip can remove more characters than intended: it strips any trailing characters that appear in the given string (in this case ., t, x), not the literal suffix:

Example:

print(df)
  filename  A  B  C
0  txt.txt  2  4  5
1    x.txt  1  2  1

df['filename'] = df['filename'].str.rstrip('.txt')

print(df)
  filename  A  B  C
0           2  4  5
1           1  2  1
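
To make the rstrip pitfall concrete, here is a small sketch comparing it with an anchored regex. The sample file names are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample names, chosen to expose the difference.
names = pd.Series(["txt.txt", "x.txt", "report.txt", "notes.csv"])

# rstrip treats '.txt' as the character set {'.', 't', 'x'} and strips any
# trailing run of those characters, so it can eat part of the base name.
stripped = names.str.rstrip(".txt")

# The anchored regex removes only a literal '.txt' at the very end.
anchored = names.str.replace(r"\.txt$", "", regex=True)

print(stripped.tolist())  # ['', '', 'repor', 'notes.csv']
print(anchored.tolist())  # ['txt', 'x', 'report', 'notes.csv']
```

Note that 'report.txt' loses its final 't' under rstrip, which is easy to miss in a large column.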

edited May 03 '16 at 18:35

answered May 03 '16 at 10:38

jezrael


I get the following `AttributeError`: AttributeError: 'str' object has no attribute 'str' – ShanZhengYang May 03 '16 at 10:48

Try casting the column `filename` to string: `df['filename'] = df['filename'].astype(str)` – jezrael May 03 '16 at 10:49
Even casting the column as a string, none of these methods work – ShanZhengYang May 03 '16 at 11:10
It works with dataframe in my solution? `df = pd.DataFrame({'A': {0: 2, 1: 1}, 'C': {0: 5, 1: 1}, 'B': {0: 4, 1: 2}, 'filename': {0: "txt.txt", 1: "x.txt"}}, columns=['filename','A','B', 'C'])` ? – jezrael May 03 '16 at 11:12
What is your version of `pandas`? `pd.show_versions()` – jezrael May 03 '16 at 11:13
I have an idea: maybe `reset_index` will help. Try `df = df.reset_index()`, because the first column `filename` may not be a column but the `index`. – jezrael May 03 '16 at 11:19
'0.18.0' is the version for pandas – ShanZhengYang May 03 '16 at 11:24
Super. And I think resetting the index will help. – jezrael May 03 '16 at 11:25
For `type(dropped_newdf['filename'].values)`, I get `numpy.ndarray`. Resetting the index didn't work – ShanZhengYang May 03 '16 at 11:26
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/110901/discussion-between-jezrael-and-shanzhengyang). – jezrael May 03 '16 at 11:28
Sorry, but what do you get for `type(dropped_newdf.loc[0, 'filename'])`? – jezrael May 03 '16 at 12:20
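
The thread above does not establish what was wrong with the asker's data, so the following is only a sketch of the scenario raised in the comments: the file names living in the index rather than in a column. The frame below is hypothetical.

```python
import pandas as pd

# Hypothetical frame where the file names ended up in the index,
# e.g. after set_index or read_csv(..., index_col=0).
df = pd.DataFrame(
    {"A": [2, 1], "B": [4, 2], "C": [5, 1]},
    index=pd.Index(["fn1.txt", "fn2.txt"], name="filename"),
)

# Quick check: is 'filename' really a column?
is_column = "filename" in df.columns
print(is_column)  # False

# Index objects expose the same .str accessor, so the names can be
# trimmed directly without calling reset_index first.
df.index = df.index.str.replace(r"\.txt$", "", regex=True)
print(df)
```

If `is_column` is True instead, the column-based approaches from the answer apply unchanged.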

EdChum · Answer 2 · 2016-05-03T10:39:28.907

12

You can use str.rstrip to remove the endings:

df['filename'] = df['filename'].str.rstrip('.txt')

should work

edited May 03 '16 at 10:39

answered May 03 '16 at 10:33

EdChum


I don't know; it should have worked. But the question remains: does your data have quotes or not? – EdChum May 03 '16 at 10:36
When I use `df.head()`, I see no quotes – ShanZhengYang May 03 '16 at 10:46
So, when I try the above code, I get the following error 'AttributeError: 'str' object has no attribute 'str'' – ShanZhengYang May 03 '16 at 10:47
what version of pandas are you using? vectorised `str` accessor was added in `0.15.0` I think – EdChum May 03 '16 at 10:49

I think the `rstrip` method can be problematic if the last character of the file name is `t`, `e` or `x`: that character is then removed too. Try `df = pd.DataFrame({'A': {0: 2, 1: 1}, 'C': {0: 5, 1: 1}, 'B': {0: 4, 1: 2}, 'filename': {0: "test.txt", 1: "x.txt"}}, columns=['filename','A','B', 'C'])` – jezrael May 03 '16 at 10:55

Sorry, not `e`. only chars `t` and `x`. – jezrael May 03 '16 at 11:10

score 8 · Answer 3 · answered May 03 '16 at 11:44


You may want:

df['filename'] = df.apply(lambda x: x['filename'][:-4], axis = 1)


Paweł Kordek


How about if, instead of a fixed size of -4, I need something more flexible, say, getting rid of the last words after the comma or period? – Ricardo Sanchez Jun 03 '21 at 12:20
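
One possible way to handle the more flexible case raised in this comment is a regex that matches the last comma or period and everything after it. This is a sketch with made-up sample strings, not something proposed in the thread.

```python
import pandas as pd

# Hypothetical sample values.
s = pd.Series(["fn1.txt", "archive.tar.gz", "Smith, John", "noseparator"])

# [.,]     the last '.' or ','
# [^.,]*$  followed only by characters that are neither, up to the end
# Strings without either separator are left unchanged.
trimmed = s.str.replace(r"[.,][^.,]*$", "", regex=True)

print(trimmed.tolist())  # ['fn1', 'archive.tar', 'Smith', 'noseparator']
```

Only the final segment is removed, so 'archive.tar.gz' keeps its '.tar' part.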

score 2 · Answer 4 · answered May 03 '16 at 13:59

Use a list comprehension:

df['filename'] = [x[:-4] for x in df['filename']]


Blue Moon


I want to delete the last or first character if that character is "X". Example please – Rakesh Bhagam May 24 '20 at 10:44
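
For the case asked about in this comment, an anchored regex is one option: it removes a single "X" at the start and a single "X" at the end, and leaves any "X" in the middle alone. The sample values below are invented for illustration.

```python
import pandas as pd

# Hypothetical sample values.
s = pd.Series(["Xabc", "abcX", "XabcX", "abc", "aXb"])

# ^X matches an 'X' at the very start, X$ an 'X' at the very end.
trimmed = s.str.replace(r"^X|X$", "", regex=True)

print(trimmed.tolist())  # ['abc', 'abc', 'abc', 'abc', 'aXb']
```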

score 2 · Answer 5 · answered Dec 01 '21 at 00:18

update 2021 + speedtest

Starting from pandas 1.4, pandas.Series.str.removesuffix, the equivalent of Python's built-in str.removesuffix, is available, so one can use

df['filename'] = df['filename'].str.removesuffix('.txt')
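
A minimal sketch of how this might be used while still supporting older pandas versions. The helper name `drop_suffix` and the sample data are hypothetical, and the fallback assumes a non-empty suffix.

```python
import pandas as pd

df = pd.DataFrame({"filename": ["fn1.txt", "fn2.txt", "fn3.csv"]})


def drop_suffix(col: pd.Series, suffix: str) -> pd.Series:
    """Hypothetical helper: strip `suffix` only where it is present."""
    if hasattr(col.str, "removesuffix"):  # available in pandas >= 1.4
        return col.str.removesuffix(suffix)
    # Fallback for older pandas (assumes suffix is non-empty):
    # keep rows that do not end with the suffix, slice the rest.
    return col.where(~col.str.endswith(suffix), col.str[: -len(suffix)])


# The result is a new Series, so it has to be assigned back.
df["filename"] = drop_suffix(df["filename"], ".txt")
print(df["filename"].tolist())  # ['fn1', 'fn2', 'fn3.csv']
```

Unlike unconditional slicing with `[:-4]`, both branches leave 'fn3.csv' untouched.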

speed test

tl;dr: the fastest is

dat["fname"].map(lambda x: x[:-4] if x[-4:] == ".txt" else x)

In the speed test, I wanted to compare the different methods collected on this SO page. I excluded rstrip because it would strip endings other than .txt too. Since the regexp is conditional (it only removes a trailing .txt), it is only fair to modify the other functions as well, so that they remove the last 4 characters only if those characters are .txt.

The testing code is

import pandas as pd
import time

ITER = 10


def rm_re(dat: pd.DataFrame) -> pd.Series:
    """Use regular expression."""
    return dat["fname"].str.replace(r'\.txt$', '', regex=True)


def rm_map(dat: pd.DataFrame) -> pd.Series:
    """Use pandas map, find occurrences and remove with []"""
    where = dat["fname"].str.endswith(".txt")
    dat.loc[where, "fname"] = dat["fname"].map(lambda x: x[:-4])
    return dat["fname"]


def rm_map2(dat: pd.DataFrame) -> pd.Series:
    """Use pandas map with lambda conditional."""
    return dat["fname"].map(lambda x: x[:-4] if x[-4:] == ".txt" else x)


def rm_apply_str_suffix(dat: pd.DataFrame) -> pd.Series:
    """Use the built-in str.removesuffix with pandas apply."""
    return dat["fname"].apply(str.removesuffix, args=(".txt",))


def rm_suffix(dat: pd.DataFrame) -> pd.Series:
    """Use pandas removesuffix, available from version 1.4."""
    return dat["fname"].str.removesuffix(".txt")


functions = [rm_map2, rm_apply_str_suffix, rm_map, rm_suffix, rm_re]
for base in range(12, 23):
    size = 2**base
    data = pd.DataFrame({"fname": ["fn"+str(i) for i in range(size)]})
    data.update(data.sample(frac=.5)["fname"]+".txt")
    for func in functions:
        diff = 0
        for _ in range(ITER):
            data_copy = data.copy()
            start = time.process_time()
            func(data_copy)
            diff += time.process_time() - start

        print(diff, end="\t")

The output is plotted below:

It can be seen from the plot that the slowest solution is the regexp, and the fastest is pandas.Series.map with a conditional. In later versions of pandas this may change, and I'd expect an improvement in pandas.Series.str.removesuffix, as it has greater potential for vectorization.

Pandas had to be installed from source as of 2021-11-30, because version 1.4 was still in the development stage. I installed it by following the instructions from the pandas dev repo: cloning the project and installing with python setup.py install.

My machine:

AMD Ryzen 5 2400G with Radeon Vega Graphics, 3.60 GHz
Windows 10 20H2
Python 3.10.0, pandas.version '1.4.0.dev0+1267.gaee662a7e3', numpy.version '1.21.4'

score 0 · Answer 6 · edited Oct 23 '22 at 16:07


I had the same problem. You can simply do:

df['filename'] = df['filename'].astype('str').str.rstrip('.0')

Remember to add .astype('str') to cast the column to str; otherwise, you might get the following error:

AttributeError: Can only use .str accessor with string values!
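
One caveat worth checking before relying on rstrip('.0'): like rstrip('.txt'), it strips a set of characters, so trailing zeros that belong to the value are removed as well. The sketch below assumes the column holds floats rendered as strings, which the answer does not state explicitly.

```python
import pandas as pd

# Assumed scenario: numeric IDs that were read as floats.
ids = pd.Series([10.0, 200.0, 3.5]).astype(str)  # '10.0', '200.0', '3.5'

# rstrip('.0') strips any trailing '.' and '0' characters.
risky = ids.str.rstrip(".0")

# An anchored regex removes only a literal trailing '.0'.
safe = ids.str.replace(r"\.0$", "", regex=True)

print(risky.tolist())  # ['1', '2', '3.5']
print(safe.tolist())   # ['10', '200', '3.5']
```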


Javad


answered Sep 28 '22 at 07:30

Pedro Nalin
