How to remove everything after the last occurrence of a delimiter?

Question

I want to remove everything after the last occurrence of the _ delimiter in the HTAN Parent Biospecimen ID column.

import pandas as pd
df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [41], in <cell line: 3>()
      1 # BulkRNA-seqLevel1
      2 df_2 = pd.read_csv("syn39282161.csv", sep=",")
----> 3 df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)
      4 df_2.head()

File ~/.local/lib/python3.9/site-packages/pandas/core/strings/accessor.py:129, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    124     msg = (
    125         f"Cannot use .str.{func_name} with values of "
    126         f"inferred dtype '{self._inferred_dtype}'."
    127     )
    128     raise TypeError(msg)
--> 129 return func(self, *args, **kwargs)

TypeError: rsplit() takes from 1 to 2 positional arguments but 3 were given

Data:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07_001',
  1: 'HTA10_07_001',
  2: 'HTA10_07_006',
  3: 'HTA10_07_006'}})

Expected output:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07_001',
  1: 'HTA10_07',
  2: 'HTA10_07',
  3: 'HTA10_07'}})

Can you check your output? I don't understand why the first row is still HTA10_07_001? — Corralien, May 10 '23 at 13:24

score 1 · Answer 1 · answered May 10 '23 at 13:20

Try this:

df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].apply(lambda x:"_".join(x.split("_")[:-1]))
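A possible variant of the same idea, sketched under assumptions: Python's built-in str.rsplit (unlike the pandas .str accessor) still accepts maxsplit positionally, so the split/join round trip can be avoided. drop_last_segment is a hypothetical helper name, and the second sample value is made up for illustration. Unlike the join version, this leaves a value with no delimiter unchanged instead of turning it into an empty string.

```python
import pandas as pd

def drop_last_segment(value, sep="_"):
    """Hypothetical helper: drop the last `sep` and everything after it."""
    if not isinstance(value, str):
        return value  # leave missing or non-string values untouched
    return value.rsplit(sep, 1)[0]

# First value is from the question; "B001" is an invented no-delimiter case.
ids = pd.Series(["HTA10_07_001", "B001"])
trimmed = ids.apply(drop_last_segment)
print(trimmed.tolist())  # ['HTA10_07', 'B001']
```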

Mouad Slimane

Don't use apply when you can use vectorization. – Corralien May 10 '23 at 13:28
I'm not really familiar with vectorization. I use apply a lot at work; could this impact my code's performance? – Mouad Slimane May 10 '23 at 13:30
Apply is like a Python loop. You can read https://stackoverflow.com/questions/38697404/pandas-explanation-on-apply-function-being-slow – Corralien May 10 '23 at 13:38
Yes, vectorization is much faster and more memory efficient. However, here, for strings, there is no real vectorized solution. – mozway May 10 '23 at 13:38

score 0 · Answer 2 · answered May 10 '23 at 13:22

Earlier versions of pandas had pat and n as positional arguments, such that you could do .rsplit('_', 1) and it would work well. For example, take a look at the docs for the function signature for .str.rsplit @ pandas 1.0:

Series.str.rsplit(self, pat=None, n=-1, expand=False)

Newer versions have defined n to be a keyword-only argument, such that you have to define n=1 explicitly now, instead of just using 1 positionally. Take the docs for .str.rsplit @ pandas 2.0:

Series.str.rsplit(pat=None, *, n=-1, expand=False)

Notice how * is defined after pat=None, indicating that the only way to pass the parameter n now is via a keyword arg.

In a nutshell, you have to change from

df_2[col].str.rsplit("_", 1).str.get(0)

to

df_2[col].str.rsplit("_", n=1).str.get(0)

and that way, it will work for all pandas versions.
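As a minimal, self-contained sketch of the keyword form (the Series below is built by hand from the IDs in the question rather than read from the CSV, and the variable names are only illustrative):

```python
import pandas as pd

# IDs copied from the question's sample data.
ids = pd.Series(
    ["HTA10_07_001", "HTA10_07_001", "HTA10_07_006", "HTA10_07_006"],
    name="HTAN Parent Biospecimen ID",
)

# Split once from the right, passing n by keyword, then keep the left part.
trimmed = ids.str.rsplit("_", n=1).str.get(0)
print(trimmed.tolist())  # ['HTA10_07', 'HTA10_07', 'HTA10_07', 'HTA10_07']
```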

score 0 · Accepted Answer · answered May 10 '23 at 13:25

You can use str.replace:

>>> df['HTAN Parent Biospecimen ID'].str.replace(r'_\d+$', '', regex=True)
0    HTA10_07
1    HTA10_07
2    HTA10_07
3    HTA10_07
Name: HTAN Parent Biospecimen ID, dtype: object

Explanation of the regex (Regex 101): _\d+$ matches an underscore followed by one or more digits at the end of the string.
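Note that this pattern only strips a trailing segment made of digits. If the last segment might contain other characters, one possible variant (an assumption about the data, not something the question states) is to match any run of non-underscore characters after the last underscore. The second and third sample values below are invented for illustration.

```python
import pandas as pd

# First value is from the question; the other two are made-up edge cases.
ids = pd.Series(["HTA10_07_001", "HTA10_07_00A", "NODELIM"])

# Digits-only pattern: leaves "HTA10_07_00A" untouched.
digits_only = ids.str.replace(r"_\d+$", "", regex=True)

# Broader pattern: removes the last underscore and whatever follows it.
any_suffix = ids.str.replace(r"_[^_]*$", "", regex=True)

print(digits_only.tolist())  # ['HTA10_07', 'HTA10_07_00A', 'NODELIM']
print(any_suffix.tolist())   # ['HTA10_07', 'HTA10_07', 'NODELIM']
```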

Corralien