I want to remove everything after the last occurrence of the _ delimiter in the HTAN Parent Biospecimen ID column.

import pandas as pd
df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [41], in <cell line: 3>()
      1 # BulkRNA-seqLevel1
      2 df_2 = pd.read_csv("syn39282161.csv", sep=",")
----> 3 df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)
      4 df_2.head()

File ~/.local/lib/python3.9/site-packages/pandas/core/strings/accessor.py:129, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    124     msg = (
    125         f"Cannot use .str.{func_name} with values of "
    126         f"inferred dtype '{self._inferred_dtype}'."
    127     )
    128     raise TypeError(msg)
--> 129 return func(self, *args, **kwargs)

TypeError: rsplit() takes from 1 to 2 positional arguments but 3 were given

Data:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07_001',
  1: 'HTA10_07_001',
  2: 'HTA10_07_006',
  3: 'HTA10_07_006'}})

Expected output:

pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
  1: 'BulkRNA-seqLevel1',
  2: 'BulkRNA-seqLevel1',
  3: 'BulkRNA-seqLevel1'},
 'Filename': {0: 'B001A001_1.fq.gz',
  1: 'B001A001_2.fq.gz',
  2: 'B001A006_1.fq.gz',
  3: 'B001A006_2.fq.gz'},
 'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
 'HTAN Parent Biospecimen ID': {0: 'HTA10_07',
  1: 'HTA10_07',
  2: 'HTA10_07',
  3: 'HTA10_07'}})
melolili

3 Answers

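Try this, which splits each value on _ and rejoins every piece except the last. Note that a value containing no _ becomes an empty string: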

df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].apply(lambda x:"_".join(x.split("_")[:-1]))
Mouad Slimane

Earlier versions of pandas had pat and n as positional arguments, such that you could do .rsplit('_', 1) and it would work well. For example, take a look at the docs for the function signature for .str.rsplit @ pandas 1.0:

Series.str.rsplit(self, pat=None, n=-1, expand=False)

Newer versions have defined n to be a keyword-only argument, such that you have to define n=1 explicitly now, instead of just using 1 positionally. Take the docs for .str.rsplit @ pandas 2.0:

Series.str.rsplit(pat=None, *, n=-1, expand=False)

Notice how * is defined after pat=None, indicating that the only way to pass the parameter n now is via a keyword arg.
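The effect of the bare * can be reproduced in plain Python without pandas. The sketch below uses a hypothetical function, rsplit_like, that only mimics the shape of the newer signature; it is not how pandas implements .str.rsplit.

```python
# Hypothetical stand-in that copies only the *shape* of the newer signature:
# everything after the bare * must be passed by keyword.
def rsplit_like(text, pat=None, *, n=-1):
    return text.rsplit(pat, n)

# Passing n by keyword works.
keyword_result = rsplit_like("HTA10_07_001", "_", n=1)

# Passing n positionally raises TypeError, as in the traceback above.
try:
    rsplit_like("HTA10_07_001", "_", 1)
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

print(keyword_result)
print(positional_error)
```

The TypeError is raised when the call is bound to the signature, before the function body runs, which is why the pandas traceback points at the wrapper rather than at any splitting logic.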

In a nutshell, you have to change from

df_2[col].str.rsplit("_", 1).str.get(0)

to

df_2[col].str.rsplit("_", n=1).str.get(0)

and that way, it will work for all pandas versions.
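As a quick check against the sample data from the question, here is a minimal sketch. Only the ID column is rebuilt, and the name trimmed is introduced purely for illustration.

```python
import pandas as pd

# Only the relevant column from the question's sample data.
col = "HTAN Parent Biospecimen ID"
df_2 = pd.DataFrame(
    {col: ["HTA10_07_001", "HTA10_07_001", "HTA10_07_006", "HTA10_07_006"]}
)

# Split once from the right on "_" and keep the left-hand part.
trimmed = df_2[col].str.rsplit("_", n=1).str.get(0)
print(trimmed.tolist())
```

A value with no _ at all is left unchanged, because rsplit then returns a single-element list and .str.get(0) picks that element.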

rafaelc

You can use str.replace:

>>> df['HTAN Parent Biospecimen ID'].str.replace(r'_\d+$', '', regex=True)
0    HTA10_07
1    HTA10_07
2    HTA10_07
3    HTA10_07
Name: HTAN Parent Biospecimen ID, dtype: object

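Explanation of the regex: _ matches a literal underscore, \d+ matches one or more digits, and $ anchors the match to the end of the string, so only a trailing underscore followed by digits is removed.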
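The pattern in this answer only strips a suffix made of digits. If the part after the last _ may contain other characters, a broader pattern such as _[^_]*$ could be used instead. The sketch below compares the two on a small Series; the values "HTA10_07_A1" and "HTA10" are made up for illustration and do not come from the question's data.

```python
import pandas as pd

# The first value is from the question; the other two are hypothetical.
ids = pd.Series(["HTA10_07_001", "HTA10_07_A1", "HTA10"])

# Pattern from the answer: removes a trailing "_" followed by digits only.
digits_only = ids.str.replace(r"_\d+$", "", regex=True)

# Broader pattern: removes the last "_" and everything after it.
any_suffix = ids.str.replace(r"_[^_]*$", "", regex=True)

print(digits_only.tolist())
print(any_suffix.tolist())
```

Both patterns leave a value without any _ untouched, since there is nothing for the regex to match.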

Corralien