I want to remove the last _ delimiter, and everything after it, from each value in the HTAN Parent Biospecimen ID column.
Code:
import pandas as pd
df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)
Traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [41], in <cell line: 3>()
1 # BulkRNA-seqLevel1
2 df_2 = pd.read_csv("syn39282161.csv", sep=",")
----> 3 df_2["HTAN Parent Biospecimen ID"] = df_2["HTAN Parent Biospecimen ID"].str.rsplit("_", 1).str.get(0)
4 df_2.head()
File ~/.local/lib/python3.9/site-packages/pandas/core/strings/accessor.py:129, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
124 msg = (
125 f"Cannot use .str.{func_name} with values of "
126 f"inferred dtype '{self._inferred_dtype}'."
127 )
128 raise TypeError(msg)
--> 129 return func(self, *args, **kwargs)
TypeError: rsplit() takes from 1 to 2 positional arguments but 3 were given
Data:
pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
1: 'BulkRNA-seqLevel1',
2: 'BulkRNA-seqLevel1',
3: 'BulkRNA-seqLevel1'},
'Filename': {0: 'B001A001_1.fq.gz',
1: 'B001A001_2.fq.gz',
2: 'B001A006_1.fq.gz',
3: 'B001A006_2.fq.gz'},
'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
'HTAN Parent Biospecimen ID': {0: 'HTA10_07_001',
1: 'HTA10_07_001',
2: 'HTA10_07_006',
3: 'HTA10_07_006'}})
Expected output:
pd.DataFrame({'Component': {0: 'BulkRNA-seqLevel1',
1: 'BulkRNA-seqLevel1',
2: 'BulkRNA-seqLevel1',
3: 'BulkRNA-seqLevel1'},
'Filename': {0: 'B001A001_1.fq.gz',
1: 'B001A001_2.fq.gz',
2: 'B001A006_1.fq.gz',
3: 'B001A006_2.fq.gz'},
'File Format': {0: 'fastq', 1: 'fastq', 2: 'fastq', 3: 'fastq'},
'HTAN Parent Biospecimen ID': {0: 'HTA10_07',
1: 'HTA10_07',
2: 'HTA10_07',
3: 'HTA10_07'}})
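For reference, a minimal sketch of one likely fix. It assumes the error comes from a pandas version in which the split limit of Series.str.rsplit must be passed by keyword (n=1) rather than positionally. The frame below is a shortened copy of the sample data, and the col variable is introduced here only for brevity.

```python
import pandas as pd

# Shortened copy of the sample data from the question.
df_2 = pd.DataFrame({
    "Filename": ["B001A001_1.fq.gz", "B001A001_2.fq.gz",
                 "B001A006_1.fq.gz", "B001A006_2.fq.gz"],
    "HTAN Parent Biospecimen ID": ["HTA10_07_001", "HTA10_07_001",
                                   "HTA10_07_006", "HTA10_07_006"],
})

col = "HTAN Parent Biospecimen ID"  # helper name, not in the original code

# Split once from the right on "_" and keep the left-hand part.
# The limit is passed as a keyword (n=1) instead of positionally.
df_2[col] = df_2[col].str.rsplit("_", n=1).str.get(0)

print(df_2[col].tolist())
# ['HTA10_07', 'HTA10_07', 'HTA10_07', 'HTA10_07']
```

Values that contain no _ are left unchanged by this approach, because rsplit then returns a one-element list and .str.get(0) returns the original string.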