4

I have a file named df that looks like this:

Size       ID      File
500 TB     A       200 TB 
200 TB     B       100 TB
600 TB     C       300 TB

The numerical value along with the text, which is always 'TB', are within one column. How would I transform this and remove the 'TB' text from both columns to give me the desired output of:

Size       ID      File
500        A       200 
200        B       100 
600        C       300 

This is what I am doing:

import numpy as np
import pandas as pd

df = df[df[","] > ,] 

I am still researching this. Any insight will be helpful.

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Lynn
  • 4,292
  • 5
  • 21
  • 44

5 Answers5

3

Let's try str.extract:

df[['Size','File']] = df[['Size','File']].apply(lambda x: x.str.extract('^(\d+)')[0])

Output:

  Size ID File
0  500  A  200
1  200  B  100
2  600  C  300
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
3
  • Apply str.split to the columns with pandas.DataFrame.apply, and then select the first element from the list created by .split, with .str[0].
  • This will work, as long as the pattern shown in the sample is consistent, with the undesired text after the space.
  • Using .apply in this way, will apply the lambda function to all the columns.
    • If ID has values with spaces, then this solution will cause an issue there, which can be resolved by using apply only on the columns that need to be fixed.
      • df[['Size', 'File']] = df[['Size', 'File']].apply(lambda x: x.str.split(' ').str[0])
    • If there was only one column to fix, then .apply isn't required.
      • df['Size'] = df['Size'].str.split(' ').str[0]
import pandas as pd

# test dataframe
df =  pd.DataFrame({'Size': ['500 TB', '200 TB', '600 TB'], 'ID': ['A', 'B', 'C'], 'File': ['200 TB ', '100 TB', '300 TB']})

# display(df)
     Size ID     File
0  500 TB  A  200 TB 
1  200 TB  B   100 TB
2  600 TB  C   300 TB

# apply str.split and then select the value at index [0]
df = df.apply(lambda x: x.str.split(' ').str[0])

# display(df)
  Size ID File
0  500  A  200
1  200  B  100
2  600  C  300
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
2

Try str.strip

df[['Size','File']] = df[['Size','File']].apply(lambda x: x.str.strip(' TB'))
BENY
  • 317,841
  • 20
  • 164
  • 234
1

You can use this

df['Size'] = df['Size'].map(lambda x: x.lstrip('+-').rstrip('TB'))
df['File'] = df['File'].map(lambda x: x.lstrip('+-').rstrip('TB'))
Dexter0411
  • 104
  • 1
  • 6
1

You can also replace:

df[['Size','File']] = df[['Size','File']].replace('\s+?TB','',regex=True)

  Size ID File
0  500  A  200
1  200  B  100
2  600  C  300
anky
  • 74,114
  • 11
  • 41
  • 70