0

I have a data frame where numeric data is stored in String with some Prefix character which I need to remove. On top of this it has double quotes inside the quotes i.e. ' "" '.

dict_1 = {"Col1" : [1001, 1002, 1003, 1004, 1005], 
          "Col2" : ['"Rs. 5131"', '"Rs. 0"', '"Rs 351157"', '"Rs 535391"', '"Rs. 6513"']}
a = pd.DataFrame(dict_1)
a.head(6)

|    | Col1     |  Col2       |       
|----|----------|-------------|
| 0  |1001      |"Rs. 5131"   |
| 1  |1002      |"Rs. 0"      | 
| 2  |1003      |"Rs 351157"  |    
| 3  |1004      |"Rs 535391"  |    
| 4  |1005      |"Rs. 6513"   |

As you can see I want to remove Quotes defined inside Col2 and along with this I have to remove Rs.

I tried following code to subset

b = a['Col2'][0]
b = b[5:]
b = b[:-1]
b

But the issue in some observation it is defined as Rs. and in some Rs without period.

The result should be a column of integers.

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
  • All of the existing answers are to focused on prefix / suffix. The easiest solution is to extract the digits, and convert to int: `a['Col2'] = a['Col2'].str.extract('(\d+)').astype(int)` – Trenton McKinney Apr 21 '22 at 17:17

3 Answers3

2

You can simply use removeprefix and removesuffix methods for string after you get the value of the particular columns For a complete answer as comments are demanding

col3=[]
lis = dic['col2']
for b in lis:
    b=b.removeprefix('"').removesuffix('"').removeprefix("Rs.").removeprefix("Rs ")
    col3.append(int(b))
dic['col2']=col3

By this even if there will be Rs. with a period or without period both will be removed without any error. Edit: Change suggested by @Jhanzaib Humayun. I found an easier answer out there on this link for whole of the series alltogether extract number from string

ROOP AMBER
  • 330
  • 1
  • 8
  • This is not correctly implemented. Check your code. (1) These are `.str` methods, (2) `[0]` should not be on `b = a['Col2'][0]`, (3) the strings are not correct, (4) `'Rs '` and `'Rs. '` are both prefixes, and (5) this should be assigned back to the same column in the dataframe or a new dataframe column, not a separate Series. – Trenton McKinney Apr 21 '22 at 17:04
  • 1
    The current implementation results in `AttributeError: 'Series' object has no attribute 'removeprefix'` – Trenton McKinney Apr 21 '22 at 17:05
  • @TrentonMcKinney All the code I have written is after the person asking the question gets the particular cell and hence its value would be a string. As 'Rs ' and 'Rs. ' are both prefixes it will still work as removeprefix only removes and there is a prefix and doesn't written an error for clarity i have edited my answer – ROOP AMBER Apr 21 '22 at 17:48
  • Test your code, it does not work. Also _after the person asking the question gets the particular cell_ is not the correct way. It should be applied to the Series, not the cell. – Trenton McKinney Apr 21 '22 at 17:52
  • @TrentonMcKinney any of the answers prescribed will not work on a series this is because you will have to implement a for loop to go in every cell of the column 1 by 1 and than remove its prefix – ROOP AMBER Apr 21 '22 at 17:56
  • Your comment is not correct, and demonstrates a lack of understand of vectorized operations and how to work with pandas. – Trenton McKinney Apr 21 '22 at 17:57
  • @TrentonMcKinney I am sorry sir, but I still can't get what you mean to say the questionaire just wants us to tell how to remove prefix and not how to go in each and every column – ROOP AMBER Apr 21 '22 at 17:58
  • `a['Col2'] = a['Col2'].str.removeprefix('"').str.removesuffix('"').str.removeprefix("Rs. ").str.removeprefix("Rs ").astype(int)` is the correct implementation using `removeprefix` and `removesuffix`. Your answer is not correct. – Trenton McKinney Apr 21 '22 at 18:02
  • @TrentonMcKinney You can convert it to int later by b = int(b), if you want – ROOP AMBER Apr 21 '22 at 18:03
1

Or use .str.replace():

a["Col2"] = a["Col2"].str.replace('Rs. ', '').replace('"', '')

Update use replace:

a["Col2"].replace(r"Rs\.?\s+", '', regex=True, inplace=True).astype(int)
Jonathan
  • 748
  • 3
  • 20
1

Given the sample data in the OP, use .replace

a['Col2'] = a['Col2'].replace({'"': ''}, regex=True)
a['Col2'] = a['Col2'].replace({'Rs.': ''}, regex=True)
a['Col2'] = a['Col2'].replace({'Rs': ''}, regex=True)
a['Col2'] = a['Col2'].replace({' ': ''}, regex=True)
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Dan Green
  • 76
  • 4