How can I remove point numbers such as '(i)', '(ii)', '(iii)' from answers in a pandas DataFrame with regex?

Question

Suppose I have a pandas DataFrame with many rows of product names and columns describing their respective features. The entries carry a numbering scheme such as 1., 2., 3., ... or a), b), c), ... or (i), (ii), (iii), ... and so on. I want to remove these markers from the DataFrame. I tried:

df.replace(regex=True, inplace=True, to_replace=r'["(i*)"|i*.|(a-zA-Z).|("("a-zA-z")")]', value=r'')

But the code does not work. It deletes every i from the answers, e.g. "consider" becomes "consder". I can remove a., b., etc. if I list them individually, i.e. to_replace=r'[a.|b.|A.|B.]', but when I give a general pattern it does not work.

How can I remove '(i)', '(ii)', '(iii)' and '(a)', '(A)', 'a.', 'A.' (any letter A-Z, and one or more i's) from a pandas DataFrame with regex?

Example

INPUT
(i) The cow has four legs. (ii) The cow eats grass. (iii) Cow gives us milk.

OR

a.The cow has four legs. b.The cow eats grass. c.Cow gives us milk.

OUTPUT
The cow has four legs. The cow eats grass. Cow gives us milk.

I think you need to use escapes for all the `.` literals, as well as for the literal parentheses. Also, why using `a-zA-Z` instead of the group `[a-zA-Z]`? Have you used a tool like regex101 to experiment with your regex? — Green Cloak Guy, May 22 '21 at 03:20
Off the top of my head, a regex like `r'\(i+\)|\(?[a-zA-Z]\)|[0-9]+\.'` should work for examples `1.`, `a)`, `(a)`, and `(i)` — Green Cloak Guy, May 22 '21 at 03:22
Perhaps `(?:\(?(?:[vx]?i{1,3}|i?[vx])\)|(?:\(|\b)(?:\d{1,2}|[a-z])[.)]) *`? https://regex101.com/r/6Oyc5Z/2/ — Nick, May 22 '21 at 04:10
I did not find any lead, so I had to convert the DataFrame to a string to get the work done: `import re; string = "a. at what time? i) 8pm, i. 9pm, B. 10pm."; match = re.sub("(a\\.|b\\.|B\\.|i+\\)|i+\\.)", "", string); print(match)` gives `at what time? 8pm, 9pm, 10pm.` How can I write this more compactly? — souvik datta, May 22 '21 at 09:30

tshiono · Accepted Answer · 2021-05-22T11:26:12.820

Would you please try:

df.replace(regex=True, inplace=True, to_replace=r'^\(?(?:[ivxlcdm]+|[a-zA-Z]+|[0-9]+)[).]', value='')

Input:

(i) The cow has four legs.
(ii) The cow eats grass.
(iii) Cow gives us milk.
a.The cow has four legs.
b.The cow eats grass.
c.Cow gives us milk.
1.The cow has four legs.
2.The cow eats grass.
3.Cow gives us milk.
a)The cow has four legs.
b)The cow eats grass.
c)Cow gives us milk.

Output (the first three rows keep the space that followed the marker):

 The cow has four legs.
 The cow eats grass.
 Cow gives us milk.
The cow has four legs.
The cow eats grass.
Cow gives us milk.
The cow has four legs.
The cow eats grass.
Cow gives us milk.
The cow has four legs.
The cow eats grass.
Cow gives us milk.

Explanation of the regex ^\(?(?:[ivxlcdm]+|[a-zA-Z]+|[0-9]+)[).]:

^ anchors the match at the start of the string.
\(? matches zero or one left parenthesis.
(?:[ivxlcdm]+|[a-zA-Z]+|[0-9]+) is a non-capturing group that matches any one of:
- [ivxlcdm]+ which matches one or more lowercase Roman numeral letters.
- [a-zA-Z]+ which matches one or more ASCII letters.
- [0-9]+ which matches one or more digits.
[).] matches a right parenthesis or a dot.
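
To see what the pattern does without building a DataFrame, here is a minimal sketch using Python's `re` module directly. I am assuming pandas applies a `regex=True` replacement to each string cell in the same way; `strip_leading_marker` and the sample rows are hypothetical names and data for illustration only.

```python
import re

# Pattern copied unchanged from the answer above.
LEADING_MARKER = re.compile(r'^\(?(?:[ivxlcdm]+|[a-zA-Z]+|[0-9]+)[).]')

def strip_leading_marker(text):
    """Remove one list marker at the very start of text (hypothetical helper)."""
    return LEADING_MARKER.sub('', text)

rows = [
    "(i) The cow has four legs.",
    "a.The cow eats grass.",
    "3.Cow gives us milk.",
    "b)The cow eats grass.",
]
cleaned = [strip_leading_marker(row) for row in rows]
for row in cleaned:
    print(repr(row))

# Two limits that follow from the pattern itself:
# 1. Because of ^, only a marker at the start of the cell is removed.
print(repr(strip_leading_marker("(i) A (ii) B")))
# 2. [a-zA-Z]+ accepts a whole word, so a leading word plus dot is removed too.
print(repr(strip_leading_marker("Yes. It does.")))
```

So the pattern fits data with one item per cell. If a cell holds several numbered items, or can begin with an ordinary one-word sentence, an unanchored and stricter pattern is probably the safer choice.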

r'[i]+' removes all i's from the sentence but I want to remove only (i) or (ii..) this pattern. — souvik datta, May 22 '21 at 07:57
It does not. My regex is anchored with `^` and matches only within parens or before `)` or `.`. Have you actually tested? — tshiono, May 22 '21 at 10:06

score 1 · Answer 2 · answered May 22 '21 at 09:45

If the Roman-style marker can only be one or more i characters (so no full Roman numerals), you might use:

\(?i+\)|\b(?:[A-Za-z]|\d+)\.

The pattern matches:

\(?i+\) Match an optional (, then 1+ times an i char and a )
| Or
\b A word boundary to prevent a partial match
(?: Non capture group
- [A-Za-z] Match a single char A-Za-z
- | Or
- \d+ Match 1+ digits
) Close non capture group
\. Match a dot
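
As a rough check, the pattern can be run with `re.sub` on the two inputs from the question. This is only a sketch: the outer group and the trailing `\s*` are my additions (an assumption, so that the space after each marker is removed as well), and `strip_markers` is a hypothetical helper name.

```python
import re

# The answer's pattern wrapped in a non-capturing group.
# The trailing \s* is an added assumption, not part of the answer.
MARKER = re.compile(r'(?:\(?i+\)|\b(?:[A-Za-z]|\d+)\.)\s*')

def strip_markers(text):
    """Remove every marker found anywhere in text (hypothetical helper)."""
    return MARKER.sub('', text)

roman = "(i) The cow has four legs. (ii) The cow eats grass. (iii) Cow gives us milk."
letters = "a.The cow has four legs. b.The cow eats grass. c.Cow gives us milk."

print(strip_markers(roman))
print(strip_markers(letters))

# Caveat: a number followed by a dot is treated as a marker, decimals included.
print(strip_markers("at 3.5 m"))
```

Both sample inputs come out as the sentence requested in the question. The same pattern string (`MARKER.pattern`) could be passed as `to_replace` in the `df.replace(regex=True, ...)` call from the question. Note that decimals and single-letter abbreviations such as "e.g." are also affected, so it is worth checking against real data first.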

If you want to match roman numerals, you can see this post.

The fourth bird

@Nick : thank you...it is a great help for me. – souvik datta May 22 '21 at 15:55
@souvikdatta Let me say "you are welcome" in the name of Nick and tshiono – The fourth bird May 22 '21 at 17:30