Suppose I have a pandas DataFrame with one row per product name and columns describing each product's features. The text in these columns contains list markers such as 1., 2., 3., ... or a), b), c), ... or (i), (ii), (iii), ... and so on. I want to remove these markers from the DataFrame. I tried:

df.replace(regex=True, inplace=True, to_replace=r'["(i*)"|i*.|(a-zA-Z).|("("a-zA-z")")]', value=r'')

But the code does not work. It deletes every i from the text (for example, consider becomes consder). I can remove a., b., etc. if I list them individually, i.e. to_replace=r'[a.|b.|A.|B.]', but when I give a general pattern it does not work.

How can I remove '(i)', '(ii)', '(iii)' and '(a)', '(A)', 'a.', 'A.' (any letter from A to Z, and one or more i) from a pandas DataFrame using a regex?

Example

INPUT
(i) The cow has four legs. (ii) The cow eats grass. (iii) Cow gives us milk.

OR

a.The cow has four legs. b.The cow eats grass. c.Cow gives us milk.

OUTPUT
The cow has four legs. The cow eats grass. Cow gives us milk.

Wiktor Stribiżew
  • I think you need to escape all the literal `.` characters, as well as the literal parentheses. Also, why use `a-zA-Z` instead of the character class `[a-zA-Z]`? Have you used a tool like regex101 to experiment with your regex? – Green Cloak Guy May 22 '21 at 03:20
  • Off the top of my head, a regex like `r'\(i+\)|\(?[a-zA-Z]\)|[0-9]+\.'` should work for examples `1.`, `a)`, `(a)`, and `(i)` – Green Cloak Guy May 22 '21 at 03:22
  • Perhaps `(?:\(?(?:[vx]?i{1,3}|i?[vx])\)|(?:\(|\b)(?:\d{1,2}|[a-z])[.)]) *`? https://regex101.com/r/6Oyc5Z/2/ – Nick May 22 '21 at 04:10
  • I did not find any lead, so I had to convert the DataFrame into a string to get the work done: `import re; string = "a. at what time? i) 8pm, i. 9pm, B. 10pm."; match = re.sub("(a\\.|b\\.|B\\.|i+\\)|i+\\.)", "", string); print(match)`, which prints `at what time? 8pm, 9pm, 10pm.` Is there a shorter way to write it? – souvik datta May 22 '21 at 09:30

2 Answers

Would you please try:

df.replace(regex=True, inplace=True, to_replace=r'^\(?(?:[ivxlcdm]+|[a-zA-Z]+|[0-9]+)[).]', value='')

Input:

(i) The cow has four legs.
(ii) The cow eats grass.
(iii) Cow gives us milk.
a.The cow has four legs.
b.The cow eats grass.
c.Cow gives us milk.
1.The cow has four legs.
2.The cow eats grass.
3.Cow gives us milk.
a)The cow has four legs.
b)The cow eats grass.
c)Cow gives us milk.

Output:

The cow has four legs.
The cow eats grass.
Cow gives us milk.
The cow has four legs.
The cow eats grass.
Cow gives us milk.
The cow has four legs.
The cow eats grass.
Cow gives us milk.
The cow has four legs.
The cow eats grass.
Cow gives us milk.

Explanation of the regex ^\(?(?:[ivxlcdm]+|[a-zA-Z]+|[0-9]+)[).]:

  • ^ anchors the match at the start of the string.
  • \(? matches zero or one left parenthesis.
  • (?:[ivxlcdm]+|[a-zA-Z]+|[0-9]+) is a non-capturing group that matches any one of:
    • [ivxlcdm]+, one or more lowercase Roman-numeral letters.
    • [a-zA-Z]+, one or more ASCII letters (this already covers the Roman-numeral letters).
    • [0-9]+, one or more digits.
  • [).] matches a right parenthesis or a dot.
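
As a rough illustration of the pattern explained above, here is a minimal sketch applied to a small DataFrame. The column name `feature` and the sample rows are made up for the example, and the trailing `\s*` is an addition (not part of the answer's pattern) that also swallows the space left behind after a marker such as `(i) `.

```python
import pandas as pd

# Hypothetical sample data: one numbered item per cell.
df = pd.DataFrame({
    "feature": [
        "(i) The cow has four legs.",
        "a.The cow eats grass.",
        "3.Cow gives us milk.",
        "b)The cow has four legs.",
    ]
})

# Anchored pattern from the answer; the trailing \s* is an assumption added
# here so that the space after "(i)" is removed together with the marker.
pattern = r'^\(?(?:[ivxlcdm]+|[a-zA-Z]+|[0-9]+)[).]\s*'

# Without inplace=True, replace() returns a new DataFrame and leaves df untouched.
cleaned = df.replace(to_replace=pattern, value='', regex=True)

print(cleaned["feature"].tolist())
# ['The cow has four legs.', 'The cow eats grass.', 'Cow gives us milk.', 'The cow has four legs.']
```

Because of the `^` anchor, only a marker at the very start of each cell is removed, so this form suits data with one item per cell. Note also that `[a-zA-Z]+` accepts any run of letters, so a cell that begins with a short word followed by a dot (for example `No.`) would lose that word as well.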
tshiono
  • r'[i]+' removes all i's from the sentence, but I want to remove only patterns like (i), (ii), and so on. – souvik datta May 22 '21 at 07:57
  • It does not. My regex is anchored with `^` and matches only within parens or before `)` or `.`. Have you actually tested? – tshiono May 22 '21 at 10:06
If the parenthesized marker can only consist of one or more i characters (so no full Roman numerals), you might use:

\(?i+\)|\b(?:[A-Za-z]|\d+)\.

The pattern matches:

  • \(?i+\) Match an optional (, then one or more i characters, then a )
  • | Or
  • \b A word boundary to prevent a partial match
  • (?: Non capture group
    • [A-Za-z] Match a single char A-Za-z
    • | Or
    • \d+ Match 1+ digits
  • ) Close non capture group
  • \. Match a dot
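
To check how this pattern behaves on markers that appear in the middle of a string, here is a small sketch using Python's `re` module. The helper name `strip_markers` is made up for the example, and the outer group with a trailing `\s*` is an addition to the answer's pattern so that the space after a marker is removed too. Since pandas hands regex replacement to Python's `re` engine, the same pattern string should also be usable as `to_replace` in `df.replace(..., regex=True)`.

```python
import re

# Pattern from the answer, wrapped in a non-capturing group; the trailing \s*
# is an assumption added here to absorb whitespace after each marker.
MARKER = re.compile(r'(?:\(?i+\)|\b(?:[A-Za-z]|\d+)\.)\s*')

def strip_markers(text: str) -> str:
    """Remove markers such as (i), ii), a., B. or 12. anywhere in the text."""
    return MARKER.sub('', text)

roman = "(i) The cow has four legs. (ii) The cow eats grass. (iii) Cow gives us milk."
letters = "a.The cow has four legs. b.The cow eats grass. c.Cow gives us milk."

print(strip_markers(roman))
# The cow has four legs. The cow eats grass. Cow gives us milk.
print(strip_markers(letters))
# The cow has four legs. The cow eats grass. Cow gives us milk.
print(strip_markers("consider"))
# consider
```

The word boundary `\b` is what keeps sentence endings such as `legs.` intact, because the final `s` is preceded by another letter. A sentence that ends in a standalone single letter or number (for example `Vitamin A.`) would still be matched, so it is worth testing the pattern against real data before replacing in place.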

Regex demo

If you want to match roman numerals, you can see this post.

The fourth bird