I have a string in Python:

string = "Hello I am a 21 !string. In section 3.2.F.1.2 we covered 1topic X. On the oth1er hand, in section 1.2.F.1.1 we covered Y. Lastly, in section 1.1.F.3.2 we 23 covered Z."

I need to remove the random numbers and punctuation from the text such that:

"a 21 !string" --------> "a string" and...

"covered 1topic X." ---------> "covered topic x"

My final string should then be:

filtered = "hello i am a string in section 3.2.F.1.2 we covered topic x on the other hand in section 1.2.F.1.1 we covered y lastly in section 1.1.F.3.2 we covered z"

...such that the codes "3.2.F.1.2", "1.2.F.1.1", and "1.1.F.3.2" were not affected by this.

I was able to write one regex that matches the codes and another for the characters I want to remove:

regex_codes = r"[\d\.]{1,4}F[\.\d]{1,4}"

all_nums_punct = r"""[0-9 _.,!"'/$]*"""

What I cannot figure out is how to select and remove all numbers and punctuation (all_nums_punct) except where they are part of one of these codes (regex_codes).

I tried a negative lookahead, based on a previous Stack Overflow question, to ignore everything that starts with one of my codes, but my pattern does not match anything.

Pythoner
  • So you want to remove all standalone numbers? Then you won't be able to have sentences such as `Hello I am 21 years old.` – John Gordon May 18 '20 at 21:04
  • `all_nums_punct` includes a space character. Are you sure you want to get rid of all spaces? – DarrylG May 18 '20 at 21:11
  • Well, if your regexps worked, you could use `re.sub(r'''([.\d]{1,4}F[.\d]{1,4})|[0-9_.,!"'/$]''', r'\1', text)`. – Wiktor Stribiżew May 18 '20 at 21:42
  • @JohnGordon That is correct, yes! – Pythoner May 19 '20 at 01:12
  • @WiktorStribiżew Can you explain what is happening here please? I understand you are substituting the expression for another. What I do not understand is the use of | separating the two, and the use of r'\1'. – Pythoner May 19 '20 at 01:14
  • @DarrylG Thanks for catching that, no, I do not want to remove the space characters. – Pythoner May 19 '20 at 01:14

1 Answer

Using the regex package from the PyPI repository:

import regex

string = "Hello I am a 21 !string. In section 3.2.F.1.2 we covered 1topic X. On the oth1er hand, in section 1.2.F.1.1 we covered Y. Lastly, in section 1.1.F.3.2 we 23 covered Z."
string = regex.sub(r'''[\d\.]{1,4}F[\.\d]{1,4}(*SKIP)(*FAIL)|[0-9_.,!"'/$]''', '', string)
print(string)

Prints:

Hello I am a  string In section 3.2.F.1.2 we covered topic X On the other hand in section 1.2.F.1.1 we covered Y Lastly in section 1.1.F.3.2 we  covered Z

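The pattern has two alternatives: your regex_codes expression, or one of your all_nums_punct characters (without the space character). When the first alternative matches a code, (*SKIP)(*FAIL) discards that match and makes the engine resume searching after the code instead of retrying inside it, so the digits and periods of the code are never offered to the second alternative. Anything the second alternative matches is replaced with an empty string.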

The result may contain runs of consecutive spaces, so you will need a second substitution to collapse each run into a single space:

import regex

string = "Hello I am a 21 !string. In section 3.2.F.1.2 we covered 1topic X. On the oth1er hand, in section 1.2.F.1.1 we covered Y. Lastly, in section 1.1.F.3.2 we 23 covered Z."
string = regex.sub(r'''[\d\.]{1,4}F[\.\d]{1,4}(*SKIP)(*FAIL)|[0-9_.,!"'/$]''', '', string)
string = regex.sub(r' +', ' ', string)
print(string)

Prints:

Hello I am a string In section 3.2.F.1.2 we covered topic X On the other hand in section 1.2.F.1.1 we covered Y Lastly in section 1.1.F.3.2 we covered Z
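
For contrast, here is a minimal sketch of what the first alternative protects against. It uses only the standard-library re module and a shortened sample string (an assumption for illustration, not the full string from the question): deleting the unwanted characters with no protection strips the digits and periods out of the code as well.

```python
import re

# Shortened sample, for illustration only.
text = "In section 3.2.F.1.2 we covered 1topic X."

# Delete the unwanted characters with no protection for the section codes.
naive = re.sub(r"""[0-9_.,!"'/$]""", "", text)
print(naive)  # In section F we covered topic X
```

Only the letter F survives from the code, which is why the code pattern has to be tried before the single-character alternative.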

Update

I will try to answer the question you posed to @WiktorStribiżew concerning how his solution below worked:

re.sub(r"""([.\d]{1,4}F[.\d]{1,4})|[0-9_.,!"'/$]""", r'\1', string)

Whatever the regular expression matches is replaced by r'\1', that is, by the contents of capture group 1. If the match is a regex_codes code, capture group 1 holds that code, so the code is replaced with itself and nothing changes. If instead the match is one of the characters you wish to delete, capture group 1 does not participate in the match and (in Python 3.5 and later) expands to an empty string, so the character is removed. This method does not require the regex package. It, too, leaves runs of consecutive spaces, which you will probably want to collapse as shown above.
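
Put together, the capture-group approach might look like the sketch below. It uses only the standard-library re module; the function name strip_except_codes is hypothetical and chosen just for this example, and the trailing strip() is an extra step not present in the code above.

```python
import re

# Alternative 1 (captured): a section code such as 3.2.F.1.2, which is kept.
# Alternative 2 (not captured): a single character to delete.
PATTERN = re.compile(r"""([.\d]{1,4}F[.\d]{1,4})|[0-9_.,!"'/$]""")


def strip_except_codes(text):
    """Hypothetical helper: remove digits/punctuation but keep section codes."""
    # r"\1" re-inserts group 1; when group 1 did not participate in the
    # match it expands to an empty string (Python 3.5+).
    cleaned = PATTERN.sub(r"\1", text)
    # Collapse the runs of spaces left behind by the deletions.
    return re.sub(r" +", " ", cleaned).strip()


string = ("Hello I am a 21 !string. In section 3.2.F.1.2 we covered 1topic X. "
          "On the oth1er hand, in section 1.2.F.1.1 we covered Y. "
          "Lastly, in section 1.1.F.3.2 we 23 covered Z.")
print(strip_except_codes(string))
```

Because alternation is tried left to right at each position, the code alternative gets the first chance to consume a code in full before the single-character alternative can touch any of its digits or periods.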

Booboo
  • Thank you! Can you explain the (*SKIP)(*FAIL) functionality please? What is happening in the backend here? – Pythoner May 19 '20 at 01:16
  • That question has been asked and answered [here](https://stackoverflow.com/questions/24534782/how-do-skip-or-f-work-on-regex). – Booboo May 19 '20 at 01:42
  • I will also update my answer and try to explain the solution posed by @WiktorStribiżew, although I am sure he could express it better (the explanation is probably too long for a comment). – Booboo May 19 '20 at 01:59