For a string like:

"The temperature today is 30.8 degrees celsius."

How do I remove the full stop, especially for a longer string involving multiple sentences and decimal points?

I do know that similar questions have been posted, but they were in JavaScript or C#. As a beginner in programming as a whole, I don't really understand the symbols they used or how to translate them to Python.

CipherBot
  • 1
    I am not a Python master, but this regular expression should help: `([^\d]\.)`. It finds every dot that does not follow a digit, so it skips dots inside numbers like 98.456 but matches dots inside text like ss.ss. Here is a Stack Overflow post on how to use a regex to replace strings: https://stackoverflow.com/questions/5658369/how-to-input-a-regex-in-string-replace – Rakesh Mehta Apr 06 '20 at 13:03
  • 2
    As @RakeshMehta mentioned, regex is probably the way you want to go. Just keep in mind that context sometimes matters. For example, in the sentence `Today I am 30.8 days after today is my birthday.`, the `30.8` would qualify as a decimal in everything except context. Granted, there should be a space before the 8, but if your program accepts any input, nothing prevents this from being parsed as a decimal. – Axe319 Apr 06 '20 at 13:12

1 Answer

One quick solution could indeed be a regular expression, as suggested in the comments, if you can afford to look through all your data and see which simple rule would be sufficient.
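
For reference, here is a minimal sketch of what such a rule might look like in Python. It is a variant of the pattern from the comments, not the exact same one: it assumes that a dot sitting between two digits is a decimal point and that every other dot can be dropped. The names `NON_DECIMAL_DOT` and `remove_full_stops` are made up for this illustration.

```python
import re

# Matches a dot that is either not preceded by a digit or not followed by one.
# A dot with a digit on both sides (as in 30.8) is left alone.
NON_DECIMAL_DOT = re.compile(r"(?<!\d)\.|\.(?!\d)")

def remove_full_stops(text):
    """Remove every dot that is not a decimal point between two digits."""
    return NON_DECIMAL_DOT.sub("", text)

text = "The temperature today is 30.8 degrees celsius. Yesterday it was 27.1 degrees."
print(remove_full_stops(text))
# The temperature today is 30.8 degrees celsius Yesterday it was 27.1 degrees
```

Note that this simple rule also strips the dots from abbreviations such as `A.M.`, so it only fits data where that is acceptable.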

If you have a lot of variety in your data, take advantage of a proxy task: sentence tokenization. In fact, if you manage to split sentences, you're basically done.

For that, don't reinvent the wheel: use one of the available sentence tokenizers:

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer                                                                               
>>> tokenizer = PunktSentenceTokenizer()   
>>> sentences = tokenizer.tokenize("The temperature today is 30.8 degrees celsius. However yesterday at 12:00 A.M., M. John said it was 27.1 degrees.") 
>>> print(sentences)
['The temperature today is 30.8 degrees celsius.',
 'However yesterday at 12:00 A.M., M. John said it was 27.1 degrees.']

Getting rid of full stops becomes very easy: just remove the final dot if there's one:

>>> print([s[:-1] if s.endswith(".") else s for s in sentences])
['The temperature today is 30.8 degrees celsius', 
 'However yesterday at 12:00 A.M., M. John said it was 27.1 degrees']
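
If a single string is needed at the end rather than a list, the cleaned sentences can be joined back together. The sketch below assumes the sentence list shown above is already available (it is hard-coded here so the snippet runs without NLTK), and `strip_final_dot` is a hypothetical helper name.

```python
def strip_final_dot(sentence):
    """Return the sentence without its trailing dot, if it has one."""
    return sentence[:-1] if sentence.endswith(".") else sentence

# Assumed input: the list produced by the sentence tokenizer above.
sentences = [
    "The temperature today is 30.8 degrees celsius.",
    "However yesterday at 12:00 A.M., M. John said it was 27.1 degrees.",
]

# Join with a single space; the original spacing between sentences is not preserved.
cleaned = " ".join(strip_final_dot(s) for s in sentences)
print(cleaned)
```

Decimal points and the dots in abbreviations stay untouched because only the last character of each sentence is examined.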

Hope that helps.

arnaud