Extracting only words out of a mixed string in Python

Question

I am working on a topic modelling task and have the unknown topics in the following form:

 topic = 0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"

I want a regex.findall() call to return a list containing only the words, e.g.:

['firstword', 'secondword', 'thirdword', 'fourthword', 'fifthword']

I have tried the following calls:

regex.findall(r'\w+', topic) and
regex.findall(r'\D\w+', topic)

but neither of them eliminates the numbers properly. Can someone help me find out what I am doing wrong?

@SoumyaChakraborty can you share the actual value of the `topic` string? is it `'0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'`? — Iain Shelvington, Jan 05 '20 at 09:40
If I type print(topic) it displays : 0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword" — Soumya C, Jan 05 '20 at 09:43

score 3 · Accepted Answer · answered Jan 05 '20 at 09:46

If topic is the string

topic = '0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'

then the following regex will return what you need:

re.findall('"(.*?)"', topic)

It finds all substrings that are enclosed in double quotes ("). The non-greedy .*? stops at the nearest closing quote, so each word is captured separately.
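
For reference, here is a minimal sketch that runs the two patterns from the question next to the quote-based one. It assumes topic is the plain string shown above and uses the standard re module (the question's regex name is presumably an alias or a module with the same findall interface). Since \w also matches digits, and \D\w+ only requires the first character to be a non-digit, both original attempts let pieces of the weights through.

```python
import re

topic = ('0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" '
         '+ 0.2*"fourthword" + 0.2*"fifthword"')

# \w matches letters, digits and underscore, so the digits of each
# weight come back as separate tokens: ['0', '2', 'firstword', ...]
with_digits = re.findall(r'\w+', topic)

# \D consumes one non-digit character (here '.', '"' or a space) and
# \w+ then matches whatever word characters follow, digits included.
with_fragments = re.findall(r'\D\w+', topic)

# Capture only what sits between a pair of double quotes; the
# non-greedy .*? stops at the nearest closing quote.
words = re.findall(r'"(.*?)"', topic)
print(words)
```

The capture group matters here: when a pattern contains exactly one group, findall returns the group's contents rather than the whole match, which is why the quotes themselves are not in the result.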

score 1 · Answer 2 · answered Jan 05 '20 at 09:48

You can do this in two ways:

The first, and simpler, is to iterate over the string and keep only the letters:

''.join(letter for letter in topic if letter.isalpha())

Otherwise you can use a regular expression:

re.sub('[^a-zA-Z]+', '', topic)

This expression keeps only lowercase and uppercase letters. Note that both variants return one concatenated string rather than a list of words.
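
As a quick check of what these two snippets return on the question's string, the sketch below runs both and then shows one possible adjustment that yields a list instead. The findall variant is not part of this answer; it is only an illustration that reuses the same character class.

```python
import re

topic = ('0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" '
         '+ 0.2*"fourthword" + 0.2*"fifthword"')

# Both approaches drop every non-letter, including the separators,
# so the words end up glued together in a single string.
joined = ''.join(letter for letter in topic if letter.isalpha())
stripped = re.sub('[^a-zA-Z]+', '', topic)

# Illustrative variant: collect each run of letters as its own item.
words = re.findall(r'[a-zA-Z]+', topic)
print(words)
```

This works for the sample because the weights contain no letters; a weight written in scientific notation such as 2e-05 would contribute a stray 'e', which the quote-based pattern avoids.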

score 1 · Answer 3 · answered Jan 05 '20 at 09:52

I came across this exact problem myself. My solution was:

    import re

    def extract_tokens_from_topic(raw_topic):
        # Convert the list to its string form; each word then appears
        # between single quotes and can be captured by the regex.
        raw_topic_string = str(raw_topic)
        return re.findall(r"'(.*?)'", raw_topic_string)

where raw_topic comes from raw_topic = lda_model.show_topic(topic_no).
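
If raw_topic is a list of (word, probability) pairs, which is what gensim's show_topic returns and is assumed here since the answer does not name the library, the words can also be read from the pairs directly. The sketch below uses a hand-written stand-in list rather than a real model and compares both routes.

```python
import re

# Hypothetical stand-in for lda_model.show_topic(topic_no);
# assumed to be a list of (word, probability) tuples.
raw_topic = [
    ("firstword", 0.2),
    ("secondword", 0.2),
    ("thirdword", 0.2),
    ("fourthword", 0.2),
    ("fifthword", 0.2),
]

# Route 1: unpack each pair and keep only the word.
words_direct = [word for word, _probability in raw_topic]

# Route 2: the string-and-regex approach from the answer above.
words_regex = re.findall(r"'(.*?)'", str(raw_topic))

print(words_direct)
```

Unpacking avoids depending on how Python prints the list, for example a word containing an apostrophe is printed with double quotes and would be missed by the single-quote pattern.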

score -1 · Answer 4 · answered Jan 05 '20 at 09:48

Here's one way to do it:

>>> import re

>>> topic = "0.2*firstword" + "0.2*secondword" + "0.2*thirdword" + "0.2*fourthword" + "0.2*fifthword"

>>> re.sub(r'\d\W', ' ', topic).split()
['firstword', 'secondword', 'thirdword', 'fourthword', 'fifthword']

answered by YOLO

The probabilities are not inside the double quotes, only the words themselves are, but I got your point anyway. Thanks – Soumya C, Jan 05 '20 at 09:57
