0

I am working on a topic modelling task and have the topics as strings of the following form:

 topic = 0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"

I want a regex.findall() call to return a list containing only the words, e.g.:

['firstword', 'secondword', 'thirdword', 'fourthword', 'fifthword']

I have tried the following calls:

regex.findall(r'\w+', topic)
regex.findall(r'\D\w+', topic)

but neither of them eliminates the numbers properly. Can someone help me find out what I am doing wrong?

Ajith
Soumya C

4 Answers

3

If topic is the string

topic = '0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'

Then the following regex will return what you need

re.findall('"(.*?)"', topic)

It finds all substrings enclosed in double quotes (").
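As a rough check of why `\w+` from the question falls short, here is a minimal sketch using the standard-library `re` module (the question writes `regex`; the variable names `with_numbers` and `words` are only illustrative). `\w` matches digits as well as letters, so the pieces of each weight are returned too, whereas capturing the text between double quotes returns only the words.

```python
import re

topic = '0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'

# \w matches letters, digits and underscore, so the digits of each weight match too.
with_numbers = re.findall(r'\w+', topic)

# Capture only the text that sits between a pair of double quotes.
words = re.findall(r'"([^"]*)"', topic)

print(with_numbers[:3])
print(words)
```

The pattern `"([^"]*)"` is an alternative to the non-greedy `"(.*?)"`; both should give the same result on this input.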

Iain Shelvington
1

You can try two approaches:

The first, and simpler, is to iterate over the string and keep only the letters:

''.join(letter for letter in topic if letter.isalpha())

Alternatively, you can use a regular expression:

re.sub('[^a-zA-Z]+', '', topic)

This expression keeps only lowercase and uppercase letters. Note that both approaches return a single concatenated string, not a list of separate words.
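If a list of words is wanted rather than one joined string, a small variation is to match runs of letters with `re.findall`. This is only a sketch, and it assumes the words consist solely of ASCII letters (no digits, hyphens or accented characters).

```python
import re

topic = '0.2*"firstword" + 0.2*"secondword" + 0.2*"thirdword" + 0.2*"fourthword" + 0.2*"fifthword"'

# Each maximal run of ASCII letters becomes one element of the list.
words = re.findall(r'[A-Za-z]+', topic)

print(words)
```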

1

I came across this exact problem myself. My solution was:

    import re

    def extract_tokens_from_topic(self, raw_topic):
        # convert list to string; each word appears in single quotes in the result
        raw_topic_string = str(raw_topic)
        return re.findall(r"'(.*?)'", raw_topic_string)

where raw_topic came from raw_topic = lda_model.show_topic(topic_no)
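As far as I know, gensim's `show_topic` returns a list of `(word, probability)` tuples, so the words can probably be read directly without going through a string and a regex. The sketch below uses a hand-written stand-in list with made-up values in place of a trained model, so `raw_topic` here is an assumption about the shape of the data, not real output.

```python
# Stand-in for lda_model.show_topic(topic_no); the values are made up for illustration.
raw_topic = [
    ("firstword", 0.2),
    ("secondword", 0.2),
    ("thirdword", 0.2),
    ("fourthword", 0.2),
    ("fifthword", 0.2),
]

# Keep the first element of each (word, probability) pair.
words = [word for word, _probability in raw_topic]

print(words)
```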

can
-1

Here's one way to do it:

>>> import re

>>> topic = "0.2*firstword" + "0.2*secondword" + "0.2*thirdword" + "0.2*fourthword" + "0.2*fifthword"

>>> re.sub(r'\d\W', ' ', topic).strip().split()
['firstword', 'secondword', 'thirdword', 'fourthword', 'fifthword']
YOLO
  • The probabilities are not inside the double quotes, only the words are, but I got your point anyway. Thanks – Soumya C Jan 05 '20 at 09:57