Get every possible word from a text

Question

I have a text like this:

text = "renoncent au développement. Au lieu de cela,elles s'attaquent à la jugulaire: investir dans un bien immobilier en exploitation qui génère des bénéfices.Avant d'investir, donnée s'est comportée en tant que grand promoteur. Pour déterminer si un projet 'offre potentiel' de profit réaliste,  pesez les antécédents de la et l'équilibre risque récompense potentiel de tout nouveau projet majeur. Souvent, qui cherche une approche intermédiaire formera un partenariat ou une coentreprise avec une entreprise qui est déjà sur le terrain et qui réalise des profits."

I want to build a list from this text that contains every word in the text.

What is your question? This isn't a discussion forum or tutorial. Please take the [tour] and take the time to read [ask] and the other links found on that page. — wwii, May 24 '21 at 18:03
Why would you want to work with Pandas? You do not need it in this case. — Bas, May 24 '21 at 18:11
`words = []` `[words.append(word) for word in text.replace(',','').split(" ")]` `words ` — claudius, May 24 '21 at 18:26

Yeshwanth N · Accepted Answer · 2021-05-24T18:10:08.677

2

You can add each word to a set so that there won't be any duplicates, and strip the commas if you don't need them:

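words = set()
for word in text.split(" "):
    # strip commas from each token before adding it to the set
    words.add(word.replace(',', ''))
# a token that was only a comma, or came from a double space, ends up as ''
words.discard('')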
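The snippet above only deals with commas, and the sample text also has words joined by punctuation without a space (for example "cela,elles"). A minimal sketch of one way to handle that, assuming it is acceptable to treat every ASCII punctuation character as a separator; the function name `unique_words` is made up for illustration. Note that this also splits on apostrophes, so "s'attaquent" becomes "s" and "attaquent", which may or may not be what you want.

```python
import string

def unique_words(text):
    # Map every ASCII punctuation character to a space so that
    # "cela,elles" becomes two words instead of one glued token.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split() with no argument splits on any run of whitespace
    # and never yields empty strings.
    return set(text.translate(table).split())

sample = "Au lieu de cela,elles investir. Avant,  investir"
print(unique_words(sample))
```

Since the result is a set, the order of the printed words is not guaranteed.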

edited May 24 '21 at 18:10

answered May 24 '21 at 18:04


Did you test that? Are there any commas left in the set? – wwii May 24 '21 at 18:05

corrected it now ! – Yeshwanth N May 24 '21 at 18:10
thanks, 2 min to accept it :) – chikabala May 24 '21 at 18:13

score 1 · Answer 2 · answered May 24 '21 at 18:06

You can strip ',' from each word while adding it to the list. You can also use the OrderedDict class from the collections module to remove duplicates while keeping the original word order.

text = "Conscious of its spiritual and moral heritage, the Union is founded on the indivisible, universal values of human dignity, freedom, equality and solidarity; it is based on the principles of democracy and the rule of law. It places the individual at the heart of its activities, by establishing the citizenship of the Union and by creating an area of freedom, security and justice."
from collections import OrderedDict

words = []
for word in text.split(" "):
    words.append(word.strip(","))  # remove ',' from both ends of the word
unique_words = list(OrderedDict.fromkeys(words))  # drop duplicates, keep first-seen order
print(unique_words)
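On Python 3.7 and later, a plain dict also keeps insertion order, so the same deduplication can be sketched without OrderedDict. The helper name `ordered_unique` and the short sample string are only for illustration.

```python
def ordered_unique(words):
    # dict keys are unique and keep insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(words))

sample = "freedom, equality and freedom, security and justice."
words = [w.strip(",") for w in sample.split()]
print(ordered_unique(words))
# ['freedom', 'equality', 'and', 'security', 'justice.']
```

The trailing period on 'justice.' remains because only commas are stripped here.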

score 1 · Answer 3 · answered May 24 '21 at 18:07

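This is not the most efficient approach, because every `not in` check scans the whole results list, but it works using only lists and keeps the words in their original order.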

text = "Conscious of its spiritual and moral heritage, the Union is founded on the indivisible, universal values of human dignity, freedom, equality and solidarity; it is based on the principles of democracy and the rule of law. It places the individual at the heart of its activities, by establishing the citizenship of the Union and by creating an area of freedom, security and justice."

def get_unique_words(text):
    # converts all alphabetical characters to lower
    lower_text = text.lower()
    # splits string on space character 
    split_text = lower_text.split(' ')

    # empty list to populate unique words
    results_list = []
    # iterate over the list
    for word in split_text:
        # check to see if value is already in results lists
        if word not in results_list:
            # append the word if it is unique
            results_list.append(word)
    return results_list

results = get_unique_words(text)

print(results)

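This prints the following (note that punctuation stays attached to the words, e.g. 'heritage,' and 'law.'):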

['conscious', 'of', 'its', 'spiritual', 'and', 'moral', 'heritage,', 'the', 'union', 'is', 'founded', 'on', 'indivisible,', 'universal', 'values', 'human', 'dignity,', 'freedom,', 'equality', 'solidarity;', 'it', 'based', 'principles', 'democracy', 'rule', 'law.', 'places', 'individual', 'at', 'heart', 'activities,', 'by', 'establishing', 'citizenship', 'creating', 'an', 'area', 'security', 'justice.']
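If the list lookups become a bottleneck on longer texts, one possible variation is to track already-seen words in a set while still collecting results in a list. This is only a sketch: the name `get_unique_words_fast` is hypothetical, and stripping `string.punctuation` from both ends of each token is an assumption about what counts as a word.

```python
import string

def get_unique_words_fast(text):
    seen = set()    # membership checks are O(1) on average
    results = []    # keeps words in first-seen order
    for token in text.lower().split():
        # removes leading/trailing ASCII punctuation only, not inner characters
        word = token.strip(string.punctuation)
        if word and word not in seen:
            seen.add(word)
            results.append(word)
    return results

print(get_unique_words_fast("Freedom, equality and solidarity; freedom and law."))
# ['freedom', 'equality', 'and', 'solidarity', 'law']
```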

Bas · Answer 4 · 2021-05-24T18:16:17.910

1

list(set(text.split(" ")))

This version also removes the commas, but it gets a bit unreadable:

list(set(''.join(text.split(",")).split(" ")))
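A more readable alternative to chaining split and join might be a regular expression. The sketch below assumes that a "word" is any run of letters, digits or underscores; `words_regex` is an illustrative name. As with the other punctuation-based approaches, an elided form such as "d'investir" is split into "d" and "investir".

```python
import re

def words_regex(text):
    # \w+ matches runs of word characters; for str patterns in Python 3
    # this is Unicode-aware, so accented letters such as "é" are included.
    return list(set(re.findall(r"\w+", text)))

sample = "bénéfices.Avant d'investir, donnée"
print(sorted(words_regex(sample)))
```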

edited May 24 '21 at 18:16

answered May 24 '21 at 18:07


Please provide in code format. – May 24 '21 at 18:08
i added text.split(" ").strip(",") to get rid of commas, thank you – chikabala May 24 '21 at 18:20
