2

I am counting every word now.

I want to count all of the words, which means I need to remove the punctuation before and after each word.

Can someone help, please?

hanamontana
  • to remove all `\n` characters you can use `"".join(txt.split("\n"))` – Leonardo Scotti Nov 23 '20 at 14:11
  • 2
    you are getting an error because `dt_fr_file("scarlet.txt": str) -> dt[str, int]` isn't valid python! to call the function, simply do `dt_fr_file('scarlet.txt')` – Rafael de Bem Nov 23 '20 at 14:12
  • Also if you want to count the number of elements in a list (here the words) you can use the [collections.Counter](https://docs.python.org/3/library/collections.html#collections.Counter) – ygorg Nov 23 '20 at 14:14
  • 1
    Try replacing `text.split(' ')` with `text.split()` – Stef Nov 23 '20 at 14:15
  • @LeonardoScotti hi, it didn't work. I need to split the text before the for loop. Did you mean to put the code ```"".join(txt.split("\n"))``` after txt = f.read() and remove the rstrip code? If yes, then it didn't work. – hanamontana Nov 23 '20 at 14:16
  • You have to first do what i said and then what you have already done – Leonardo Scotti Nov 23 '20 at 14:21

2 Answers

0

You could use a regex and simplify the whole lot:

import re

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # split on every run of non-word characters
    words = re.split(r'\W+', txt)
    words = {word: words.count(word) for word in set(words)}
    return words

From the docs:

\W Matches any character which is not a word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_]. If the LOCALE flag is used, matches characters which are neither alphanumeric in the current locale nor the underscore.

+ Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

So \W+ will split on every run of characters that is anything other than a to z, A to Z, 0 to 9 and _. As suggested in the comments, this can be "language" sensitive (accented characters, for example). In that case, you can adapt this code to your language by setting

words = re.split(r"[^a-zA-Z0-9_àéèêùç]+", txt)
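A quick sketch of the split plus count in action (the sample sentence is made up; note that in Python 3, `\w`/`\W` are Unicode-aware by default, so accented letters are already kept unless you pass `re.ASCII`):

```python
import re
from collections import Counter

txt = "Come, Watson, come! The game's afoot."

# \W+ leaves an empty string at an edge when the text starts or ends
# with punctuation, so filter those out
words = [w for w in re.split(r'\W+', txt.lower()) if w]
print(words)           # ['come', 'watson', 'come', 'the', 'game', 's', 'afoot']
print(Counter(words))  # 'come' is counted twice, everything else once
```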

EDIT: To use Stef's suggestion, which is indeed faster:

import re
from collections import Counter

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    words = re.split(r'\W+', txt)
    return Counter(words)

EDIT 2: Without any regex or other libraries, but this is not efficient:

def dt_fr_file(file_name):
    with open(file_name) as f:
        txt = f.read()
    # add any other punctuation you need to this set
    split_on = {"'", ","}
    for separator in split_on:
        txt = txt.replace(separator, ' ')
    words = txt.split()
    dict_words = dict()
    for word in words:
        if word in dict_words:
            dict_words[word] += 1
        else:
            dict_words[word] = 1
    return dict_words
tgrandje
  • 1
    Unfortunately, `{word:words.count(word) for word in set(words)}` has a quadratic complexity; more precisely, a complexity proportional to the product of the number of words and the number of distinct words. I suggest using [`collections.Counter`](https://docs.python.org/fr/3/library/collections.html#collections.Counter) instead. – Stef Nov 23 '20 at 14:21
  • @Stef Good point. The machines are so fast these days, it doesn't make much difference anymore, but I'll edit it – tgrandje Nov 23 '20 at 14:26
  • 2
    Also splitting words is no easy task, you can also use a tokenizer from `nltk` for example that accounts for punctuations and languages specificities. `nltk.word_tokenize(my_string)` – ygorg Nov 23 '20 at 14:27
  • @ygorg: didn't know of this one. I have used other NL tools, but there were setbacks for my mother tongue... I'll have to test this one, thanks! I'll let my answer stay as it is, though (NLP opens another (big) door to the question of tokenization); regex is still one "easy" tool to use in this case and will work most of the time. – tgrandje Nov 23 '20 at 14:36
  • I am SO sorry everyone. I'm using an assigned platform to perform the codes and it doesn't have the "translate" attribute to use. – hanamontana Nov 23 '20 at 14:41
  • @hanamontana There is no "translate" in this code, only pure regex :-) – tgrandje Nov 23 '20 at 14:46
  • @tgrandje i'm not allowed to use other methods (those for intermediate Python learner or higher). I used this ```f = open('scarlet2.txt', 'r') txt = f.read() text = txt.lower().strip().replace('"','')``` but I need to remove other punctuation as well so the replace method seems impractical. – hanamontana Nov 23 '20 at 14:48
  • 1
    I don't think you can do anything practical while rejecting regex and the like (regex and collections ARE standard libraries). I just added some code if you want to use no library, but it is not really good (you will have to add any punctuation yourself in the code...) – tgrandje Nov 23 '20 at 14:58
  • I insist that `txt.split()` without any optional arguments to `split` is preferable to `txt.replace(separator, ' ').split(' ')` – Stef Nov 23 '20 at 15:05
  • @Stef: you're right about split() being preferable to split(' '), I just corrected it. But I'll keep the txt.replace(separator, ' '): if you exclude any "external" libraries, you'll have to manage punctuation yourself... – tgrandje Nov 23 '20 at 15:23
0

Here are a few suggestions:

  • Use collections.Counter which was designed specifically for this;
  • Use .strip() instead of .strip(' ') to strip all whitespace, including tabs and newlines, rather than just spaces;
  • Remove punctuation according to this answer.
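A minimal sketch of the last two bullets (the sample strings are made up):

```python
import string

s = '  hello \t\n'
print(repr(s.strip(' ')))  # 'hello \t\n' -- only spaces go; tab and newline survive
print(repr(s.strip()))     # 'hello'      -- all leading/trailing whitespace goes

# str.maketrans('', '', chars) builds a translation table that deletes
# every character in chars; translate applies it to the string
cleaned = "Well, well... isn't this nice?".translate(
    str.maketrans('', '', string.punctuation))
print(cleaned)  # 'Well well isnt this nice'
```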

With all this in mind, the code is only two lines long:

import collections
import string

with open('loremipsum.txt', 'r') as f:
  wordcounts = collections.Counter(f.read().lower().translate(str.maketrans('', '', string.punctuation)).split())

print(wordcounts)
# Counter({'eget': 11, 'vitae': 9, 'ut': 9, 'pellentesque': 9, 'sed': 8, 'pretium': 8, 'eu': 8, 'ipsum': 7, 'donec': 7, 'venenatis': 7, 'in': 7, 'lorem': 6, 'sit': 6, 'amet': 6, 'non': 6, 'a': 6, 'enim': 6, 'vestibulum': 6, 'at': 6, 'id': 5, 'et': 5, 'blandit': 5, 'risus': 5, 'tincidunt': 5, 'nibh': 5, 'vulputate': 5, 'ligula': 5, 'quam': 5, 'porttitor': 5, 'lacus': 5, 'vel': 5, 'dolor': 4, 'consectetur': 4, 'elit': 4, 'tortor': 4, 'malesuada': 4, 'mollis': 4, 'sapien': 4, 'est': 4, 'faucibus': 4, 'integer': 4, 'justo': 4, 'tellus': 4, 'quis': 4, 'purus': 4, 'aliquet': 4, 'posuere': 4, 'nisi': 4, 'euismod': 4, 'tempor': 4, 'cras': 4, 'curabitur': 4, 'placerat': 4, 'vehicula': 4, 'nec': 4, 'suscipit': 4, 'augue': 4, 'dapibus': 4, 'finibus': 3, 'efficitur': 3, 'facilisis': 3, 'eros': 3, 'nulla': 3, 'ullamcorper': 3, 'dui': 3, 'nisl': 3, 'eleifend': 3, 'magna': 3, 'consequat': 3, 'arcu': 3, 'sagittis': 3, 'aliquam': 3, 'sem': 3, 'felis': 3, 'condimentum': 3, 'metus': 3, 'phasellus': 3, 'velit': 3, 'mi': 3, 'congue': 3, 'maecenas': 3, 'gravida': 3, 'viverra': 3, 'cursus': 3, 'nullam': 3, 'molestie': 3, 'odio': 3, 'interdum': 3, 'massa': 3, 'libero': 3, 'etiam': 3, 'accumsan': 3, 'porta': 3, 'adipiscing': 2, 'proin': 2, 'lectus': 2, 'rutrum': 2, 'mauris': 2, 'rhoncus': 2, 'feugiat': 2, 'dictum': 2, 'nunc': 2, 'semper': 2, 'per': 2, 'sollicitudin': 2, 'volutpat': 2, 'leo': 2, 'suspendisse': 2, 'nam': 2, 'hendrerit': 2, 'erat': 2, 'ex': 2, 'laoreet': 2, 'ac': 2, 'imperdiet': 2, 'ante': 2, 'lacinia': 2, 'fringilla': 2, 'morbi': 2, 'varius': 1, 'lobortis': 1, 'pulvinar': 1, 'mattis': 1, 'class': 1, 'aptent': 1, 'taciti': 1, 'sociosqu': 1, 'ad': 1, 'litora': 1, 'torquent': 1, 'conubia': 1, 'nostra': 1, 'inceptos': 1, 'himenaeos': 1, 'iaculis': 1, 'luctus': 1, 'dignissim': 1, 'potenti': 1, 'egestas': 1, 'fusce': 1, 'turpis': 1, 'tempus': 1, 'praesent': 1, 'pharetra': 1, 'vivamus': 1, 'ultrices': 1, 'maximus': 1, 'commodo': 1, 'ultricies': 1, 'elementum': 1, 'fames': 1, 
'primis': 1, 'tristique': 1, 'diam': 1, 'scelerisque': 1})

If you don't like one-liners, you can split the previous code into several lines:

import collections
import string

with open('loremipsum.txt', 'r') as f:
  text = f.read()
  lowertext = text.lower()
  without_punctuation = lowertext.translate(str.maketrans('', '', string.punctuation))
  words = without_punctuation.split()
  wordcounts = collections.Counter(words)

Finally, an alternative: read the file line by line instead of all at once:

import collections
import string

wordcounts = collections.Counter()
with open('loremipsum.txt', 'r') as f:
  for line in f:
    words = line.lower().translate(str.maketrans('', '', string.punctuation)).split()
    wordcounts.update(collections.Counter(words))
Stef
  • i'm having trouble with removing punctuation. the thread you suggested is from 12 years ago. I tried to follow some codes but they didn't work. – hanamontana Nov 23 '20 at 14:35
  • @hanamontana The question is old, but the accepted answer was last edited Mar 5 '19. – Stef Nov 23 '20 at 14:36
  • @hanamontana See also [ygorg's comment about nltk](https://stackoverflow.com/questions/64970028/create-dictionary-from-file/64970201?noredirect=1#comment114863270_64970201) – Stef Nov 23 '20 at 14:40
  • I am SO sorry. I'm using an assigned platform to perform the codes and it doesn't have the "translate" attribute to use. – hanamontana Nov 23 '20 at 14:42