Python - Splitting words in txt

Question

I wanted to write a program that splits a txt file into words and returns a list of those words without any repetitions. I converted my PDF book to txt and ran my program on it, but it failed completely. I have no idea what I've done wrong. Here's my code:

def split(file):
    lines = open(file, 'rU').readlines()
    words = []
    word = ''
    for line in lines:
        for letter in line:
            if letter not in [' ', '\n', '.', ',']:
                word += letter
            elif letter in [' ', '\n', '.', ',']:
                if word not in words:
                    words.append(word)
                    word = ''

    words.sort()
    return words


for word in split('AKiss.txt'):
    print(word, end=' ')

I also attached AKiss.txt and the original PDF in case they are useful.

PDF - http://1drv.ms/b/s!AtZrd19H_8oyabhAx-NZvIQD_Ug

TXT - http://1drv.ms/t/s!AtZrd19H_8oyapvBvAo27rNJSwQ

@glibdud In theory it returns other words, but they are the same words with small differences, and what is really strange is that they do not exist in the file: "Do "Don't "Don'tworry "Don'tworryabout "Dorothy "Dorothy" – Frank Cold, Oct 17 '17 at 19:57
This link may also be helpful: https://stackoverflow.com/questions/1059559/split-strings-with-multiple-delimiters — Haochen Wu, Oct 17 '17 at 20:00
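
The run-together entries quoted in the comment above (such as Don'tworryabout) are consistent with one detail of the question's code: `word = ''` sits inside the `if word not in words:` branch, so the buffer is cleared only when a new word is stored. Whenever a word repeats, its letters stay in the buffer and the next word is appended to them. A minimal sketch of a corrected loop follows; the name `unique_words` and the choice to take a string instead of a file path are assumptions made here to keep the example self-contained.

```python
def unique_words(text):
    """Return the sorted unique words in text, using the question's separators."""
    separators = {' ', '\n', '.', ','}
    words = []
    word = ''
    for letter in text:
        if letter not in separators:
            word += letter
            continue
        # A separator ends the current word: store it if it is new and
        # non-empty, then reset the buffer in every case.
        if word and word not in words:
            words.append(word)
        word = ''
    # Store a final word that is not followed by a separator.
    if word and word not in words:
        words.append(word)
    return sorted(words)


print(unique_words("Don't worry about it. Don't worry, Dorothy"))
# ["Don't", 'Dorothy', 'about', 'it', 'worry']
```

For a whole file, the text could be read with `open(path).read()` and passed in; a `set` would also make the membership check faster than a list.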

sophros · Answer 1 · 2017-10-18T05:27:57.450

1

You may want to do it differently:

def split_file(file):
    all_words = set()
    # Plain 'r' mode: the 'U' flag was removed in Python 3.11.
    with open(file, 'r') as f:
        for ln in f:
            words = ln.strip().split()

            # Split each token on '.' and then on ','.
            dot_split = []
            for w in words:
                dot_split.extend(w.split('.'))
            comma_split = []
            for w in dot_split:
                comma_split.extend(w.split(','))

            all_words.update(comma_split)

    all_words.discard('')  # splitting 'end.' leaves an empty string behind
    print(sorted(all_words))

split_file('test_file.txt')

or simpler, using regular expressions:

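import re

def split_file2(file):
    all_words2 = set()
    with open(file, 'r') as f:  # the 'U' flag was removed in Python 3.11
        for ln in f:
            # Raw string; '.' is escaped, though it is already literal inside [...].
            words2 = re.split(r'[ \t\n\.,]', ln.strip())
            all_words2.update(words2)
    all_words2.discard('')  # adjacent separators produce empty strings
    print(sorted(all_words2))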

As a side note, I would refrain from using split as a function name: it shadows any split you import by name from the standard library (for example with from re import split) and is easy to confuse with str.split.

edited Oct 18 '17 at 05:27

answered Oct 17 '17 at 19:50


I did it like this, but in the output I got an empty list. – Frank Cold Oct 17 '17 at 19:58
the line `all_words.union(set(words.split('.').split(',')))` should be `all_words = all_words.union(set(words.split('.').split(',')))` for the union to be used as intended – Arunmozhi Oct 17 '17 at 20:33
@sophros This code has multiple errors. Tried improving and gave up. – Arunmozhi Oct 17 '17 at 20:41
@Arunmozhi - Thank you for the attempt. Indeed there were 2 issues, which I have now fixed while adding a shorter example with regexes. I am surprised you gave up on such simple code though! – sophros Oct 18 '17 at 05:29
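
The point raised in the comments about `union` can be checked in isolation: `set.union` returns a new set and leaves the original untouched, so its result has to be assigned, whereas `set.update` changes the set in place. The sample words below are made up for illustration.

```python
all_words = set()
all_words.union({"kiss", "dorothy"})   # result is discarded
print(all_words)                        # set()

all_words = all_words.union({"kiss", "dorothy"})  # rebinding keeps the result
all_words.update({"frank"})                        # update() mutates in place
print(sorted(all_words))                           # ['dorothy', 'frank', 'kiss']
```

Separately, a chained call such as `'a.b,c'.split('.').split(',')` raises `AttributeError`, because `str.split` returns a list and lists have no `split` method.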

Ajax1234 · Accepted Answer · 2017-10-17T20:10:48.103

1

You can try this:

import itertools
words = list(set(itertools.chain.from_iterable([[''.join(c for c in b if c.isalpha()) for b in i.strip('\n').split()] for i in open('filename.txt') if i != "\n"])))
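
For readers who find the one-liner hard to follow, here is a sketch of the same idea written as explicit loops. The function name `unique_alpha_words` and the sample lines are assumptions for illustration. Unlike the one-liner, this version also drops empty strings and sorts the result.

```python
def unique_alpha_words(lines):
    """Collect unique words, keeping only the alphabetic characters of each token."""
    words = set()
    for line in lines:
        for token in line.split():  # split() with no argument also skips blank lines
            cleaned = ''.join(c for c in token if c.isalpha())
            if cleaned:  # skip tokens that were pure punctuation or digits
                words.add(cleaned)
    return sorted(words)


sample = ["Don't worry about it.\n", "\n", "Why worry, Dorothy?\n"]
print(unique_alpha_words(sample))
# ['Dont', 'Dorothy', 'Why', 'about', 'it', 'worry']
```

Note that filtering with `isalpha()` also removes apostrophes, so "Don't" becomes "Dont". Passing an open file object as `lines` works as well, since iterating over a file yields its lines.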

edited Oct 17 '17 at 20:10

answered Oct 17 '17 at 19:52


It worked, but I got the same words with '?' or with a dot attached. Is there a way to eliminate not only newlines but also question marks, commas, etc.? – Frank Cold Oct 17 '17 at 20:04
It worked. Thank you very much. You saved me from sitting in a lecture and finding 100 words I do not know (English isn't my native language). :D Thanks again. – Frank Cold Oct 17 '17 at 20:17
@F_Zimny glad to help! :) – Ajax1234 Oct 17 '17 at 20:17

score 0 · Answer 3 · answered Oct 17 '17 at 19:55

0

Using the strip() and split() methods should help you here.
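
A minimal sketch of what that could look like, assuming punctuation should be removed only from the ends of each word. The helper name `words_in_line` is hypothetical; `string.punctuation` is the standard-library constant of ASCII punctuation characters.

```python
import string


def words_in_line(line):
    """Split a line on whitespace and strip punctuation from both ends of each token."""
    tokens = line.split()  # with no argument, split() handles spaces, tabs and newlines
    stripped = (t.strip(string.punctuation) for t in tokens)
    return [t for t in stripped if t]  # drop tokens that were only punctuation


print(words_in_line('"Don\'t worry," said Dorothy.\n'))
# ["Don't", 'worry', 'said', 'Dorothy']
```

Because `strip()` only touches the ends of a string, the apostrophe inside "Don't" survives. Collecting the results of every line into a `set` would then remove repeated words.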

