0

I wanted to make a program that splits a txt file into words and returns a list of those words without any repetition. I converted my PDF book to txt and then ran my program on it, but it failed completely. I have no idea what I've done wrong. Here's my code:

def split(file):
    lines = open(file, 'rU').readlines()
    words = []
    word = ''
    for line in lines:
        for letter in line:
            if letter not in [' ', '\n', '.', ',']:
                word += letter
            elif letter in [' ', '\n', '.', ',']:
                if word not in words:
                    words.append(word)
                    word = ''

    words.sort()
    return words


for word in split('AKiss.txt'):
    print(word, end=' ')

I have also attached AKiss.txt and the original PDF in case they are useful.

PDF - http://1drv.ms/b/s!AtZrd19H_8oyabhAx-NZvIQD_Ug

TXT - http://1drv.ms/t/s!AtZrd19H_8oyapvBvAo27rNJSwQ

Frank Cold
  • *without repetition*... Why not use set instead of a list? – Mangohero1 Oct 17 '17 at 19:50
  • Can you describe how it's failing? – glibdud Oct 17 '17 at 19:50
  • @glibdud It returns what are theoretically different words, but they are the same words with small differences, and what is really strange is that they do not exist in the file: "Do "Don't "Don'tworry "Don'tworryabout "Dorothy "Dorothy" – Frank Cold Oct 17 '17 at 19:57
  • This link may also be helpful: https://stackoverflow.com/questions/1059559/split-strings-with-multiple-delimiters – Haochen Wu Oct 17 '17 at 20:00
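
The fused words reported in the comments ("Don'tworry", "Don'tworryabout") are consistent with one detail of the code in the question: `word = ''` sits inside the `if word not in words:` branch, so the buffer is cleared only when a new word is stored. When a repeated word is read, the buffer is kept and the next word is appended to it. A minimal sketch of a corrected loop follows. The name `unique_words` and its `delimiters` parameter are made up for this illustration, and it takes a list of lines instead of a file name so it can be tried without AKiss.txt:

```python
def unique_words(lines, delimiters=(' ', '\n', '.', ',')):
    """Return the sorted unique words found in an iterable of lines."""
    words = []
    word = ''
    for line in lines:
        for letter in line:
            if letter not in delimiters:
                word += letter
            else:
                # Store the word only if it is non-empty and not seen yet...
                if word and word not in words:
                    words.append(word)
                # ...but clear the buffer after every delimiter, stored or not.
                word = ''
    # Flush the last word if the input does not end with a delimiter.
    if word and word not in words:
        words.append(word)
    words.sort()
    return words


print(unique_words(["Don't worry about it.\n", "Don't worry, Dorothy.\n"]))
# ["Don't", 'Dorothy', 'about', 'it', 'worry']
```

Quotation marks, question marks and other punctuation are still treated as letters here, because the delimiter list is the same as in the question.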

3 Answers

1

You may want to do it differently:

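def split_file(file):
    all_words = set()
    # the 'rU' mode was removed in Python 3.11; default text mode is enough
    with open(file) as f:
        for ln in f:
            words = ln.strip().split()

            # split each whitespace-separated token on '.', then on ','
            dot_split = []
            for w in words:
                dot_split.extend(w.split('.'))
            comma_split = []
            for w in dot_split:
                comma_split.extend(w.split(','))

            all_words.update(comma_split)

    # splitting e.g. 'end.' on '.' leaves an empty string behind
    all_words.discard('')
    print(sorted(all_words))

split_file('test_file.txt')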

or simpler, using regular expressions:

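import re

def split_file2(file):
    all_words2 = set()
    # the 'rU' mode was removed in Python 3.11; default text mode is enough
    with open(file) as f:
        for ln in f:
            # raw string; inside [...] the '.' is literal and needs no escape
            words2 = re.split(r'[ \t\n.,]+', ln.strip())
            all_words2.update(words2)
    # a trailing delimiter or a blank line leaves an empty string behind
    all_words2.discard('')
    print(sorted(all_words2))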

As a side note, I would refrain from using split as a function name: it is easily confused with the str.split method, and it would shadow a split function imported from the standard library (for example via from re import split).

sophros
  • I did it like this, but in the output I got an empty list. – Frank Cold Oct 17 '17 at 19:58
  • the line `all_words.union(set(words.split('.').split(',')))` should be `all_words = all_words.union(set(words.split('.').split(',')))` for the union to be used as intended – Arunmozhi Oct 17 '17 at 20:33
  • @sophros This code has multiple errors. Tried improving and gave up. – Arunmozhi Oct 17 '17 at 20:41
  • @Arunmozhi - Thank you for the attempt. Indeed, there were 2 issues that I have now fixed while adding a shorter example with regexes. I am surprised you gave up on such simple code, though! – sophros Oct 18 '17 at 05:29
1

You can try this:

import itertools
words = list(set(itertools.chain.from_iterable([[''.join(c for c in b if c.isalpha()) for b in i.strip('\n').split()] for i in open('filename.txt') if i != "\n"])))
Ajax1234
  • It worked, but I got the same words with '?' or a dot attached. Is there a way to eliminate not only newlines but also question marks, commas, etc.? – Frank Cold Oct 17 '17 at 20:04
  • It worked. Thank you very much. You saved me from sitting in a lecture and hunting for 100 words I do not know (English isn't my native language). :D Thanks again. – Frank Cold Oct 17 '17 at 20:17
  • @F_Zimny glad to help! :) – Ajax1234 Oct 17 '17 at 20:17
0

Using the strip() and split() methods should help you here.
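
As a rough illustration of that suggestion, here is a minimal sketch that uses only str.split() and str.strip(). The function name `unique_words_from_text` and the sample sentence are made up for this example. Stripping `string.punctuation` is an assumption about which characters count as noise; it removes them only from the two ends of each token, so the apostrophe in "Don't" is kept:

```python
import string


def unique_words_from_text(text):
    """Collect unique words using only str.split() and str.strip()."""
    unique = set()
    for token in text.split():  # split on any run of whitespace
        # strip() with an argument removes those characters from both ends only
        cleaned = token.strip(string.punctuation)
        if cleaned:  # skip tokens that were pure punctuation
            unique.add(cleaned)
    return sorted(unique)


sample = '"Don\'t worry about it," said Dorothy. "Don\'t worry!"'
print(unique_words_from_text(sample))
# ["Don't", 'Dorothy', 'about', 'it', 'said', 'worry']
```

For a file, the text could be read with `with open('AKiss.txt') as f: text = f.read()` and passed to the function. The comparison is case-sensitive, so "The" and "the" count as different words unless the text is lowercased first.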