Compression function for words.. python

Question

This is supposed to be a compression function. We're supposed to read in a text file with just words, and sort them by frequencies and by the amount of words. Upper and lower case. I don't an answer to solve it, I just want help.

for each word in the input list
    if the word is new to the list of unique words
        insert it into the list and set its frequency to 1
    otherwise
        increase its frequency by 1

sort unique word list by frequencies (function)

open input file again
open output file with appropriate name based on input filename (.cmp)
write out to the output file all the words in the unique words list

for each line in the file (delimited by newlines only!)
   split the line into words
   for each word in the line
     look up each word in the unique words list
       and write its location out to the output file
     don't output a newline character
   output a newline character after the line is finished

close both files
tell user compression is done

This is my next step:

def compression():

    for line in infile:
        words = line.split()    

def get_file():

    opened = False
    fn = input("enter a filename ")
    while not opened:
        try:
            infile = open(fn, "r")
            opended = True
        except:
            print("Won't open")
            fn = input("enter a filename ")

    return infile

def compression():

    get_file()

    data = infile.readlines()
    infile.close()

    for line in infile:
        words = line.split()

    words = []
    word_frequencies = []

def main():

    input("Do you want to compress or decompress? Enter 'C' or 'D' ")

main()

If you really are trying, show us your code and what specific problems you have. Stack Overflow, as a rule, will not be doing your entire homework for you. — Amadan, Dec 11 '13 at 01:10
Is any of that numbered list part of the problem specification rather than your own thoughts on how to solve this? Also, what version of Python are you using, and do you have any restrictions on using classes from the standard library? I have a simple solution in mind, though as you requested I'll give suggestions for it rather than just the answer. — Peter DeGlopper, Dec 11 '13 at 01:14
did item #1 in the spec really instruct you to use a `list`? A `dict` is the natural choice for counting/tallying operations. — roippi, Dec 11 '13 at 01:16
Maybe the data must be stored in a list, but a dictionary can be created with the keys as said list? — Monkeyanator, Dec 11 '13 at 01:25
Hmm. Between the requirement that you use a list and the references to decompressing - is this supposed to be reversible? A straight frequency count is not reversible, but storing repeated elements as a count can be a compression for some situations. — Peter DeGlopper, Dec 11 '13 at 01:25
I have to compress and decompress for the program, but I'm just worried about the compression part for now. — user3089144, Dec 11 '13 at 01:27
I'm also not allowed to use anything I haven't learned in class, and have never heard of a dictionary. — user3089144, Dec 11 '13 at 01:29
@PeterDeGlopper look at the pseudocode - there is a list of unique words, and an output file with a sequence of list indices for where each word goes. For example, "Foo Bar Baz" would become `["Foo", "Bar", "Baz"]` with outfile `0 1 2` — MattDMo, Dec 11 '13 at 01:35
The pseudocode helps clarify considerably - this looks like a rough outline of Hoffman coding, using words as the atoms and decimal numbers rather than bit sequences as the codes. Weird, but I can halfway see how it could be useful. — Peter DeGlopper, Dec 11 '13 at 02:14
You can't just call split on the opened file. You actually have to read the data from the file, and then call split on that data. — Monkeyanator, Dec 11 '13 at 02:20

score 1 · Answer 1 · answered Dec 11 '13 at 01:14

So I'm not going to do an entire assignment for you, but I can try my best and walk you through it one by one.

It would seem the first step is to create an array of all the words in the text file (assuming you know file reading methods). For this, you should look into Python's split function (regular expressions can be used for more complex variations of this). So you need to store each word somewhere, and pair that value with the amount of times it appears. Sounds like the job of a dictionary, to me. That should get you off on the right track.

Peter DeGlopper · Answer 2 · 2013-12-11T03:45:07.680

1

Thanks to the pseudocode I can more or less figure out what this should look like. I have to do some guesswork since you say you're restricted to what you've covered in class, and we have no way to know what that includes and what it doesn't.

I am assuming you can handle opening the input file and splitting it into words. That's pretty basic stuff - if your class is asking you to handle input files it must have covered it. If not, the main thing to look back to is that you can iterate over a file and get its lines:

with open('example.txt') as infile:
    for line in infile:
        words = line.split()

Now, for each word you need to keep track of two things - the word itself and its frequency. Your problem requires you to use lists to store your information. This means you have to either use two different lists, one storing the words and one storing their frequencies, or use one list that stores two facts for each of its positions. There are disadvantages either way - lists are not the best data structure to use for this, but your problem definition puts the better tools off limits for now.

I would probably use two different lists, one containing the words and one containing the frequency counts for the word in the same position in the word list. The advantage there is that that would allow using the index method on the word list to find the list position of a given word, and then increment the matching frequency count. That will be much faster than searching a list that stores both the word and its frequency count using a for loop. The down side is that sorting is harder - you have to find a way to retain the list position of each frequency when you sort, or to combine the words and their frequencies before sorting. In this approach, you probably need to build a list data that stores two pieces of information - the frequency count and then either the word list index or the word itself - and then sort that list by the frequency count.

I hope that part of the point of this exercise is to help drive home how useful the better data structures are when you're allowed to use them.

edited So the inner loop is going to look something like this:

words = []
word_frequencies = []
for line in infline:
    for word in line.split():
        try:
            word_position = words.index(word)
        except ValueError:
            # word is not in words
            words.append(word)
            # what do you think should happen to word_frequencies here?
        else:
            # now word_position is a number giving the position in both words and word_frequencies for the word
            # what do you think should happen to word_frequences here?

edited Dec 11 '13 at 03:45

answered Dec 11 '13 at 02:29

Peter DeGlopper

36,326
7
90
83

Okay, now that I have the words separated. I think I need to initialize two separate lists. But I'm not sure how to how long they need to be... so like ... wordcount = [0] * len(words) or something like that? – user3089144 Dec 11 '13 at 03:05
I'm not quite sure what you mean by "initialize" - I'd expect to see declarations like `words = []` and `word_frequencies = []` if you're using two matching lists as I alluded to. You don't have to preset your list lengths in Python if you don't know them up front - `append` and so on will work basically indefinitely. I'm not sure offhand exactly where the limit falls on a typical system, but it's very, very high. If you do know them it can be an optimization in some cases but it's not worth worrying about if you don't. – Peter DeGlopper Dec 11 '13 at 03:07
Ah, here we are: http://stackoverflow.com/questions/855191/how-big-can-a-python-array-get - in any case preallocation won't change that, it's just a performance improvement in cases in which it matters at all. Mostly you can just not worry about it. – Peter DeGlopper Dec 11 '13 at 03:14
I've edited to readd some information you removed - I think it's necessary to help the community understand what you're trying to do, but you might want to review to make sure it's accurate. Also, your attempted code is erratic - just for instance, your `get_file()` function returns the opened file, but your second definition of `compression` doesn't capture the returned file handle. That part should be something like `infile = get_file()`. And after you close a file you can't read from it anymore. – Peter DeGlopper Dec 11 '13 at 03:25
Basically, you should initialize your `words` and `word_frequencies` list before reading any lines. You're then going to have a `for` loop over each `word` in each `line` in `infile`, finding the index of that word in `words` if there is one and handling the case where there isn't with `append`. You'll then have an index into `word_frequencies` for a value to update, or the value `index` gives when it can't find a match - in which case you can `append` to `word_frequencies` as well. (Again, trying not to just write a solution.) – Peter DeGlopper Dec 11 '13 at 03:31
ugh.. yeah I don't really understand that. – user3089144 Dec 11 '13 at 03:37
Edited in some skeleton code. On one point I should apologize - I had remembered the `index` function of lists as working more like `find` does on strings, returning `-1` if it can't find a match rather than raising an exception. If you haven't covered exceptions yet, you're probably best off using a single list that stores both the word and its count and searching through it using a `for` loop. – Peter DeGlopper Dec 11 '13 at 03:46
I'm guessing for each word that it finds, the frequency would go up one, and if it finds another word like it, it would keep adding to that frequency. – user3089144 Dec 11 '13 at 03:49
We have covered exceptions. – user3089144 Dec 11 '13 at 03:50
I'm still not sure how to write that... The project is due tonight, and this isn't even the last part. :( – user3089144 Dec 11 '13 at 17:42

Compression function for words.. python

2 Answers2