I'm trying to complete the first task of our assignment:

Get 5 regular emails and 5 advance-fee fraud emails (aka spam). Convert them all into text files and then turn each into an array of words (split may help here). Then use a bunch of regular expressions to search the array of words looking for keywords to classify which files are spam or not. If you want to get fancy you could give each array a spam-score out of 10.

  1. Open HTML page and read file.
  2. Strip script, links etc from file.
  3. Have body/para on its own.
  4. Open text file (file2) & write to it (UTF-8).
  5. Pass content from HTML document (file 1).
  6. Now put the words from text file (file2) into an array and later split.
  7. Go through array finding any words that are considered spam and print message to screen stating if the email is a spam or not.

Here is my code:

require 'nokogiri'
file = File.open("EMAILS/REG/Membership.htm", "r")
doc = Nokogiri::HTML(file)
# Whatever is passed from elements to newFile ends up in the new array, but the euro sign doesn't appear correctly
elements = doc.xpath("/html/body//p").text
#puts elements

newFile = File.open("test1.txt", "w")
newFile.write(elements)
newFile.close()


#I want to open the file again and print the lines to the screen
#
array_of_words = {}
puts "\n\tRetrieving test1.txt...\n\n"
File.open("test1.txt", "r:UTF-8").each_line do |line|
    words = line.split(' ')
    words.each do |word|
        puts "#{word}"
        #array_of_words[word] = gets.chomp.split(' ')
    end
end

EDITED: I've edited the code; however, I still can't get the UTF-8 euro sign to come through correctly in the array (see the image).

require 'nokogiri'

doc = Nokogiri::HTML(File.open("EMAILS/REG/Membership.htm", "r:UTF-8"))

# Whatever is passed from elements to the new file ends up in the new
# array, but the euro sign doesn't appear correctly
elements = doc.xpath("//p").text
#puts elements

File.write("test1.txt", elements)

puts "\n\tRetrieving test1.txt...\n\n"

#I want to open the file again and print the lines to the screen
#
word_array = Array.new
File.read("test1.txt").each_line do |line|
    line.split(' ').each do |word|
        puts "#{word}"
        word_array << word
    end
end
Jonathan Hall
red_bairn

2 Answers

You're making things harder for yourself. You already have the paragraph text in elements so there's no need to read test1.txt after writing to it. Then use String#split without arguments to split on all whitespace.
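
As a minimal sketch of that suggestion: the sample string below is made up for illustration and stands in for the text already held in `elements`.

```ruby
# Hypothetical stand-in for the paragraph text held in `elements`.
elements = "Dear member,\n\tyour fee of €50   is due.\n"

# With no arguments, String#split splits on runs of whitespace
# (spaces, tabs, newlines) and ignores leading and trailing whitespace.
word_array = elements.split

p word_array
```

No intermediate file is involved, so the words never leave the string Nokogiri produced.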

Max

Because this is an assignment, I'm not going to try to answer how you're supposed to do this; you're supposed to figure it out on your own.

What I will do is show you how you should have written what you've already done, and point you in a direction:

require 'nokogiri'

doc = Nokogiri::HTML(File.read("EMAILS/REG/Membership.htm"))

# Whatever is passed from elements to the new file ends up in the new
# array, but the euro sign doesn't appear correctly
elements = doc.xpath("//p").text

File.write("test1.txt", elements)

print "\n\tRetrieving test1.txt...\n\n"

# I want to open the file again and print the lines to the screen
word_hash = {}
File.foreach("test1.txt", encoding: "UTF-8") do |line|
  line.split.each do |word|
    puts word
    #word_hash[word] = gets.chomp.split(' ')
  end
end

Many of Ruby's IO methods, and File's by inheritance, can take a block and automatically close the stream when the block exits. Use that capability; leaving files open for the whole run time of an app is bad practice.
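
A short sketch of the block forms, assuming a throwaway file name (`block_demo.txt`) and contents invented purely for this illustration:

```ruby
# Hypothetical file name and contents, used only for this illustration.
path = "block_demo.txt"

# Block form of File.open: the file is closed when the block exits,
# even if an exception is raised inside it.
File.open(path, "w:UTF-8") do |file|
  file.write("Price: €50\n")
end

# File.foreach opens the file, yields each line, and closes it afterwards.
words = []
File.foreach(path, encoding: "UTF-8") do |line|
  words.concat(line.split)
end

p words

# Clean up the demo file.
File.delete(path)
```

Neither call leaves a file handle behind, so there is nothing to close by hand.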

array_of_words = {} doesn't define an array, it's a hash.
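
For comparison, a tiny sketch of the two literals; the word used is an arbitrary example:

```ruby
array_of_words = []   # [] creates an Array
word_hash      = {}   # {} creates a Hash

array_of_words << "lottery"   # Array#<< appends an element
word_hash["lottery"] = 1      # a Hash stores key/value pairs

p array_of_words.class
p word_hash.class
```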

#array_of_words[word] = gets.chomp.split(' ') wouldn't work because of where gets wants to read from. By default it's STDIN, which would be the console, meaning the keyboard. You've already got word at that point so do something with it.

But think: you're basically creating the basis for a Bayesian filter. You need to count the occurrences of words, so merely assigning the word to the hash won't tell you what you want to know; you need to know how many times a particular word was seen. Stack Overflow has many answered questions about counting the words in a string, so search for those.
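
To show only the counting step, here is one possible sketch. The `count_words` helper and the sample sentence are hypothetical, and lowercasing the text before counting is an assumption about what you'd want, not part of the assignment.

```ruby
# Hypothetical helper: returns a hash mapping each word to how often it occurs.
def count_words(text)
  counts = Hash.new(0)   # unseen words start at 0
  # Lowercase first so "PRIZE" and "prize" are counted as the same word.
  text.downcase.scan(/[[:alpha:]']+/) { |word| counts[word] += 1 }
  counts
end

# Made-up sample text for the illustration.
sample = "Urgent: claim your prize. Your PRIZE is waiting!"
counts = count_words(sample)

p counts
```

Using `Hash.new(0)` means a word that was never seen reads as 0 instead of `nil`, which keeps the `+= 1` simple.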

the Tin Man
  • I was also looking at frequency of words. I actually have about 3 separate files. Something like this one for frequency of words? http://stackoverflow.com/questions/9675146/how-to-get-words-frequency-in-efficient-way-with-ruby – red_bairn Oct 30 '13 at 14:57
  • Yes. There's a complication in what you're going to need to do, that is too advanced for you at this point, but it has to do with word-tense and synonyms and how you find the roots of words. Counting occurrences of a word can be skewed/fooled by singular/plural spellings so for accuracy you need to be able to resolve those differences. "[WordNet](http://wordnet.princeton.edu/)" would be your friend at that point. – the Tin Man Oct 30 '13 at 15:20
  • "On question 1; "get a few text files" means the simple version of it, like get them by hand (if you want to you can load them in or scrape them); but the simple solution is just to represent them directly in the program as a string and then work on them." Ugh. I just found this comment under the News section of our Online resources from the lecturer... – red_bairn Oct 30 '13 at 15:36
  • "Step one: Read all the instructions." – the Tin Man Oct 30 '13 at 15:50
  • It was an additional note in the News Forum external to the instructions. I wish he'd said it in the original instructions. But I'm still going to work on this rather than have the string inside the file. – red_bairn Oct 30 '13 at 16:07
  • The euro sign shows up correctly in the text file but it doesn't show up when printed to the screen in the word_hash. – red_bairn Oct 30 '13 at 16:21