I'm trying to complete the first task of our assignment:
Get 5 regular emails and 5 advance-fee fraud emails (aka spam). Convert them all into text files and then turn each into an array of words (split may help here). Then use a bunch of regular expressions to search the array of words looking for keywords to classify which files are spam or not. If you want to get fancy you could give each array a spam-score out of 10.
My plan is:
- Open the HTML page and read the file.
- Strip scripts, links, etc. from the file.
- Keep the body/paragraph text on its own.
- Open a text file (file2) and write to it (UTF-8).
- Pass in the content from the HTML document (file1).
- Read the text file (file2) back in and split its contents into an array of words.
- Go through the array looking for any words that are considered spam, and print a message to the screen stating whether the email is spam or not.
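For the last step, this is roughly what I have in mind. It is only a minimal sketch: the keyword list, the `spam_score` and `spam?` helpers, and the threshold of 5 are all made up for illustration and are not part of the assignment. Each pattern that matches at least one word adds one point, which gives a score out of 10.

```ruby
# Hypothetical keyword patterns; a real list would come from studying the sample emails.
SPAM_PATTERNS = [
  /\bmillion\b/i,
  /\binheritance\b/i,
  /\bbeneficiary\b/i,
  /\burgent\b/i,
  /\bconfidential\b/i,
  /\btransfer\b/i,
  /\bbank\b/i,
  /\bprince\b/i,
  /\blottery\b/i,
  /\bwinner\b/i
].freeze

# Returns a score from 0 to 10: one point per pattern that matches at least one word.
def spam_score(words)
  SPAM_PATTERNS.count { |pattern| words.any? { |word| word =~ pattern } }
end

# Assumed threshold: an email is flagged when at least half of the patterns match.
def spam?(words, threshold = 5)
  spam_score(words) >= threshold
end

# Stand-in for the array of words read from a text file.
sample = "URGENT and confidential: you are the beneficiary of a 5 million bank transfer".split(' ')

score = spam_score(sample)
puts "Spam score: #{score}/10"
puts spam?(sample) ? "This email looks like spam." : "This email looks regular."
```

Counting matching patterns rather than matching words keeps the score capped at 10, so one keyword repeated many times cannot inflate it.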
Here is my code:
require 'nokogiri'

file = File.open("EMAILS/REG/Membership.htm", "r")
doc = Nokogiri::HTML(file)
file.close

# Whatever is passed from elements to newFile ends up in the new array; however, the euro sign doesn't appear correctly
elements = doc.xpath("/html/body//p").text
# puts elements

newFile = File.open("test1.txt", "w")
newFile.write(elements)
newFile.close

# I want to open the file again and print the lines to the screen
array_of_words = {}
puts "\n\tRetrieving test1.txt...\n\n"
File.open("test1.txt", "r:UTF-8").each_line do |line|
  words = line.split(' ')
  words.each do |word|
    puts "#{word}"
    # array_of_words[word] = gets.chomp.split(' ')
  end
end
EDITED: I've revised the code below; however, the euro sign still doesn't come through as UTF-8 in the array (see the image).
require 'nokogiri'

doc = Nokogiri::HTML(File.open("EMAILS/REG/Membership.htm", "r:UTF-8"))

# Whatever is passed from elements to the new file ends up in the new
# array; however, the euro sign doesn't appear correctly
elements = doc.xpath("//p").text
# puts elements
File.write("test1.txt", elements)

puts "\n\tRetrieving test1.txt...\n\n"

# I want to open the file again and print the lines to the screen
word_array = []
File.read("test1.txt").each_line do |line|
  line.split(' ').each do |word|
    puts word
    word_array << word
  end
end
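To narrow down where the euro sign gets damaged, a small Nokogiri-free experiment may help. This is only a sketch under an assumption I have not verified: that the .htm file was saved in Windows-1252 rather than UTF-8. In that case, opening it with "r:UTF-8" merely relabels the bytes without converting them. The scratch file name `euro_test.txt` is hypothetical.

```ruby
require 'tmpdir'

euro = "\u20AC"            # the euro sign as a UTF-8 string
puts euro.bytes.inspect    # its bytes in UTF-8

# Simulate a page saved as Windows-1252 (an assumption about the source file).
legacy = "Fee: \u20AC50".encode("Windows-1252")

path = File.join(Dir.tmpdir, "euro_test.txt")   # hypothetical scratch file
File.binwrite(path, legacy)

# Reading with "r:UTF-8" only labels the bytes as UTF-8; nothing is converted.
wrong = File.open(path, "r:UTF-8", &:read)
puts "Labelled as UTF-8, valid? #{wrong.valid_encoding?}"

# Declaring the real external encoding and UTF-8 as the internal one transcodes on read.
right = File.open(path, "r:Windows-1252:UTF-8", &:read)
puts "Transcoded, valid? #{right.valid_encoding?}"

words = right.split(' ')
puts words.inspect

# Writing with an explicit encoding keeps the output file in UTF-8.
File.write(path, right, encoding: "UTF-8")
```

If the file turns out to be UTF-8 already, the parser's guess is the next suspect. Nokogiri::HTML accepts an encoding as its third argument, as in Nokogiri::HTML(file, nil, 'UTF-8'), which may be worth trying. A console that is not set to UTF-8 can also display the character incorrectly even when the string itself is fine.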