I am downloading an RSS feed that is published as XML and saving it with the .rss extension. I then use the rss library to parse the saved file. The issue I have is the following:

  • If I create the file (page.rss) with a relative path and pass just that filename to the rss parsing function, everything is fine (downloaded_file = 'page.rss').

  • If I type the full path into the script by hand (downloaded_file = "E:/Libraries/Documents/Android dev/page.rss"), everything also works fine.

  • But if I compute the absolute path with downloaded_file = File.join(Dir.pwd, 'page.rss'), the rss function fails. The value of the variable appears to be the same ("E:/Libraries/Documents/Android dev/page.rss"), so there must be an invisible difference in how the rss function interprets this string. I would like to be able to use the computed absolute path. How can I find out what the difference is? Thanks for any suggestions.

Here is my script:

require 'rss'
require 'open-uri'

url = 'http://tutorialspoint.com/android/sampleXML.xml'

downloaded_file = File.join(Dir.pwd, 'page.rss')                  # FAILS

puts "Path = #{downloaded_file}" #=> "E:/Libraries/Documents/Android dev/page.rss"
downloaded_file = 'page.rss'                                      # WORKS
# downloaded_file = "E:/Libraries/Documents/Android dev/page.rss" # WORKS
puts "Used path/filename: #{downloaded_file}"

# Download the url content into the rss file
# (URI.open needs Ruby 2.5 or later; older versions use plain open(url))
File.open(downloaded_file, 'wb') do |file|
  file << URI.open(url).read
end

rss = RSS::Parser.parse(downloaded_file, false) # Read rss from downloaded_file
puts "Title: #{rss.channel.title}"
jen
  • What's the output of `File.join(Dir.pwd, 'page.rss') == "E:/Libraries/Documents/Android dev/page.rss"`? – Mark Thomas Jun 05 '14 at 16:44
  • Also, what happens if you rename `Android dev` to `Android_dev`? – Mark Thomas Jun 05 '14 at 16:46
  • Well the output of 'File.join(Dir.pwd, 'page.rss')' is '"E:/Libraries/Documents/Android dev/page.rss"' as I tried to indicate. This is why I have a line printing it to check. – jen Jun 05 '14 at 23:08
  • I will try with Android_dev, or with a directory without spaces. – jen Jun 05 '14 at 23:10
  • Actually I was looking for the output of the boolean equality, in case you were looking at objects that happened to serialize to strings and thus you wouldn't be able to tell by a simple 'puts'. But @kardeiz found your problem, so this point is moot. – Mark Thomas Jun 06 '14 at 00:02

1 Answer

NEW ANSWER

Okay, so your downloaded_file string has been marked as tainted (Dir.pwd returns a tainted string, and File.join passes the taint on to its result), and RSS::Parser won't open a file from a tainted path string (see rss/parser.rb, around line 105, for more details). The solution is either to untaint the downloaded_file string before you call parse, e.g.:

RSS::Parser.parse(downloaded_file.untaint, false)

or to just open the file for the parser, e.g.:

RSS::Parser.parse(File.open(downloaded_file), false)
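
A variation on this second option, as a minimal sketch: read the file yourself and hand the XML text to the parser, so the parser never has to decide whether it trusts a path string. `feed_source` is a hypothetical helper name, not part of the rss library, and the sketch relies on RSS::Parser.parse accepting a string of XML as its source.

```ruby
# Hypothetical helper (not part of the rss library): return the feed's XML
# text so the parser receives content instead of a path string.
def feed_source(path)
  File.read(path)
end

# Intended use with the script from the question (needs require 'rss'):
#   rss = RSS::Parser.parse(feed_source(downloaded_file), false)
#   puts "Title: #{rss.channel.title}"
```

Reading the file up front also means no file handle is left open after parsing.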

I'd never run into this issue before, so thanks! I'd heard of object tainting, but I never had a reason to look into it. There is a bit more information about it in the question "What are tainted objects, and when should we untaint them?".
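
To see this kind of "invisible difference" for yourself, it can help to print the properties of a string that puts hides. The following is only a sketch: `string_report` is a hypothetical helper, and the taint check is guarded because the taint mechanism was deprecated in Ruby 2.7 and later removed, so `tainted?` may not exist on your Ruby.

```ruby
# Hypothetical helper: collect the properties of a string that puts hides.
def string_report(str)
  report = {
    inspect:  str.inspect,       # shows escapes and invisible characters
    encoding: str.encoding.name,
    bytesize: str.bytesize,
    frozen:   str.frozen?
  }
  # Only ask about taint on Rubies that still define tainted?.
  report[:tainted] = str.tainted? if str.respond_to?(:tainted?)
  report
end

literal  = 'page.rss'
computed = File.join(Dir.pwd, 'page.rss')

p string_report(literal)
p string_report(computed)
```

On an old enough Ruby, the two reports should differ in the :tainted entry even when the printed paths look alike.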

PREVIOUS ANSWER

Dir.pwd is going to change depending on where you call the script from. Unless you are calling the script from E:/Libraries/Documents/Android dev, the filepath will be off.

It's better to build your filepath from the location of your script itself. To do so you can add:

ROOT = File.expand_path('..', __FILE__)
downloaded_file = File.join(ROOT, 'page.rss')
# or just downloaded_file = File.expand_path('../page.rss', __FILE__)
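
As a rough illustration of the difference, the sketch below resolves the script's directory once at load time and then changes the working directory. `beside_script` is a hypothetical helper name introduced only for this example.

```ruby
# Resolve the script's directory once, at load time.
ROOT = File.expand_path('..', __FILE__)

# Hypothetical helper: absolute path of a file that sits next to the script.
def beside_script(name)
  File.join(ROOT, name)
end

puts beside_script('page.rss')

# Changing the working directory affects Dir.pwd but not ROOT.
Dir.chdir('..') do
  puts File.join(Dir.pwd, 'page.rss')  # follows the working directory
  puts beside_script('page.rss')       # still next to the script
end
```

Computing ROOT before any Dir.chdir matters, because __FILE__ can be a relative path that is expanded against the current working directory.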
Jacob Brown
  • This is a good point to keep in mind, and I will follow your recommendation. In this case, though, I call the script from its own directory, so the filepath is the same. I still have the question of why rss works with `downloaded_file = "E:/Libraries/Documents/Android dev/page.rss"` and not with `downloaded_file = File.join(ROOT, 'page.rss')`, which has the same value. – jen Jun 05 '14 at 20:03
  • @jen, I updated my answer. When you set the file string directly, the string was marked as `untainted`, but when you set it dynamically it was getting marked as `tainted`, which the RSS parser doesn't like for some reason... – Jacob Brown Jun 05 '14 at 20:34
  • I was not aware of this property. But I am still a beginner. Both methods work perfectly and resolve my issue. Many thanks. – jen Jun 05 '14 at 23:39