Problem with TXT file extraction in ruby

Question

I have a data file in TXT format, and I would like to parse the URL field from it using the Ruby code below:

f = File.open(txt_file, "r")
f.each_line { |line|
  rows = line.split(',')
  rows[3].each do |url|
    next if url=="URL"
    puts url
  end
}

TXT contains:

name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"

output:

Why does the output come from the option field "0,0,0,0,0,0"? How do I skip this and get the URL field?

Environment: Ruby 1.8.7, Rails 2.3.8, gem 1.3.7

score 2 · Answer 1 · answered Apr 30 '11 at 13:49

I'd check out a CSV parsing tool to make this easier:

 require 'rubygems'
 require 'faster_csv'

 # Print the URL column (index 3) of every row, skipping the header row.
 FasterCSV.foreach(txt_file, :quote_char => '"',
         :col_sep => ',', :row_sep => :auto) do |row|
   puts row[3] if row[3] != "URL"
 end

Also, I think you're misunderstanding how split() works. If you run split() against one row from your file, you get back an array of columns for that single row, not a multidimensional array as rows[3].each would suggest.
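To make that concrete, here is a small sketch that runs split(',') on one data line copied from the question. The variable names are only for illustration. It shows that the result is a flat array of strings, and that the commas inside the quoted option field produce extra pieces, so index 3 is not the URL.

```ruby
# One data line copied from the question.
line = '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"'

# split(',') knows nothing about quotes, so it cuts at every comma.
pieces = line.split(',')

p pieces.length    # 9 pieces instead of the 4 columns you expect
p pieces[3]        # "0", a fragment of the quoted option field
p pieces[3].class  # String, so there is no inner array to iterate over
p pieces.last      # the URL, still wrapped in double quotes
```

This is why the question's loop prints a value from the option field instead of the URL.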

In 1.9 it's `require 'csv'` since FasterCSV replaced the old CSV. — the Tin Man, Apr 30 '11 at 22:40

score 1 · Answer 2 · answered Apr 30 '11 at 13:54

EDIT: Before reading, I completely agree with the answer by Jeff Swensen, I'll leave my answer here regardless.

I'm not entirely sure what your inner loop (rows[3].each) is for, because a single line only contains a single URL, so there is nothing to iterate over. You could split on the ** characters and return an Array of URLs, but then you would still need to remove the extra double quotes. Alternatively, you could use a regular expression, like so:

#!/usr/bin/env ruby

f = DATA
urls = f.readlines.map do |line|
  line[/([^"]+)"\*\*/, 1] 
end
urls.compact!

p urls

__END__
name ,option,price, **URL**
"x", "0,0,0,0,0,0", "123.40",**"http://domain.com/xym.jpg"**
"x", "0,0,0,0,0,0", "111.34",**"http://domain.com/yum.jpg"**

The call to compact is needed because map will insert nil objects when a line doesn't match the expression. For the String#[] method, see the Ruby String documentation.
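As a rough illustration of the String#[] plus compact idea on the unmarked data from the question (no ** markers), here is a sketch. The LAST_QUOTED pattern and the variable names are my own assumptions, not part of the answer above. It assumes the URL is the last quoted field on the line and contains no double quotes.

```ruby
# Lines copied from the question, without the ** markers.
lines = [
  'name,option,price,URL',
  '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"',
  '"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"'
]

# Hypothetical pattern: the last double-quoted field before the end of the line.
LAST_QUOTED = /"([^"]+)"\s*\z/

# String#[] with a regex and a group index returns that group, or nil if nothing matches.
matches = lines.map { |line| line[LAST_QUOTED, 1] }
p matches.first   # nil, because the header line has no quoted field

# compact drops the nil left behind by the header line.
urls = matches.compact
p urls
```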

Hi injekt, thanks. The TXT file here is actually a MySQL dump. It has 20 columns, but I only need to parse about 8 of them. I only posted the problem I faced with the URL field; the remaining fields are parsed correctly (and stored across multiple related tables). — prabu, May 02 '11 at 09:15

score 1 · Answer 3 · edited May 23 '17 at 12:12

The reason that "0" is the result is that your code is blindly splitting on the comma char, when you seem to be expecting CSV-style parsing (where column values may contain delimiter chars if the entire column value is enclosed in quotes). I highly suggest using a CSV parser. If you are using Ruby 1.9.2, FasterCSV is already available as the standard csv library.
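For what it's worth, here is a minimal sketch of the CSV-parser route using the standard csv library on Ruby 1.9 or later (so not directly usable on 1.8.7 without the faster_csv gem). One assumption to flag: the sample data has spaces between some commas and the opening quotes, which a strict CSV parser may reject, so the sketch strips that padding first. The cleaned and urls names are just for illustration.

```ruby
require 'csv'

# Sample data copied from the question.
text = <<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT

# Assumption: whitespace between a comma and an opening quote is only padding
# and never appears inside a quoted value, so it is safe to remove.
cleaned = text.gsub(/,[ \t]+"/, ',"')

# headers: true treats the first line as column names, so the URL column
# can be looked up by name instead of by position.
urls = CSV.parse(cleaned, :headers => true).map { |row| row['URL'] }
p urls
```

Looking the column up by name also means the header row never shows up in the output.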

answered Apr 30 '11 at 13:54 by buruzaemon

Thanks buruzaemon and I'm using ruby 1.8.7. – prabu May 02 '11 at 09:38

sawa · Answer 4 · 2011-05-03T00:52:32.223

If you are sure that the fields you want are always surrounded by double quotations, you can use that as the basis for extracting rather than the comma.

File.open(txt_file) do |f|
  f.each_line do |l|
    cols = l.scan(/(?<!\\)"(.*?)(?<!\\)"/)
    cols[3].tap{|url| puts url if url}
  end
end

In your code, the opened IO is not closed. This is a bad practice. It is better to use a block so that you do not forget to close it.
The two (?<!\\)" in the regex match non-escaped double quotations. They use negative lookbehind.
.*? is a non-greedy match, which avoids a match from exceeding a non-escaped double quotation.
tap is to avoid repeating the cols[3] operation twice in puts and if.
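Here is a self-contained sketch of the points above that you can run on a recent Ruby (the lookbehind needs 1.9 or later, and the heredoc syntax needs 2.3 or later). The Tempfile only stands in for your txt_file, and QUOTED_FIELD, handle and urls are names I made up for the example. It also shows that scan with a capture group returns nested arrays, which flatten turns into a plain list of fields.

```ruby
require 'tempfile'

# Stand-in for txt_file: the sample data from the question in a temporary file.
sample = Tempfile.new('sample_txt')
sample.write(<<~DATA)
  name,option,price,URL
  "x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
  "x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
DATA
sample.close

# Same regex as in the answer: a quoted field delimited by non-escaped double quotes.
QUOTED_FIELD = /(?<!\\)"(.*?)(?<!\\)"/

urls = []
handle = nil
File.open(sample.path) do |f|
  handle = f  # kept only so we can inspect it after the block
  f.each_line do |line|
    # scan returns one inner array per match, e.g. [["x"], ["0,0,0,0,0,0"], ...]
    fields = line.scan(QUOTED_FIELD).flatten
    urls << fields[3] if fields[3]  # the header line has no quoted fields
  end
end

p urls
p handle.closed?  # the block form of File.open closed the file for us
```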

Edit again

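If you use Ruby 1.8.7, you can either

update your regex engine to Oniguruma by following the steps at http://oniguruma.rubyforge.org/

or

replace the regex with one that does not use lookbehind. The version below also replaces tap with a plain local variable (tap was only added in Ruby 1.8.7, so this keeps the code working on older 1.8 releases):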

File.open(txt_file) do |f|
  f.each_line do |l|
    cols = l.scan(/(?:\A|[^\\])"(.*?[^\\]|)"/)
    url = cols[3]
    puts url if url
  end
end
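If you want to check that the two regexes agree, here is a quick sketch you can run on Ruby 1.9 or later (where both patterns compile). It only compares them on one sample line from the question, so treat it as a sanity check rather than proof that they behave the same on every input. The constant names are just for illustration.

```ruby
# One data line copied from the question.
line = '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"'

WITH_LOOKBEHIND    = /(?<!\\)"(.*?)(?<!\\)"/       # needs Oniguruma (Ruby 1.9+)
WITHOUT_LOOKBEHIND = /(?:\A|[^\\])"(.*?[^\\]|)"/   # also compiles on the 1.8 engine

# Both scans return nested arrays because of the capture group; flatten them.
fields_a = line.scan(WITH_LOOKBEHIND).flatten
fields_b = line.scan(WITHOUT_LOOKBEHIND).flatten

p fields_a
p fields_a == fields_b  # true for this sample line
```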

I would recommend using Oniguruma. It is the regex engine that Ruby has used since 1.9, and it is much more powerful and faster than the one used in Ruby 1.8. It can be installed easily on Ruby 1.8.

Thank you so much for your reply. I left the last line, "f.close", out of my question. When I run your code I get the error "undefined (?...) sequence: /(?<!\\)"(.*?)(?<!\\)"" — prabu, May 02 '11 at 09:30
@prabu I saw you added the information that you use ruby 1.8.7. I updated my answer. — sawa, May 02 '11 at 14:24

score 0 · Answer 5 · answered Apr 30 '11 at 22:32

The data is in CSV format, but if all you want to do is grab the last field in the string, then do just that:

text =<<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT

require 'pp'
text.lines.map{ |l| l.split(',').last }

If you want to clean up the double-quotes and trailing line-breaks:

text.lines.map{ |l| l.split(',').last.gsub('"', '').chomp }
# => ["URL", "http://domain.com/xym.jpg", "http://domain.com/yum.jpg"]
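If you don't want the "URL" header in the result, one option is to drop the first line before mapping. This is only a sketch on the same sample data. It still assumes the URL is the last column and contains no commas, and the urls variable is just for illustration.

```ruby
# Sample data copied from the question.
text = <<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT

# drop(1) skips the header line; the rest is the same cleanup as above.
urls = text.lines.drop(1).map { |l| l.split(',').last.gsub('"', '').chomp }
p urls
```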
