1

I'm trying to read a dataset and parse it into the data I need. The file consists of lines like this:

id: 1234567 synset: test,exam

I then want to extract the id number and the synset words. In this case, I want 1234567 and test,exam.

Here's what I've come up with, but I'm sure there are better ways.

File.open(synsets_file, "r") do |f|
  f.each_line do |line|
    id = line.split[1].to_i
    nouns = line.split[3]
    # do things with id and nouns
  end
end
Yu Hao
bli00

4 Answers

1

Your example is fine. You could call split only once and destructure the result:

File.foreach(synsets_file) do |line|
  _, id, _, nouns = line.chomp.split(/\s+/, 4)
  # do things with id and nouns
end

Passing 4 as the second argument (the limit) to split ensures that nouns isn't split further if it contains spaces.
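
To see what the limit changes, here is a minimal sketch. The helper name `parse_synset_line` and the second sample line (with a space inside the synset field) are made up for illustration, and the `to_i` conversion is borrowed from the question's code.

```ruby
# Hypothetical helper: splits a line of the form "id: <number> synset: <words>"
# into an integer id and the raw synset string.
def parse_synset_line(line)
  # The limit of 4 caps the result at four fields, so everything after
  # "synset:" stays in one piece even if it contains spaces.
  _, id, _, nouns = line.chomp.split(/\s+/, 4)
  [id.to_i, nouns]
end

p parse_synset_line("id: 1234567 synset: test,exam\n")   # => [1234567, "test,exam"]
p parse_synset_line("id: 42 synset: ice cream,gelato\n") # => [42, "ice cream,gelato"]
```

Without the limit, the second line would yield "ice" for nouns, because the split would also break on the space inside "ice cream".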

Eric Duminil
0

If you will be reading large files, it is better to use something like File.foreach than to read the entire file into memory:

File.foreach(synsets_file) do |line|
  id = line.split[1].to_i
  nouns = line.split[3]
  # do things with id and nouns
end

More information can be found in this SO post. The third answer down discusses "slurping" a file and why it's not a good idea.
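
As a rough illustration of the difference, the sketch below contrasts slurping with streaming. The method names `ids_by_slurping` and `ids_by_streaming` and the sample data are hypothetical; only `File.read`, `File.foreach`, and `Tempfile.create` are standard Ruby calls.

```ruby
require "tempfile"

# Slurping: File.read loads the whole file into one String before parsing.
def ids_by_slurping(path)
  File.read(path).each_line.map { |line| line.split[1].to_i }
end

# Streaming: File.foreach yields one line at a time and closes the file for us.
def ids_by_streaming(path)
  ids = []
  File.foreach(path) { |line| ids << line.split[1].to_i }
  ids
end

# Made-up sample data in the question's format, written to a temporary file.
Tempfile.create("synsets") do |f|
  f.write("id: 1 synset: test,exam\nid: 2 synset: dog,hound\n")
  f.flush
  p ids_by_slurping(f.path)
  p ids_by_streaming(f.path)
end
```

Both calls return the same ids; the streaming version simply never holds the full file contents in memory at once.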

Edit: Removed JSON portion of answer.

Community
trueinViso
0

Use a regular expression with named captures:

File.open(synsets_file, "r") do |f|
  f.each_line do |line|
    /^id: (?<id>.*) synset: (?<nouns>.*)/ =~ line.chomp

    puts id
    puts nouns

    # ...
  end
end
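
A slightly stricter variant, sketched under the assumption that the id is always a run of digits and that malformed lines should be skipped. The constant `SYNSET_LINE` and the helper `extract_synset` are hypothetical names, not part of the answer above.

```ruby
# Assumed line format, taken from the question: "id: <digits> synset: <words>"
SYNSET_LINE = /\Aid: (?<id>\d+) synset: (?<nouns>.*)\z/

# Hypothetical helper: returns [id, nouns], or nil if the line does not match.
def extract_synset(line)
  m = SYNSET_LINE.match(line.chomp)
  return nil unless m
  [m[:id].to_i, m[:nouns]]
end

p extract_synset("id: 1234567 synset: test,exam\n") # => [1234567, "test,exam"]
p extract_synset("not a synset line\n")             # => nil
```

Using `match` and checking for nil avoids silently working with nil values when a line does not follow the expected format.
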
akuhn
-1

Try storing the data in JSON format instead; it will be easier to parse. Then you can do something like this:

require 'json'
file = File.read('file-name-to-be-read.json')
data_hash = JSON.parse(file)
puts data_hash['id'] # gives 1234567
hvardhan
  • You can try splitting the lines on spaces: `line.split(" ")`. This way, you will get an array. – hvardhan Feb 04 '17 at 05:45