I have a file formatted by lines like this (I know it's a terrible format, I didn't write it):

id: 12345 synset: word1,word2

I want to read the entire file and check to see if every line is correct without having to look line by line.

I've looked into File and Regex, but couldn't find what I need. I tried using File.read to read the entire file at once and then the `m` modifier so the regex would check multiple lines, but it's not working the way I anticipated (perhaps it's not what I need).

p.s. Ruby newbie :)

bli00
  • Did you try using `each_line` as outlined in [this question](http://stackoverflow.com/questions/6012930/how-to-read-lines-of-a-file-in-ruby)? – Steven Schobert Feb 05 '17 at 02:14
  • Please edit to explain in detail the rules for determining if a line is "correct". Must the first two characters be `"id"`, followed by a colon, or could it be any two or more letters? Must they be lower case? Does `12345` represent any positive integer (or any integer or any non-negative integer) or must it be a five-digit positive integer or must it be the literal `12345`? Must there be exactly one space between the colon and non-negative integer? And so on. Please edit and be precise in listing the rules for each line being "correct". – Cary Swoveland Feb 05 '17 at 05:19

2 Answers

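Assuming your file always ends with a newline, this should work. Note that in Ruby `^` and `$` match at every line boundary, so the pattern is anchored to the whole string with `\A` and `\z` instead:

/\A(id: \d+ synset: \w+,\w+\n)+\z/

The full Ruby:

# Read the whole file into one string
content = File.read('myfile.txt')
# \A and \z anchor the match to the start and end of the entire string
puts 'file is valid!' if content =~ /\A(id: \d+ synset: \w+,\w+\n)+\z/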
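
One caveat worth illustrating: in Ruby, `^` and `$` match at the start and end of every line, not of the whole string, so a whole-file check is usually anchored with `\A` and `\z`. The sketch below compares the two on small in-memory samples. The constant names, the `matches?` helper and the sample strings are all made up for illustration.

```ruby
# Line-anchored pattern: ^ and $ match at any line boundary in Ruby.
LINE_ANCHORED = /^(id: \d+ synset: \w+,\w+\n)+$/
# String-anchored pattern: \A and \z match only at the very start and end.
STRING_ANCHORED = /\A(id: \d+ synset: \w+,\w+\n)+\z/

# Hypothetical sample data standing in for the file contents.
GOOD_SAMPLE = "id: 1 synset: a,b\nid: 2 synset: c,d\n"
BAD_SAMPLE  = "this line is malformed\nid: 2 synset: c,d\n"

# Returns true when the pattern matches somewhere in the text.
def matches?(pattern, text)
  !(pattern =~ text).nil?
end

puts matches?(LINE_ANCHORED, GOOD_SAMPLE)    # true
puts matches?(LINE_ANCHORED, BAD_SAMPLE)     # true: the malformed first line is skipped
puts matches?(STRING_ANCHORED, GOOD_SAMPLE)  # true
puts matches?(STRING_ANCHORED, BAD_SAMPLE)   # false
```

The line-anchored version accepts the bad sample because the match is allowed to start at the second line.
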
eiko
  • Would this still work for checking that each line matches the format? – Steven Schobert Feb 05 '17 at 02:18
  • sorry about that! `g` is not needed in ruby because =~ will search the entire string by default. i've updated the answer to reflect this. @StevenSchobert this checks to make sure every single line matches the format, but it won't say which do and which don't, it's all or nothing. the regex for the format of an individual line would be `/id: \d+ synset: \w+,\w+/` – eiko Feb 05 '17 at 02:52
  • what if I wanted the words to be in any format separated by commas (may or may not contain specific characters)? e.g. `synset: slash/,-,...` in this case the words are `slash/`, `-`, and `...`. could i do something like `/^(id: \d+ synset: (\w\-\/\.)+)+$/m`? – bli00 Feb 05 '17 at 04:16
  • and you forgot to delete the `g` in the first line of code. – bli00 Feb 05 '17 at 04:19
  • thestateofmay, the question contained in your last comment is sufficiently different than the present question that it should be posted as a separate question. – Cary Swoveland Feb 05 '17 at 05:33
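
On the follow-up about words containing other characters: `(\w\-\/\.)+` is a group that requires a word character, a hyphen, a slash and a dot in that exact order, whereas a character class in square brackets accepts any one of them. Below is a rough sketch, assuming a "word" is any run of characters other than commas and whitespace. That rule is a guess, since the comment does not pin it down, and `LOOSE_LINE` and `loose_line?` are hypothetical names.

```ruby
# Assumption: a word is one or more characters that are neither a comma nor whitespace.
LOOSE_LINE = /\Aid: \d+ synset: [^,\s]+(?:,[^,\s]+)*\z/

# Returns true when a single line (without its trailing newline) fits the loose format.
def loose_line?(line)
  !(LOOSE_LINE =~ line).nil?
end

puts loose_line?("id: 1 synset: slash/,-,...")  # true
puts loose_line?("id: 1 synset: a,,b")          # false: empty word between the commas
```
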

You can use this regex to check each line of the file: `^id:\s*\d+\s+synset:\s*(?:\w+,)*\w+$`. You can try the following code, but I don't know any Ruby; I just searched and tested a little. It might work.

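# File.foreach reads the file line by line and closes it when done
File.foreach('file.txt').with_index(1) do |line, line_num|
  # Report every line that does not match the expected format
  unless /^id:\s*\d+\s+synset:\s*(?:\w+,)*\w+$/.match(line)
    puts "Line #{line_num} is incorrect"
  end
end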
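
If it helps to collect the failures rather than print them as they are found, the same per-line check can be wrapped in a small method. This is only a sketch: `invalid_line_numbers`, `LINE_FORMAT` and `SAMPLE_TEXT` are hypothetical names, and the method takes a string so it can be tried without a file on disk (pass `File.read('file.txt')` for a real file).

```ruby
# Format of a single line, based on the regex in this answer.
LINE_FORMAT = /\Aid:\s*\d+\s+synset:\s*(?:\w+,)*\w+\z/

# Returns the 1-based numbers of all lines that do not match LINE_FORMAT.
def invalid_line_numbers(text)
  text.each_line.with_index(1)
      .reject { |line, _number| LINE_FORMAT.match(line.chomp) }
      .map { |_line, number| number }
end

# Hypothetical sample: the second line is missing the colon after "id".
SAMPLE_TEXT = "id: 12345 synset: word1,word2\nid 7 synset: oops\nid: 8 synset: a,b,c\n"
p invalid_line_numbers(SAMPLE_TEXT)  # => [2]
```

An empty result means every line passed, which also answers the all-or-nothing question.
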
Nicolas