
I have a large CSV with a large number of columns. I am trying to count the number of lines using

File.open(file).readlines.to_a.compact.count.to_i

It displays 57 although there are only 56 rows. Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?

Aleksei Matiushkin
user1719747
  • What is the source of the wrapped line? (Also: why `to_a`, as `readlines` already returns an `Array`, and why `compact`, since `readlines` can't include a `nil`, and why `to_i`, as `count` always returns a `Fixnum`) – Amadan Dec 01 '14 at 07:01
  • Why do you care about the number of *lines*? Isn't the number of *records* more meaningful? And if it is records you want, why not run the file through a CSV parser and count the records? – mu is too short Dec 01 '14 at 07:11
  • You need to show a minimal example of the file including the line in question. See [How to create a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). – the Tin Man Dec 01 '14 at 17:48

1 Answer


> Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?

You need to show an example of the incoming data if you want us to help beyond generic answers.

To fix the problem, you have to be able to identify the offending line. We can't help you there because it could look like anything. Making a wild guess, I'd say that one of the columns has an embedded newline in it, which splits the record across two lines.

If the file is a true CSV file, that column should be wrapped in double quotes, so you could search the file for lines that do NOT end with the kind of value expected in the last column, then read the next line, join the two, and rewrite the file. But, again, we have nothing to work with, because your file's format could be any of a huge number of different things.
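As a rough illustration of the "join the wrapped line back together" idea, here is a minimal sketch. It uses a different heuristic from checking the last column: it assumes fields are quoted with double quotes (and that literal quotes are doubled, as in standard CSV), so a record is complete only when it contains an even number of quote characters. The method name `join_wrapped_lines` and the sample data are made up for the example, and a real file may not follow these assumptions.

```ruby
# Merge physical lines into logical records by tracking double-quote parity.
# Assumption: standard CSV quoting, where an odd number of quotes so far
# means a quoted field is still open and continues on the next line.
def join_wrapped_lines(lines)
  records = []
  buffer = +''
  lines.each do |line|
    buffer << line
    # Even quote count: every quoted field opened so far has been closed.
    if buffer.count('"').even?
      records << buffer
      buffer = +''
    end
  end
  # A quote that never closes leaves data in the buffer; keep it rather than drop it.
  records << buffer unless buffer.empty?
  records
end

# Hypothetical data: the second record is wrapped onto two physical lines.
physical = ["id,comment\n", "1,\"wrapped\n", "value\"\n", "2,plain\n"]
logical = join_wrapped_lines(physical)
puts "#{physical.size} physical lines, #{logical.size} records"
```

This is only a text-level repair; it knows nothing about column counts or delimiters inside quotes.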

Your best bet is to use the CSV class that comes with Ruby and let it read the file, instead of trying to treat it like a plain text file. CSV files are text, but they are formatted to maintain the columns and rows, and the CSV class understands quoted fields, so it gives you a better chance of getting at the data.
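A minimal sketch of counting records with the CSV class, assuming the wrapped line really is a quoted field containing a newline. The helper name `count_csv_records` and the sample contents are invented for the example; the count includes the header row, if there is one.

```ruby
require 'csv'
require 'tempfile'

# Count parsed records rather than physical lines.
# CSV.foreach reads one record at a time, so the whole file is never held in memory.
def count_csv_records(path)
  count = 0
  CSV.foreach(path) { count += 1 }
  count
end

# Hypothetical sample: the second record has a newline inside a quoted field.
sample = "id,comment\n1,\"wrapped\nvalue\"\n2,plain\n"
csv_file = Tempfile.new(['sample', '.csv'])
csv_file.write(sample)
csv_file.close

physical_lines = File.foreach(csv_file.path).count
records = count_csv_records(csv_file.path)
puts "physical lines: #{physical_lines}, records: #{records}"
```

If the file is not well-formed CSV, the parser raises CSV::MalformedCSVError, which at least points at the line that needs attention.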


Looking at your code:

There are a number of ways to count the lines in a file. The easiest is:

`wc -l /path/to/file`.to_i 

if you're using *nix.

Using File.open(file).readlines.to_a is redundant, and it is neither fast nor scalable if your file is big.

  • readlines returns an array.
  • to_a returns an array.

Why turn the array into an array?

readlines loads the entire file into memory, then splits it on line endings into an array. That can be a lot slower than simply reading the file line by line and incrementing a counter, and "slurping" can make your program crawl if the file is larger than available memory.

See "Why is "slurping" a file not a good practice?" for more information.

compact removes nils from an array. readlines never returns any nils, so compact iterates over the whole array looking for something that cannot be there.

  • count returns an integer.
  • to_i converts the receiver to an integer.

In other words, to_i is turning an integer into an integer. Why?
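To make the redundancy concrete, here is a small sketch comparing the original chain with a plain `readlines.size` on a throwaway file. The temp file and variable names are only for illustration.

```ruby
require 'tempfile'

# Hypothetical three-line file used only for the comparison.
tmp = Tempfile.new('lines')
tmp.write("a\nb\nc\n")
tmp.close

lines = File.readlines(tmp.path)

# The chain from the question. File.open without a block also leaves the
# handle open until it is garbage collected.
verbose = File.open(tmp.path).readlines.to_a.compact.count.to_i

# Same result without the extra calls.
concise = lines.size

puts "to_a returns the same object: #{lines.to_a.equal?(lines)}"
puts "compact changes nothing:      #{lines.compact == lines}"
puts "verbose=#{verbose} concise=#{concise}"
```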

If you want to do it in Ruby instead of using wc -l, do something simple and fast:

lines_in_file = 0
File.foreach(some_file) { lines_in_file += 1 }

After running that, lines_in_file will contain the number of lines read. Memory won't be impacted and it'll run like blue blazes on huge files.
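The same counter can be written more compactly, because File.foreach returns an Enumerator when called without a block. The sketch below wraps it in a hypothetical `count_lines` helper; it still reads one line at a time.

```ruby
require 'tempfile'

# Count physical lines without loading the whole file into memory.
def count_lines(path)
  # With no block, File.foreach returns an Enumerator; count walks it line by line.
  File.foreach(path).count
end

# Hypothetical sample whose last line has no trailing newline.
tmp = Tempfile.new('count')
tmp.write("one\ntwo\nthree")
tmp.close

line_total = count_lines(tmp.path)
puts line_total
```

Note that this counts a final line that lacks a trailing newline, whereas wc -l counts newline characters, so the two can differ by one on such files. Either way, the result is physical lines, not CSV records.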

the Tin Man