0

I am processing a csv file and before that I am getting the row count using the below code.

total_rows=sum(1 for row in open(csv_file,"r",encoding="utf-8"))

The code has been written with the help given in this link. However, the total_rows doesn't match the actual number of rows in the csv file. I have found an alternative to do it but would like to know why is this not working correctly??

In the CSV file, there are cells with huge text and I have to use the encoding to avoid errors reading the csv file.

Any help is appreciated!

Eswar
  • 1,201
  • 19
  • 45
  • 1
    When you have _cells with huge text_, csv parsers such as `pandas.read_csv` will read it correctly as opposed to `open` which reads the file line by line, not considering the _huge text_ as one. – Chris Mar 15 '19 at 08:11
  • @Chris yes. I have found alternate ways to read the rows correctly. But what is wrong with the above code that was suggested for using to count rows?? – Eswar Mar 15 '19 at 08:12
  • 3
    As I said, suppose your _huge text_ is two or more lines. In csv, this is considered as **one cell** which will be inside a single line. Nevertheless, `open` doesn't know it should be considered as one cell, and simply returns count of lines. – Chris Mar 15 '19 at 08:14
  • 1
    How many lines are in your file? Are you sure there is not a mix of `\r` and `\n` that might give your a wrong count? or some `\n` or `\r` within some cell text? – Allan Mar 15 '19 at 08:14
  • I have run your code on a large file and it was providing me the correct output: `python count_row.py 1715181568` same result with `wc -l` – Allan Mar 15 '19 at 08:17
  • @Eswar can you provide us more info related what count you are getting what is apected – rahul singh Mar 15 '19 at 08:17
  • Counting the lines in the file (determined by \n or \r\n) is not the same as counting the csv rows because text in a value can include \n. – DisappointedByUnaccountableMod Mar 15 '19 at 08:19
  • @Eswar go with pandas dataframe read the csv file like df=pd.read_csv("yourFile.csv") and then df.shape it will give all rows with columns number – rahul singh Mar 15 '19 at 08:21
  • @rahulsingh I have the correct way(with and without using pandas) to rows but I want to know why the above code isn't working. Chris pointed out right. – Eswar Mar 15 '19 at 08:59
  • Just wanted to stress the importance to read the file with the correct encoding, as otherwise results could be completely off – elachell Mar 21 '20 at 16:59

2 Answers2

3

Let's assume you have a csv file in which some cell's a multi-line text.

$ cat example.csv
colA,colB
1,"Hi. This is Line 1.
And this is Line2"

Which, by look of it, has three lines and wc -l agrees:

$ wc -l example.csv
3 example.csv

And so does open with sum:

sum(1 for row in open('./example.csv',"r",encoding="utf-8"))
# 3

But now if you read is with some csv parser such as pandas.read_csv:

import pandas as pd

df = pd.read_csv('./example.csv')
df
   colA                                    colB
0     1  Hi. This is Line 1.\nAnd this is Line2

The other alternative way to fetch the correct number of rows is given below:

with open(csv_file,"r",encoding="utf-8") as f:
     reader = csv.reader(f,delimiter = ",")
     data = list(reader)
     row_count = len(data)

Excluding the header, the csv contains 1 line, which I believe is what you expect. This is because colB's first cell (a.k.a. huge text block) is now properly handled with the quotes wrapping the entire text.

Eswar
  • 1,201
  • 19
  • 45
Chris
  • 29,127
  • 3
  • 28
  • 51
1

I think that the problem in here is because you are not counting rows, but counting newlines (either \r\n in windows or \n in linux). The problem lies when you have a cell with text where you have newline character example:

1, "my huge text\n with many lines\n"
2, "other text"

Your method for data above will return 4 when accutaly there are only 2 rows

Try to use Pandas or other library for reading CSV files. Example:

import pandas as pd
data = pd.read_csv(pathToCsv, sep=',', header=None);
number_of_rows = len(df.index) # or df[0].count()

Note that len(df.index) and df[0].count() are not interchangeable as count excludes NaNs.

wdudzik
  • 1,264
  • 15
  • 24
  • I was able to get the correct number of rows without using pandas but are you suggesting that open function is counting the number of lines in each cell as well? – Eswar Mar 15 '19 at 08:56
  • 1
    Yes, because open just reads file. It is not taking into account that this is CSV. – wdudzik Mar 15 '19 at 09:01
  • Chris summed what you're saying. Thank you. – Eswar Mar 15 '19 at 09:02