0

I have tried to encode the Unicode characters in the file that I am passing into the pandas dataframe. But the number of unique row counts with df.column.value_counts() that I am getting in Jupyter notebook is not matching the excel row counts of the same file(after removing duplicate values).

How do I fix the issue?

I have loaded a text file(tab separated) and converted that into a pandas dataframe using encoding = 'ISO-8859-1'. The dataframe was created with unique row counts as 66370 for one of the columns.

When I applied 'Remove duplicates' on the desired column on the original csv file(I was using MS Excel to read the export file), the number of unique values = 66368.

There is a difference of 2 in these 2 files- the pandas dataframe in Jupyter Notebook - pandas unique row counts(66370) and the excel version of the row counts(66368).

I understand this could be an encoding issue but I am not able to fix the same.

Can anyone help please?

df = pd.read_csv('csv_file.csv', encoding= 'ISO-8859-1')
df.column1.value_counts()

I am expecting equal results in the excel version of unique row_counts and df.column1.value_counts().

Actual results are showing a difference of 2 in the row counts by these 2 methods.

  • Hi. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Jan 21 '19 at 07:27
  • I would recommend that you add a unique column (could be just serial numbers) to both and use that to check *which rows* are missing - then you can see how Pandas reads it and see where the difference is coming from – Josh Friedlander Jan 21 '19 at 08:32
  • The fact that you're trying to read a `csv` file in excel and then compare it with your pandas output is incorrect. You need to open the `rounds2.csv` file in a text editor like `Notepad++` or `Ultraedit` to verify the actual line count (not columns). – ParvBanks Jan 21 '19 at 09:11
  • Excel treats each line of a csv file as a new row, which might not be the case and the actual rows might be lesser than what's excel is showing you. – ParvBanks Jan 21 '19 at 09:12
  • @ParvBanks: Beware: line count and row count can be quite different in a csv file: fields can contain new lines which adds lines to the text file. – Serge Ballesta Jan 21 '19 at 09:30
  • Can some fields spread on multiple lines? AFAIK, Excel consideres quote enclosed fields while you did not ask for it in your `read_csv` call. – Serge Ballesta Jan 21 '19 at 09:39
  • @jezrael- In case the question is not clear, let me know and I will try clarifying. I am not sure how else I can ask, given I have written the code and my intention of what I want to achieve, and also what is the problem.. Since there is no specific error that I am getting, I can't post the error which could be easily understood. I can't post the dataset here either which people could refer to and identify the issue, since stack overflow doesn't provide that option. – Prashant Mishra Jan 21 '19 at 15:22
  • I got the right answer after a marathon struggle.. and guess what it was a very small change- the encoder ! :) – Prashant Mishra Jan 21 '19 at 16:26

3 Answers3

0

-It might happen that you are reading the header column as well and do note that pandas starts indexing with Zero. Could you please retry with below and let me know the result

df = pd.read_csv('rounds2.csv', encoding= 'ISO-8859-1')
print(len(df.column1.unique()))
print(df.shape)

Please let me know outputs of both , also you you try to open the file in notepad++ and reconcile the numbers.

let me know your output and then i will edit my answer accordingly

Puneet Sinha
  • 1,041
  • 1
  • 10
  • 23
0

Ok Guys. I have found the answer finally !! After about 6 hours of struggle, I finally figured out the right encoding technique-- the right encoder for my problem was 'ANSI'

so the only change to my code was the encoder below:

df = pd.read_csv('csv_file.csv', encoding= 'mbcs')

I found the answer by going through this link: Get encoding of a file in Windows

The right encoder is here: https://docs.python.org/3/library/codecs.html#standard-encodings

0

This problem could also arise when the content of some rows have hidden \ns in them. So editors like vim shows them on different lines but they are actually a single row as per the dataframe.

A.R.K.S
  • 1,692
  • 5
  • 18
  • 40