
I have a .csv file with 2741 rows and 279 columns.

When I tried loading that file in python using pd.read_csv() this is what I am getting :

>>> df = pd.read_csv("preprocessed_data.csv")
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2882: DtypeWarning: Columns (1,2,3) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)

>>> df.shape
(18696, 279)

Clearly the number of rows has gone from 2741 to 18696, which is absurd.

So I checked for duplicate rows as follows:

>>> df[df.duplicated()].shape
(15987, 279)

Which means that out of those 18696 rows, 15987 are duplicates. Why do these duplicates appear after loading the csv file, and how can I resolve this?
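One sanity check (a sketch; the helper `compare_counts` is illustrative, not from the original post) is to compare the raw line count of the file with the number of rows pandas parses. If the two agree, the extra rows physically exist in the file (a bug when creating it); if they disagree, quoting or embedded newlines are confusing the parser:

```python
import pandas as pd

def compare_counts(path):
    """Return (raw data lines, parsed rows) for a CSV with one header line."""
    with open(path) as f:
        n_lines = sum(1 for _ in f)
    n_rows = len(pd.read_csv(path))
    return n_lines - 1, n_rows
```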

PATHIK GHUGARE
  • Are these duplicated rows at the end? Maybe you have content there in the input which you just don't see? – Timus Apr 01 '22 at 09:13
    Try to open the .csv file with a text parser like notepad or vs code and check the number of lines. My guess is that are a bunch of empty (comma separated) lines after line 2742 – braml1 Apr 01 '22 at 10:09
  • how did you create these files? Maybe it was appending new lines instead of removing previous content. – furas Apr 01 '22 at 11:08
  • No @Timus, first duplicate occurred at row no. 7 – PATHIK GHUGARE Apr 01 '22 at 14:32
  • I opened that .csv file in VScode @braml1 and a lot of duplicates are present there as well. In columns such as `Work` and `About`, the text contains some commas, so could that be the reason for this issue? – PATHIK GHUGARE Apr 01 '22 at 14:36
  • @furas I had a panda's dataframe with those values shown in the screenshot, so I had saved it into a .csv file using `.to_csv()` method – PATHIK GHUGARE Apr 01 '22 at 14:38
  • But how did you create it in the first place? Did you use mode `append` when saving to the file? If you ran `to_csv()` many times with mode `append`, it would add the same data many times. Or maybe you created the df with duplicated data before saving. So the duplicates may come from a mistake in the code that creates these files. For now, you can open each file, use `~` in `df[ ~df.duplicated() ]` to keep only unique values, and then save it back or use it to create a new dataframe. – furas Apr 01 '22 at 21:26
  • @furas No, there were no duplicates before calling `to_csv()`, but the answers in this [question](https://stackoverflow.com/questions/54217165/pandas-to-csv-converts-str-column-to-intor-float) seem to work, i.e. the issue actually occurred while converting the DataFrame into a csv file. – PATHIK GHUGARE Apr 02 '22 at 11:11
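The append-mode pitfall described in the comments can be reproduced with a small made-up frame (a sketch, not the actual data):

```python
import os, tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

path = os.path.join(tempfile.mkdtemp(), "demo.csv")
df.to_csv(path, index=False)                           # first save: 2 data rows
df.to_csv(path, mode="a", header=False, index=False)   # accidental second save appends
print(pd.read_csv(path).shape)  # (4, 2) -- every row now appears twice
```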

1 Answer


In my view, the problem occurs when you create these files, not when you load them.

Maybe you called .to_csv() multiple times with mode append, which would add the same values many times.

For now, you can use ~ in df[ ~df.duplicated() ] to keep only the unique rows:

df = df[ ~df.duplicated() ] 
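An equivalent, arguably more idiomatic spelling uses the built-in drop_duplicates() (shown here on a small made-up frame):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 3, 4]})
deduped = df.drop_duplicates().reset_index(drop=True)  # same effect as df[~df.duplicated()]
print(deduped.shape)  # (2, 2)
```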
furas