
I have a .text file in the following format, where the fields (index number, name, and message) are separated by tabs (\t):

712 ben     Battle of the Books
713 james   i used to be in TOM
714 tomy    i was in BOB once
715 ben Tournaments of Minds
716 tommy    Also the Lion in the upcoming school play
717 tommy   Can you guess
718 tommy    P
...

which I read with read_csv into a data frame:

 chat = pd.read_csv("f.text", sep = "\t", header = None, usecols = [2])

But the data frame has only 9812 rows, while the original file has more than 12428 rows (only 21 of which are empty lines). This is quite weird. Does anyone have any idea why? Thanks.

smci
  • 1
    Can you post a download link to your data, difficult to answer here without posting guesses which is counter-productive – EdChum Feb 24 '16 at 09:34
  • Very weird. Maybe the `lineterminator` parameter of [`read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) is necessary. Or you can try adding `index_col=None`. How do you check the length of `df`? With `print len(df)`? – jezrael Feb 24 '16 at 09:43
  • @jezrael Just `print df`; it shows the row count under the table. Same result with `len(df)` –  Feb 24 '16 at 10:02
  • Hmmm, interesting. If you omit `usecols`, is the length still wrong? – jezrael Feb 24 '16 at 10:11
  • @jezrael Yes. When I print line by line, I get `12428` lines. –  Feb 24 '16 at 11:32
  • 1
    Hmmm, try skipping rows, like `chat = pd.read_csv("f.text", skiprows=9810, sep = "\t", header = None, usecols = [2])`, then maybe check the columns with `print df.columns` and the index with `print df.index` – jezrael Feb 24 '16 at 11:35
  • @jezrael And I got the remaining rows! What happened!? –  Feb 24 '16 at 11:39

1 Answer


I think you need to add the `quoting` parameter:

import csv
import pandas as pd

# QUOTE_NONE treats quote characters in the data as ordinary text
chat = pd.read_csv("f.text", sep="\t", header=None, usecols=[2], quoting=csv.QUOTE_NONE)
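
As for why lines can go missing: by default `read_csv` uses `quoting=csv.QUOTE_MINIMAL` with `"` as the quote character, so a message that begins with a double quote opens a quoted field, and everything up to the next double quote, including tabs and newlines, ends up in that single field. I can't see the actual file, so this is only a likely cause. The sample below is made up (it is not the asker's data) and is just a minimal sketch that reproduces the effect:

```python
import csv
import io

import pandas as pd

# Made-up sample (an assumption, not the asker's file): the message on
# line 2 starts with a double quote, and the next double quote only
# appears at the end of line 4.
sample = (
    '1\tben\thello\n'
    '2\tjames\t"unclosed quote\n'
    '3\ttommy\tlost line\n'
    '4\tben\tend"\n'
    '5\ttommy\tlast\n'
)

# Default quoting: lines 2-4 are merged into one row.
df_default = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# QUOTE_NONE: quote characters are ordinary text, so every line is a row.
df_raw = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                     quoting=csv.QUOTE_NONE)

print(len(df_default), len(df_raw))
```

Here the default read gives 3 rows and the `QUOTE_NONE` read gives 5. Comparing `len(df)` with the number of lines in the raw file, as was done in the comments, is a quick way to spot this kind of merge.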
jezrael
  • 4
    jezrael can you actually explain why this works, i.e. why the unquoted read dropped lines? Otherwise it's not a reusable resource to other users. – smci Oct 19 '19 at 05:37
  • 5
    OMG, this saved me! It looks like the default behavior for read_csv() expects everything to be wrapped in quotes. But if it is a tab separated file with no quotes, then you need to specify such, otherwise the data parsing goes awry – axme100 Mar 09 '21 at 00:42