
My data is in the following form:

1 440:0.033906222568727 730:0.0424739279722748 1523:0.0773048148348295 1893:0.0433930684646909

1 271:0.0646290650479301 405:0.0653366028581683 584:0.0744087075001463 770:0.0717824200677465

1 577:0.0679078686536282 761:0.0506946081073312

-1 440:0.0437614564467411 798:0.0370070258333617 831:0.0549176430011721 1681:0.0715035548706038 1963:0.102891965918849 2667:0.0461603813033019 2899:0.0672807783934756

I want output in the form of a table:

1 440 0.033906222568727 ......
1 271 0.0646290650479301 ...... 
1 271 0.0646290650479301 ......
1 577 0.0679078686536282 .........

I tried reading it with:

 x = pd.read_csv('rcv1_train.binary', sep=r"\s+|:", engine='python')

and got an error:

pandas.errors.ParserError: Expected 413 fields in line 134, saw 419. Error could possibly be due to quotes being ignored when a multi-char delimiter is used.
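The error message suggests that the rows do not all split into the same number of fields. A minimal sketch to check this, applying the same separator regex to two of the sample rows shown above (the helper name `count_fields` is made up for illustration):

```python
import re

# Two sample rows copied from the data above.
lines = [
    "1 577:0.0679078686536282 761:0.0506946081073312",
    "1 440:0.033906222568727 730:0.0424739279722748 "
    "1523:0.0773048148348295 1893:0.0433930684646909",
]


def count_fields(line):
    """Count the fields produced by splitting on whitespace or ':'."""
    return len(re.split(r"\s+|:", line.strip()))


field_counts = [count_fields(line) for line in lines]
print(field_counts)  # [5, 9]
```

Since each row carries a different number of index:value pairs, a parser that expects a fixed number of columns will fail on the first row that is wider than the ones it has already seen.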

DYZ
  • "_I have tried using_" - and what happened? – DYZ Apr 08 '18 at 06:12
  • I got an error: pandas.errors.ParserError: Expected 413 fields in line 134, saw 419. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. – Anand Mooga Apr 08 '18 at 06:17
  • Possible duplicate of [Handling Variable Number of Columns with Pandas - Python](https://stackoverflow.com/questions/15242746/handling-variable-number-of-columns-with-pandas-python) – DYZ Apr 08 '18 at 06:48

1 Answer


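You probably have bad data in line 134: the parser expected 413 fields there but found 419.

Try using error_bad_lines=False, which drops the offending lines instead of raising an error. (This option was deprecated in pandas 1.3 in favor of on_bad_lines='skip'.)

x = pd.read_csv('rcv1_train.binary', sep=r"\s+|:", engine='python', error_bad_lines=False)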
Rakesh
  • This gives me all NaN values and makes me skip almost half of the rows – Anand Mooga Apr 08 '18 at 06:27
  • That is because the data in the CSV is incorrect. Rows with incorrect data will be filled with NaN values – Rakesh Apr 08 '18 at 06:32
  • @Rakesh "If False, then these "bad lines" will [be] dropped from the DataFrame that is returned." The "bad" lines will _not_ be filled with Nans. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html – DYZ Apr 08 '18 at 06:44
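
As the comments above note, the rows have a variable number of index:value pairs, so dropping "bad" lines discards valid data. One possible alternative, sketched here under the assumption that every non-empty line has the form `label index:value index:value ...`, is to parse each line by hand into one (label, index, value) record per pair. The function names and the column names mentioned afterwards are illustrative, not part of any library.

```python
def parse_sparse_line(line):
    """Turn 'label idx:val idx:val ...' into a list of (label, idx, val) tuples."""
    label, *pairs = line.split()
    records = []
    for pair in pairs:
        index, value = pair.split(":")
        records.append((int(label), int(index), float(value)))
    return records


def parse_sparse_lines(lines):
    """Parse every non-empty line and collect the records in one flat list."""
    records = []
    for line in lines:
        if line.strip():  # skip blank lines
            records.extend(parse_sparse_line(line))
    return records


# Sample rows taken from the question (the second one shortened to two pairs).
sample = [
    "1 577:0.0679078686536282 761:0.0506946081073312",
    "",
    "-1 440:0.0437614564467411 798:0.0370070258333617",
]
rows = parse_sparse_lines(sample)
for row in rows:
    print(row)
```

To read the real file, the same function can be given an open file object, for example `parse_sparse_lines(open('rcv1_train.binary'))`, and the resulting list can be passed to `pd.DataFrame(rows, columns=['label', 'index', 'value'])` to get one table row per pair. The layout also resembles the svmlight/LIBSVM sparse format; if that is what the file is, scikit-learn's `load_svmlight_file` may be worth a look as well.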