I'm running a simple script that sorts a file in ascending numeric order. The files I am working with are CSV files of about 65 GB each.

When I count the lines in the original file (before the sort) with the Linux command wc -l claims.csv, I get:
143955892 claims.csv

Running the code below produces the file claims_v2.csv.

The problem is that the output file has more lines than the input file.

When I run wc -l claims_v2.csv I get:
143957232 claims_v2.csv

Why is my sorting code creating 1340 more lines than the original?

I took a look at this: This

Could the error_bad_lines=False argument be causing this?

import pandas as pd
import numpy as np

inputpath = '../Inputs/'
outputpath = '../Outputs/'

dtype_claim = {
    'patent_id': 'str',
    'sequence': 'object',
    'text': 'str',
}


def runsort():
    print('Running claims_v2.csv')
    columns = ['patent_id', 'sequence', 'text']
    df = pd.read_csv(inputpath + 'claims.csv', dtype=dtype_claim,
                     usecols=columns, encoding='utf-8',
                     engine='python', error_bad_lines=False)
    df['sequence'] = pd.to_numeric(df['sequence'], errors='coerce')
    df['sequence'] = df['sequence'].fillna(-1)
    df['sequence'] = df['sequence'].astype('int64')
    df = df.sort_values(by=['patent_id', 'sequence'], ascending=(True, True))

    print('Exporting to CSV')
    df.to_csv(outputpath + 'claims_v2.csv', index=False)


runsort()
Barmar

1 Answer

Consider adding this line at several points in the code to determine how many rows are being read and which step causes the count to grow:

print( f"df has {len(df)} lines." )

That said, I suspect carriage return (CR) characters are at play here. A quick test showed me that pd.read_csv() treats a CR as a newline, but wc -l does not.
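To illustrate the kind of mismatch I mean, here is a minimal sketch. It uses Python's standard csv module rather than pandas, so treat it as an analogy: pandas' parsers may handle a bare CR differently depending on the engine. The sample data and the helper names (count_newline_bytes, count_csv_records) are made up for this example. The second sample shows the opposite case, where a line feed inside a quoted field makes a newline count larger than the record count.

```python
import csv
import io


def count_newline_bytes(data: bytes) -> int:
    """Count LF bytes, which is what `wc -l` reports."""
    return data.count(b"\n")


def count_csv_records(data: bytes) -> int:
    """Count records as Python's csv parser sees them (header included)."""
    # newline="" hands line endings to the csv module untranslated.
    text = io.StringIO(data.decode("utf-8"), newline="")
    return sum(1 for _ in csv.reader(text))


# Hypothetical sample 1: records separated by bare CRs, one final LF.
cr_sample = b"patent_id,sequence,text\r1,1,alpha\r2,1,beta\n"

# Hypothetical sample 2: a quoted field that contains an LF.
quoted_sample = b'patent_id,sequence,text\n1,1,"first\nsecond"\n2,1,plain\n'

for name, sample in [("cr_sample", cr_sample), ("quoted_sample", quoted_sample)]:
    print(name,
          "newline bytes:", count_newline_bytes(sample),
          "csv records:", count_csv_records(sample))
```

Either way, the number from wc -l and the number of rows a CSV parser sees do not have to agree, so comparing len(df) before and after the sort is the more reliable check.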

So, on second thought, you may want to check your claims.csv for carriage returns. One option is to open it with vi in binary mode:

vi -b claims.csv

You can find a carriage return in vi using the following keystrokes:

/{Ctrl-V}{Ctrl-M}{Enter}

You can find subsequent matches by pressing n. Sorry for the vi lesson, I'm sure you already know how to find these characters using your favorite editor.
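If the file is too large to open in an editor, a chunked scan over the raw bytes is one alternative. The following is a rough sketch under my own assumptions: the function name count_line_endings and the 1 MiB chunk size are arbitrary choices, and the path in the usage comment is just the file name from the question.

```python
def count_line_endings(path, chunk_size=1 << 20):
    """Count LF, CR and CRLF byte sequences without loading the whole file."""
    lf = cr = crlf = 0
    prev_ended_with_cr = False
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            lf += chunk.count(b"\n")
            cr += chunk.count(b"\r")
            crlf += chunk.count(b"\r\n")
            # A CRLF pair can be split across two consecutive chunks.
            if prev_ended_with_cr and chunk.startswith(b"\n"):
                crlf += 1
            prev_ended_with_cr = chunk.endswith(b"\r")
    # lone_cr: carriage returns that are not part of a CRLF pair.
    return {"lf": lf, "cr": cr, "crlf": crlf, "lone_cr": cr - crlf}


# Example usage (path taken from the question):
# print(count_line_endings("claims.csv"))
```

The "lf" value should match what wc -l prints, and a non-zero "lone_cr" value would suggest that bare carriage returns are present in the data.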

Mark
  • Thanks! Yes, wc -l was returning a lot more lines than len(df). When I printed len(df) before and after the sort, the lengths were exactly the same. I couldn't open the file in vi or vim because it would crash; I think the file is too big. I should have checked len(df). I didn't know that wc -l doesn't count carriage returns. – theEconCsEngineer Jul 29 '20 at 20:12