Why does my code that sorts a column in ascending order produce more lines?

Question

I'm running a simple script that just sorts a file in ascending numeric order. The files I am working with are CSV files of about 65 GB each.

When I count the lines in the original file (before the sort) with the Linux command wc -l claims.csv, I get:
143955892 claims.csv

After I run the code below, it produces the file claims_v2.csv.

My problem is that the output has more lines than the input.

When I run wc -l claims_v2.csv I get:
143957232 claims_v2.csv

Why is my sorting code creating 1340 more lines than the original?

I took a look at this: This

Could it be the error_bad_lines=False that's causing this?

import pandas as pd
import numpy as np
inputpath = '../Inputs/'
outputpath = '../Outputs/'

dtype_claim = { 'patent_id':'str',
                'sequence':'object', 
                'text':'str', 
               }
def runsort():
    print('Running claims_v2.csv')
    columns = ['patent_id', 'sequence', 'text']
    df = pd.read_csv(inputpath + 'claims.csv', dtype=dtype_claim, usecols=columns, encoding='utf-8', 
                        engine='python', error_bad_lines=False)
    df['sequence'] = pd.to_numeric(df['sequence'], errors='coerce')
    df['sequence'] = df['sequence'].fillna(-1)
    df['sequence'] = df['sequence'].astype('int64')
    df = df.sort_values(by=['patent_id', 'sequence'], ascending = (True, True))

    print('Exporting to CSV')
    df.to_csv(outputpath + 'claims_v2.csv', index = False)



runsort()

Are there any commas in the `text` portion that might get mistaken as a split point? For files like this, I like to use some "off" delimiter, like a grave accent (`). — Mark Moretto, Jul 29 '20 at 01:06

score 2 · Accepted Answer · answered Jul 29 '20 at 01:32

Consider adding this line at several points in the code to see how many rows are being read and which step is causing the count to grow:

print( f"df has {len(df)} lines." )

That said, I suspect the carriage return (CR) character is at play here. A quick test showed me that pd.read_csv() will consider the CR a newline, but wc -l will not.
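To make the mismatch concrete, here is a minimal sketch of how a count of newline bytes (which is what wc -l reports) can disagree with the number of records a CSV parser sees. It uses Python's standard csv module as a stand-in for pd.read_csv(), which is an assumption: pandas may treat line terminators differently depending on the engine. The sample data is made up for illustration and is not taken from claims.csv.

```python
import csv
import io

# Hypothetical sample data, not taken from claims.csv.
# One quoted field contains an embedded newline.
EMBEDDED_NEWLINE = (
    b'patent_id,sequence,text\n'
    b'1,1,"first claim"\n'
    b'1,2,"line one\nline two"\n'
    b'2,1,"plain"\n'
)

# Hypothetical rows terminated by bare carriage returns instead of LF.
BARE_CR = b'a,b\r1,2\r3,4\r'


def count_newlines(data: bytes) -> int:
    """Count LF bytes, which is what `wc -l` reports."""
    return data.count(b"\n")


def count_records(data: bytes) -> int:
    """Count parsed CSV records (header included) with the stdlib csv module."""
    # newline="" leaves line endings untranslated so the csv module sees them.
    text = io.StringIO(data.decode("utf-8"), newline="")
    return sum(1 for _ in csv.reader(text))


for name, data in [("embedded newline", EMBEDDED_NEWLINE), ("bare CR", BARE_CR)]:
    print(f"{name}: wc -l style = {count_newlines(data)}, records = {count_records(data)}")
```

In the first sample there are 5 LF bytes but only 4 records, because one newline sits inside a quoted field. In the second there are 0 LF bytes but 3 records. Either effect can make wc -l and len(df) disagree, so wc -l alone is not a reliable row count for a CSV with free-text columns.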

So, on second thought, you may want to review your claims.csv for carriage returns. You may consider using vi in binary mode:

vi -b claims.csv

You can find a carriage return in vi using the following keystrokes:

/{Ctrl-V}{Ctrl-M}{Enter}

You can find subsequent matches by pressing n. Sorry for the vi lesson, I'm sure you already know how to find these characters using your favorite editor.
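If the file is too large to open comfortably in an editor, a chunked scan in Python is one alternative. The sketch below is only an illustration: count_line_endings is a hypothetical helper name, and the 1 MiB chunk size is an arbitrary choice. It reads the file in binary mode and tallies LF, CRLF, and bare CR line endings without loading the whole file into memory.

```python
def count_line_endings(path, chunk_size=1 << 20):
    """Tally LF, CRLF, and bare CR occurrences in a file, reading it in chunks."""
    counts = {"lf": 0, "crlf": 0, "cr": 0}
    pending_cr = False  # True when the previous chunk ended with a CR
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            if pending_cr:
                # Decide whether the CR carried over was half of a CRLF pair.
                if chunk.startswith(b"\n"):
                    counts["crlf"] += 1
                    chunk = chunk[1:]
                else:
                    counts["cr"] += 1
                pending_cr = False
            if chunk.endswith(b"\r"):
                # Defer this CR until the next chunk shows what follows it.
                pending_cr = True
                chunk = chunk[:-1]
            crlf = chunk.count(b"\r\n")
            counts["crlf"] += crlf
            counts["lf"] += chunk.count(b"\n") - crlf
            counts["cr"] += chunk.count(b"\r") - crlf
    if pending_cr:
        # The file ended with a CR that nothing followed.
        counts["cr"] += 1
    return counts
```

Calling count_line_endings('claims.csv') and then the same on claims_v2.csv would show whether bare CR or CRLF endings in the input turned into LF in the output. A nonzero "cr" count in the input would support the carriage return theory.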

Thanks! Yeah, wc -l was returning a lot more lines than len(df). When I printed the lengths to compare them, they were exactly the same. I couldn't open the file in vi or vim because it would crash; I think the file is too big. I should have checked len(df). I didn't know that wc -l won't count carriage returns. — theEconCsEngineer, Jul 29 '20 at 20:12
