I'm running a simple script that sorts a CSV file in ascending order. The CSV files I am working with are about 65 GB.
When I count the lines in the original file (before the sort) with the Linux command wc -l claims.csv, I get:
143955892 claims.csv
After I run the code below, it produces the file claims_v2.csv.
My problem is that the sorted output has more lines than the input.
When I run wc -l claims_v2.csv
I get:
143957232 claims_v2.csv
Why is my sorting code creating 1340 more lines than the original?
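One thing worth checking first: wc -l counts newline characters, not CSV records, so a quoted text field that contains a line break adds to the wc -l total without adding a row. The two numbers may therefore not be directly comparable. Below is a minimal sketch, using only the standard library, that reports both counts for a text stream. The helper name count_lines_and_records and the sample data are made up for illustration. For a real file, the stream could come from open(path, newline="", encoding="utf-8").

```python
import csv
import io


def count_lines_and_records(stream):
    """Return (physical_lines, csv_records) for a text stream.

    physical_lines counts newline characters, like `wc -l`.
    csv_records counts parsed rows, so a quoted field that contains
    a newline still counts as a single record.
    """
    physical_lines = 0

    def counted(source):
        # Pass lines through to the CSV parser while tallying newlines.
        nonlocal physical_lines
        for line in source:
            physical_lines += line.count("\n")
            yield line

    records = sum(1 for _ in csv.reader(counted(stream)))
    return physical_lines, records


# Made-up sample: the second record has a line break inside "text".
sample = 'patent_id,sequence,text\n1,1,"first line\nsecond line"\n2,1,plain\n'
print(count_lines_and_records(io.StringIO(sample, newline="")))
```

If the record counts of claims.csv and claims_v2.csv agree while the wc -l counts differ, the difference would come from how line breaks inside fields are written, not from extra rows.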
I took a look at this: This
Could it be the error_bad_lines=False that's causing this?
import pandas as pd
import numpy as np

inputpath = '../Inputs/'
outputpath = '../Outputs/'

# Read every column as text first; 'sequence' is converted to a number below.
dtype_claim = {
    'patent_id': 'str',
    'sequence': 'object',
    'text': 'str',
}


def runsort():
    print('Running claims_v2.csv')
    columns = ['patent_id', 'sequence', 'text']
    # error_bad_lines=False skips malformed rows instead of raising.
    # (Deprecated in pandas 1.3 and removed in 2.0 in favor of on_bad_lines='skip'.)
    df = pd.read_csv(inputpath + 'claims.csv', dtype=dtype_claim, usecols=columns, encoding='utf-8',
                     engine='python', error_bad_lines=False)
    # Non-numeric sequence values become NaN, then -1, so they sort first.
    df['sequence'] = pd.to_numeric(df['sequence'], errors='coerce')
    df['sequence'] = df['sequence'].fillna(-1)
    df['sequence'] = df['sequence'].astype('int64')
    df = df.sort_values(by=['patent_id', 'sequence'], ascending=(True, True))
    print('Exporting to CSV')
    df.to_csv(outputpath + 'claims_v2.csv', index=False)


runsort()
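To check whether the files contain records that span several physical lines, a small diagnostic can list the first few of them. This is only a sketch under assumptions: the helper name multiline_records and the sample data are hypothetical, and it relies on the standard library's csv.reader.line_num, which counts lines read from the source (a bare carriage return may count as a line ending there, unlike in wc -l).

```python
import csv
import io


def multiline_records(stream, limit=5):
    """Return up to `limit` tuples (start_line, line_span, row) for
    records that occupy more than one physical line."""
    reader = csv.reader(stream)
    found = []
    previous_line = 0
    for row in reader:
        # line_num is the number of source lines consumed so far.
        span = reader.line_num - previous_line
        if span > 1:
            found.append((previous_line + 1, span, row))
            if len(found) >= limit:
                break
        previous_line = reader.line_num
    return found


# Made-up sample: the record starting on line 2 covers two lines.
sample = 'patent_id,sequence,text\n1,1,"first line\nsecond line"\n2,1,plain\n'
for start, span, row in multiline_records(io.StringIO(sample, newline="")):
    print(start, span, row)
```

Running the same check on both claims.csv and claims_v2.csv (opened with newline="") would show whether multi-line records exist in the input, the output, or both.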