I see three possibilities to do that.
- Read the file twice (first to count the fields per line, then a second time into pandas using the skiprows parameter)
- Read it into memory, filtering out the invalid lines, then pass it to pandas via StringIO
- Read it into pandas with all columns (or the number of desired columns + 1), then keep only the rows whose excess columns contain NaN
The following examples use two variables: len_threshold, which should be set to the number of columns allowed for a row, and your_file_name, which should contain the name of the CSV text file. All examples assume import pandas as pd and import io.
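For concreteness, a minimal setup could look like this; the file name, the three-column layout and the sample data are just assumptions for illustration:
import io
import pandas as pd

len_threshold = 3                # a valid row has at most 3 columns
your_file_name = 'data.csv'      # hypothetical file name

# write a small sample file: the third line has one field too many
with open(your_file_name, 'wt') as fp:
    fp.write('1,a,x\n'
             '2,b,y\n'
             '3,c,z,EXTRA\n'
             '4,d,w\n')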
Method 1: Read the file twice
A quick first pass over the file counts the separators per line; on the second pass pandas then skips the offending rows:
# first pass: count the separators of every line
with open(your_file_name, 'rt') as fp:
    counts = pd.Series([line.count(',') for line in fp])
# a valid row with len_threshold columns has at most len_threshold - 1 separators,
# so every row with more separators is skipped
rows_to_skip = counts[counts > len_threshold - 1].index.to_list()
# second pass: read the file again, skipping the bad rows
df = pd.read_csv(your_file_name, names=list(range(len_threshold)),
                 index_col=False, skiprows=rows_to_skip)
Note that to apply this method you have to be sure that your fields do not contain the separator, because the count does not check whether the commas are inside quoted text.
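If your fields can contain quoted commas, a possible workaround (a sketch, not part of the original method) is to do the counting pass with Python's csv module, which does respect quoting:
import csv

# count the fields per row with the csv module, so quoted commas are not miscounted;
# this assumes no newlines inside quoted fields, so csv rows map 1:1 to file lines
with open(your_file_name, 'rt', newline='') as fp:
    rows_to_skip = [i for i, row in enumerate(csv.reader(fp))
                    if len(row) > len_threshold]

df = pd.read_csv(your_file_name, names=list(range(len_threshold)),
                 index_col=False, skiprows=rows_to_skip)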
Method 2: Reading into memory / variant: reading into pandas line by line
string_buffer = io.StringIO()

with open(your_file_name, 'rt') as fp:
    for line in fp:
        # keep only lines with at most len_threshold - 1 separators,
        # i.e. at most len_threshold fields
        if line.count(',') < len_threshold:
            string_buffer.write(line)

# "rewind" the string buffer in order to read it from its start
string_buffer.seek(0)
df = pd.read_csv(string_buffer, names=list(range(len_threshold)), index_col=False)
Note that, as above, you have to be sure that your fields do not contain the separator, since the check does not account for commas inside quoted text. The method also holds a filtered copy of the file in memory, so it is not suitable for very large files.
You could also use a variant of this: instead of writing the correct lines to a string buffer, read each of them into pandas with read_csv right away. That way you still leave the type conversions to read_csv, but pandas may have trouble guessing the types correctly because it only sees one line at a time. If you already know the column types, you can of course pass them (see the dtype sketch after the variant). The variant would look like this:
# start with an empty dataframe and append the valid rows one by one
df = pd.DataFrame([], columns=range(len_threshold))
df_len = 0

with open(your_file_name, 'rt') as fp:
    for line in fp:
        # keep only lines with at most len_threshold fields
        if line.count(',') < len_threshold:
            tmp_df = pd.read_csv(io.StringIO(line), names=range(len_threshold),
                                 index_col=False)
            df.loc[df_len] = tmp_df.iloc[0]
            df_len += 1
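If the column types are known in advance, they can be passed to read_csv explicitly; the dtypes below are made-up values purely for illustration and have to match your data:
# hypothetical column types, keyed by the generated column names 0 .. len_threshold - 1
column_types = {0: 'Int64', 1: 'string', 2: 'string'}

line = '1,a,x\n'  # one example line, as it would come out of the loop above
tmp_df = pd.read_csv(io.StringIO(line), names=range(len_threshold),
                     index_col=False, dtype=column_types)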
Method 3: Reading into a dataframe, then filtering out the incorrect rows
This is the simplest method of all.
# read the whole file; the names list has to cover the widest row in the file,
# so pass the number of desired columns + 1 (or more, if rows can have
# several excess fields)
df = pd.read_csv(your_file_name, names=range(len_threshold + 1), index_col=False)
# a row is considered good if all of its excess columns contain nothing but NaN
if len(df.columns) > len_threshold:
    good_rows = df.iloc[:, len_threshold:].isna().all(axis='columns')
    # drop the bad rows, then drop the now empty excess columns
    df.drop(df.index[~good_rows], inplace=True)
    df.drop(df.columns[len_threshold:], axis='columns', inplace=True)
Note that this method also accepts rows with excess field separators, as long as the excess fields are empty. In the version above it also accepts rows with fewer than len_threshold columns. If, for example, your third column always contains something in valid rows, it is easy to exclude rows that are too short as well; you only have to change the good_rows line to:
good_rows = df.iloc[:, len_threshold:].isna().all(axis='columns') & df.iloc[:, 2].notna()
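Putting Method 3 together, a small end-to-end sketch could look like this; the helper name filter_csv is made up, and the third-column check assumes the three-column sample from the beginning:
def filter_csv(file_name, len_threshold, extra_cols=1):
    """Read file_name and drop rows that have data beyond len_threshold columns.

    extra_cols must be at least the number of excess fields a bad row can have.
    """
    df = pd.read_csv(file_name, names=range(len_threshold + extra_cols),
                     index_col=False)
    # keep rows whose excess columns are all NaN and whose third column is filled
    good_rows = (df.iloc[:, len_threshold:].isna().all(axis='columns')
                 & df.iloc[:, 2].notna())
    return df.loc[good_rows, df.columns[:len_threshold]]

df = filter_csv(your_file_name, len_threshold)
print(df)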