I regularly get sent a CSV containing 100+ columns and millions of rows. These CSV files always contain a certain set of columns, Core_cols = [col_1, col_2, col_3], and a variable number of other columns, Var_col = [a, b, c, d, e]. The core columns are always there, and there can be anywhere from 0 to 200 of the variable columns. Sometimes one of the variable columns will contain a carriage return. I know which columns this can happen in: bad_cols = [a, b, c].
When I import the CSV with pd.read_csv, these carriage returns produce corrupt rows in the resulting dataframe. I can't re-make the CSV without these columns.
How do I either:
- Ignore these columns and the carriage return contained within? or
- Replace the carriage returns with blanks in the csv?
My current code looks something like this:
import pandas as pd
df = pd.read_csv("data.csv", dtype=str)
I've tried things like removing the columns after the import, but the damage seems to have already been done by that point. I can't find the code now, but when testing one fix the error said something like "invalid character u000D in data". I don't control the source of the data, so I can't make the edits there.
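
For reference, here is a minimal sketch of the second option (replacing the carriage returns before pandas sees the file). It rests on assumptions that may not hold for this data: that real row endings are "\n" or "\r\n", that a stray carriage return is a bare "\r" not followed by "\n", and that the file uses an ASCII-compatible encoding such as UTF-8. The helper name strip_bare_cr and the file names are hypothetical, not part of pandas.

```python
import re

# Hypothetical helper, not part of pandas.
# Assumption: real row endings are "\n" or "\r\n", so any "\r" that is
# NOT immediately followed by "\n" is a stray one inside a field.
_BARE_CR = re.compile(rb"\r(?!\n)")


def strip_bare_cr(src_path, dst_path, replacement=b" ", chunk_size=1 << 20):
    """Copy src_path to dst_path, replacing each bare CR with `replacement`.

    The file is streamed in chunks, so it never has to fit in memory.
    """
    held = b""  # a trailing "\r" held back until the next chunk is read
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            data = held + chunk
            # A "\r" at the very end might be the first half of "\r\n",
            # so defer the decision until the next chunk arrives.
            if data.endswith(b"\r"):
                data, held = data[:-1], b"\r"
            else:
                held = b""
            dst.write(_BARE_CR.sub(replacement, data))
        # A "\r" as the last byte of the file cannot be part of "\r\n".
        if held:
            dst.write(replacement)
```

Under those assumptions, the cleaned copy could then be loaded as before, for example strip_bare_cr("data.csv", "data_clean.csv") followed by pd.read_csv("data_clean.csv", dtype=str). If the stray characters are actually "\r\n" pairs inside fields, this sketch would not tell them apart from genuine row endings.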