
I regularly get sent a csv containing 100+ columns and millions of rows. These csv files always contain a certain set of columns, Core_cols = [col_1, col_2, col_3], and a variable number of other columns, Var_col = [a, b, c, d, e]. The core columns are always present, and there can be 0-200 of the variable columns. Sometimes one of the variable columns will contain a carriage return. I know which columns this can happen in, bad_cols = [a, b, c].

When I import the csv with pd.read_csv, these carriage returns create corrupt rows in the resulting dataframe. I can't re-make the csv without these columns.

How do I either:

  1. Ignore these columns and the carriage return contained within? or
  2. Replace the carriage returns with blanks in the csv?

My current code looks something like this:

df = pd.read_csv('data.csv', dtype=str)

I've tried things like removing the columns after the import, but the damage seems to have already been done by that point. I can't find the code now, but when testing one fix the error said something like "invalid character u000D in data". I don't control the source of the data, so I can't make edits at the source.
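For context, option 2 would mean pre-cleaning the file before pandas ever sees it. The sketch below is only what I have in mind, not tested against the real data; it assumes the stray characters are bare \r (and possibly NUL) bytes inside fields, that real row breaks are \n or \r\n, and that the file names are placeholders:

import pandas as pd

# Stream the file line by line; binary mode splits on \n only, so a stray
# \r stays inside its line rather than starting a new row.
with open('data.csv', 'rb') as fin, open('data_clean.csv', 'wb') as fout:
    for line in fin:
        line = line.replace(b'\x00', b'')    # drop any NUL bytes
        line = line.replace(b'\r\n', b'\n')  # preserve real Windows line endings
        line = line.replace(b'\r', b' ')     # blank out stray carriage returns
        fout.write(line)

df = pd.read_csv('data_clean.csv', dtype=str)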

  • pandas can normally read multiline CSV files that are properly quoted (") with defaults; e.g. `df = pd.read_csv('test.csv')`. Does the CSV file open in Excel correctly? – CodeMonkey Jul 27 '21 at 20:25
  • These lines don't open properly in Excel – Richard Kapustynskyj Jul 27 '21 at 20:26
  • Here is a correctly formatted multiline CSV: `play,text\n Hamlet,"There are more things in Heaven and Earth, Horatio,\n than are dreamt of in your philosophy."` – CodeMonkey Jul 27 '21 at 20:27
  • Tried reading file in chunks? `pd.read_csv(filename, chunksize=chunksize)` – CodeMonkey Jul 27 '21 at 20:31
  • I have, but the bad values in the data still corrupt the dataframe. The corruption still happens if I know a row is bad and use skiprows and nrows to import what should be just a single row; the dataframe still ends up with two rows – Richard Kapustynskyj Jul 27 '21 at 20:55
  • Have you tried chunking the file using csv module and fixing the data in chunks (or do you see same problem)? https://stackoverflow.com/questions/4956984/how-do-you-split-reading-a-large-csv-file-into-evenly-sized-chunks-in-python – CodeMonkey Jul 27 '21 at 21:14
  • When using the csv module I get the error: “_csv.Error: line contains NUL” when I run it over these lines – Richard Kapustynskyj Jul 27 '21 at 21:43
  • Looks like you're going to need to do some data cleansing on the data before operating on it as CSV data. Try to capture 4 rows of CSV data in a test.csv file and try to open it in Excel. Trial and error: make changes until it loads correctly in Excel, then see if the same file opens in pandas. Next, programmatically make those changes on the raw, unchanged text lines and try to open the result in pandas. – CodeMonkey Jul 27 '21 at 21:49
  • Thank you for your help. Do you have any thoughts on how to extract just four rows of the data, given that the file is too big to open in Notepad? – Richard Kapustynskyj Jul 27 '21 at 21:52
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/235362/discussion-between-jasonm1-and-richard-kapustynskyj). – CodeMonkey Jul 27 '21 at 22:28

1 Answer


Pandas supports multiline CSV files if the file is properly escaped and quoted. If you cannot read a CSV file in Python using the pandas or csv modules, nor open it in MS Excel, then it's probably a non-compliant "CSV" file.
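For example, this properly quoted record contains an embedded newline, yet pandas reads it back as a single row with the default settings (a quick illustration using an in-memory buffer):

import io
import pandas as pd

# A quoted field may legally contain a newline; pandas parses it as one row.
csv_text = ('play,text\n'
            'Hamlet,"There are more things in Heaven and Earth, Horatio,\n'
            'than are dreamt of in your philosophy."\n')
df = pd.read_csv(io.StringIO(csv_text))
print(len(df))            # 1 -> the embedded newline did not split the row
print(df.loc[0, 'text'])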

I recommend manually editing a sample of the CSV file until it opens correctly in Excel, then recreating those steps programmatically in Python to normalize the large file.

Use this code to create a sample CSV file by copying the first ~100 lines into a new file.

# Copy the first 100 lines of the big file into a smaller test file.
with open('bigfile.csv', 'r') as csvin, open('test.csv', 'w') as csvout:
    line = csvin.readline()
    count = 0
    while line and count < 100:
        csvout.write(line)
        count += 1
        line = csvin.readline()

Now you have a small test file to work with. If the original CSV file has millions of rows and the "bad" rows appear much later in the file, you will need to add some logic to find those "bad" lines.
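One way to do that, sketched here under the assumption that a "bad" line is one a stray carriage return has split in two (so it carries fewer delimiters than the header), is to scan for lines with an unexpected field count:

# Flag lines whose comma count differs from the header's; a row broken by
# a stray carriage return will come up short. This simple count assumes
# fields do not contain legitimate quoted commas or newlines.
with open('bigfile.csv', 'r') as f:
    header = f.readline()
    expected = header.count(',')
    for lineno, line in enumerate(f, start=2):
        if line.count(',') != expected:
            print(f'suspect line {lineno}: {line[:80]!r}')

You can then copy a window of lines around each suspect line into the test file instead of the first 100.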

CodeMonkey