Interoperability issues in reading CSV files between Excel, R, and Python

Question

I am working on a project that requires me to independently verify the results of my Python code using another person's R code and a third person's Excel spreadsheet. (Yeesh.) Python was able to read the Excel file containing the source data, but the R script required a CSV and was not robust to parsing CSVs generated either by Excel or by pandas, failing silently both times.

I generated the CSV in Excel by using the Save As... function. In pandas, I used the following commands:

inFile = 'data-parsed-for-R.xlsx'
Data = pd.read_excel(inFile)

for site, subframe in Data.groupby('Site'):
    outFile = site+'.csv'
    outPath = os.path.join(workingDirectory, site, 'Data')
    if not os.path.exists(outPath):
        os.makedirs(outPath)
    subframe.to_csv(path_or_buf=outPath+'\\'+outFile, na_rep='NA', header=['Foo', 'Bar', 'Baz', 'Qux', 'Quux', \
                                                                           'Quuux','Quuuux'], index=False)

I have already solved the issue - see solution below; I'm submitting the question so that other people will hopefully be able to find it, and possibly comment on other interoperability issues they've discovered.

Does this fix the issue in R? https://stackoverflow.com/a/26347303/6851825 — Jon Spring, Sep 15 '21 at 19:07
I appreciate you posting a solution to your own problem to share knowledge. However, there are countless possible ways CSV files could be generated by programs resulting in minor incompatibility issues. Currently, the question is presented very generally, but the answer details basically 2 problems, an index column and date formatting. I'd suggest editing the question so that it is more searchable for people having those particular problems, as people having those problems are the ones that might be helped by your answer. — Gregor Thomas, Sep 15 '21 at 19:36
@GregorThomas Thank you for the feedback - do you have any specific suggestions for how I can edit the question to make it more searchable? — Elizabeth Rosenzweig, Sep 20 '21 at 19:19
Instead of "interoperabiliity issues", perhaps "BOM and datetime interoperability issue" in your question title, or something like that. Your answer identifies and solves several specific issues that aren't mentioned at all in your question--so mention those issues. — Gregor Thomas, Sep 20 '21 at 19:40
It might also be worth including the R code that reads in the CSV - your answer seems specific to that piece of R code. And that could also allow additional answers suggesting R-based solutions as well. Since you already have a working answer, it probably won't be a priority for anyone to add more answers, but it would paint a more complete picture of the problem. — Gregor Thomas, Sep 20 '21 at 19:42

score 0 · Answer 1 · answered Sep 15 '21 at 19:03

Reading Excel-generated CSV files in R

Using Python's open function, I found that there was a set of whitespace characters ï»¿ at the beginning of an Excel-created CSV that don't display when the file is opened either in Excel or in any Notepad-type application. The ï»¿ is a byte order mark (BOM) corresponding to Windows's utf-8 code, EF BB BF. (This issue is documented in much greater detail here: Weird characters added to first column name after reading a toad-exported csv file .) My solution: don't use Excel to generate CSV input files for R. My other, less preferable solution:

import csv

with open('ExcelMade.csv') as fo:
    debug = fo.readlines()

debug[0] = debug[0][3:]

with open('ExcelMadeFixed.csv', 'w') as fo:
    write = csv.writer(fo)
    write.writerows(debug)

Reading Pandas-generated CSV files in R

There were three issues I identified and solved here. Two had to do with indexes being added by default, the third had to do with datetime formatting.

index = False. This option should be specified in DataFrame.to_csv() when writing out the CSV file. I caught that in my original code, but I wanted to document it in case someone else missed it the first time.

index_col=(0). It turned out that my previous Python script for cleaning and normalizing the data had automatically added an index column to the Excel output. This doesn't matter to pandas, but it does matter to R. Check your input file and add this option to pd.read_excel() if there is an index present in column A.

Datetime formatting. By default, Python represents datetime-type data as YYYY-MM-DD HH:MM:SS. The R script I was provided required the datetimes as MM/DD/YY. This is also a simple fix; there is a date_format option in to_csv() that you can use.

My final code ended up looking like this:

inFile = 'data-parsed-for-R.xlsx'
Data = pd.read_excel(inFile, index_col=(0))

for site, subframe in Data.groupby('Site'):
    outFile = site+'.csv'
    outPath = os.path.join(workingDirectory, site, 'Data')
    if not os.path.exists(outPath):
        os.makedirs(outPath)
    subframe.to_csv(path_or_buf=outPath+'\\'+outFile, na_rep='NA', header=['Foo', 'Bar', 'Baz', \
                                                                           'Qux', 'Quux', 'Quuux', \
                                                                           'Quuuux'], index=False, \
                                                                           date_format='%m/%d/%Y')

It's worth noting that R's defaults also prefer datetime as `YYYY-MM-DD HH:MM:SS`. It seems the R script you were provided overrode those defaults to expect `MM/DD/YY`, but that is a characteristic of your R script, not of R generally. Also R's functions for reading in data usually have an argument where you can specify whether or not there is an index column. — Gregor Thomas, Sep 15 '21 at 19:33

Interoperability issues in reading CSV files between Excel, R, and Python

1 Answers1