
I have a series of massive data files that range in size from 800k to 1.4M rows, and one variable in particular has a fixed length of 12 characters (numeric data, but with leading zeros where the number of non-zero digits is fewer than 12). The column should look like this:

col
000000000003
000000000102
000000246691
000000000042
102851000324

etc.
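For reference, a quick sketch of how a zero-padded column like this can be produced in R, assuming the values start out numeric (the vector name id here is hypothetical):

id <- c(3, 102, 246691, 42, 102851000324)
# "%012.0f" zero-pads to a fixed width of 12 characters, with no decimals
col <- sprintf("%012.0f", id)
col
#> [1] "000000000003" "000000000102" "000000246691" "000000000042" "102851000324"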

I need to export these files to CSV for a client, using R. The final data NEEDS to retain the 12-character structure, but when I open the CSV files in Excel, the zeros disappear. This happens even after converting the entire data frame to character. The code I am using is as follows.

library(dplyr)
library(rio)

df1 <- df1 %>%
  mutate(across(everything(), as.character))
##### I did this for all data frames #####


export(df1, "df1.csv")
export(df2, "df2.csv")
....
export(df17, "df17.csv")

I've read a few other posts that say this is an Excel problem, and that makes sense, but given the number of data files and the amount of data, as well as the client's need to open the files in Excel, I need a way to handle it on the front end in R. Any ideas?

    This is an Excel issue, not an R issue. Most likely you need to import the CSV file into a blank Excel workbook using the Data->From Text/CSV dialog and specify text for the column type, rather than just clicking on the CSV file to open it. – neilfws Jan 11 '23 at 23:48
  • See https://stackoverflow.com/questions/165042/stop-excel-from-automatically-converting-certain-text-values-to-dates/165052#165052 for methods to address this. As commented above, this *is* an Excel issue but there are workarounds although there is not enough info to say whether they're suited to your case. – Ritchie Sacramento Jan 12 '23 at 00:05
  • @RitchieSacramento, what additional information would you need to answer this? – jtscheirer Jan 12 '23 at 00:08
  • The easiest approach as indicated in the link is to save the data as a formula containing a string literal. My comment was that we don't know how the data is intended to be used after that and whether it will cause problems down the pipeline. – Ritchie Sacramento Jan 12 '23 at 00:10
  • Could you alternatively save it as an excel file to circumvent Excel's CSV loading bug for leading zeroes? – Jon Spring Jan 12 '23 at 00:11
  • @JonSpring, my sense is that this is an Excel problem. So wouldn't it still persist regardless of whether it is saved as .xlsx or .csv? Either way, the file sizes I'm working with are too big to be efficiently saved as Excel files. Unless I'm doing something wrong, I'm able to export .csv files almost instantly, while exporting .xlsx files of this size nearly crashes R. – jtscheirer Jan 12 '23 at 17:54
  • FWIW, in Excel I can make a workbook with cells that are formatted text. If I save as xlsx and reopen, the zero padding is preserved. If I save as CSV and reopen, the zero padding is lost. So I believe the problem is that Excel strips leading zeroes when it loads CSVs. You might look at the various packages that save Excel files and see if another works better for you. – Jon Spring Jan 12 '23 at 18:47

1 Answer

Yes, this is definitely an Excel problem!
To demonstrate: in Excel, enter your column values, save the file as a CSV, and then re-open it in Excel; the leading zeros will disappear.

One option is to add a leading non-numeric character, such as an apostrophe:

df$col <- paste0("'", df$col)

Not great, but it is an option.

A slightly better option is to paste Excel's TEXT() function around the character string. Excel will then evaluate the formula when the file is opened.

df$col <- paste0("=Text(", df$col, ", \"000000000000\")")
#or  
df$col <- paste0("=\"", df$col, "\"")
write.csv(df, "df2.csv", row.names = FALSE)

Of course, if the CSV file is then re-saved from Excel and reopened, the leading zeros will disappear again.
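Given the number of files, a loop along these lines may help (a sketch, assuming the data frames are named df1 through df17 in the global environment and the padded column is called col in each):

for (i in 1:17) {
  df <- get(paste0("df", i))              # fetch dfN by name
  df$col <- paste0("=\"", df$col, "\"")   # wrap in an Excel string-literal formula
  write.csv(df, paste0("df", i, ".csv"), row.names = FALSE)
}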

Another option is to investigate saving the files directly as .xlsx with the "writexl" or "xlsx" (or similar) package.
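For example, writexl writes .xlsx directly from a C library with no Java dependency, so it may avoid the performance problems with .xlsx export mentioned in the comments. A minimal sketch:

library(writexl)
# character columns are written as text cells, so the leading zeros survive
write_xlsx(df1, "df1.xlsx")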

Dave2e