Fix BOM issues when reading UTF-8 encoded CSVs with VBA (ï»¿)

Question

I'd like to get fresh advice on the infamous issue caused by the Byte Order Mark (ï»¿ or EF BB BF in hex) when trying to read UTF-8 encoded CSVs with VBA (Excel). Note that I'd like to avoid opening the CSV with Workbooks.Open or FileSystemObject. Actually, I'd rather use an adodb.RecordSet as I need to execute some kind of SQL queries.

After having read many (many!) things, I believe the 4 best solutions to deal with this specific issue are:

Removing the BOM before reading the CSV with ADODB.Connection / ADODB.RecordSet (for instance, via #iFile or Scripting.FileSystemObject-OpenAsTextStream to efficiently read the first line of the file and remove the BOM).
Creating a schema.ini file so that ADO properly parses the CSV.
Using some modules created by wizards (like W. Garcia's class module).
Using an ADODB.Stream and setting Charset = "UTF-8".

The last solution (using a stream) seems quite fine but doing the following returns a string:

Sub loadCsv()

    Const adModeReadWrite As Integer = 3

    With CreateObject("ADODB.Stream")
        .Charset = "utf-8"
        .Mode = adModeReadWrite
        .Open
        .LoadFromFile ("C:\atestpath\test.csv")
        Debug.Print .readtext
    End With
 
End Sub

Do you know any trick that could help using the string returned by .readtext as the Data Source of an ADODB.RecordSet or ADODB.Connection (apart from looping to manually populate the fields of my recordset)?

Does this answer your question? [How to query UTF-8 encoded CSV-files with VBA in Excel 2010?](https://stackoverflow.com/q/33820866/11683) — GSerg, Oct 28 '20 at 14:43
@GSerg No, unfortunately "CharacterSet=65001" does not do any good on my CSV. Also, i'd like to avoid creating a sheet. — Lo Bellin, Oct 28 '20 at 14:51
You don't need to create a sheet. You wanted to query the data into a recordset, it does exactly that. — GSerg, Oct 28 '20 at 14:55
Using `CharacterSet=65001` you still end up with `?` in front of the first field name. — Profex, Oct 28 '20 at 15:36

Profex · Answer 1 · 2020-10-28T17:59:00.657

So, looking into it more, it looks like even if you specify CharacterSet=65001, either in the connection string or the Schema.ini you can't really get rid of the ? in front of the first field.

You can get rid of it if you specify all of the columns in the Schema.ini; but that would still require you create the Schema.ini for every file. You would have to know the field names up front, whether that be because they are always the same, or by reading the field names (running in circles here).

It looks like any solution will have you pre-process the file, ...

So the question is, does it really matter? ...NO, it doesn't seem like it does

In fact, even though the first field name has a ? in front of it, it doesn't look like it actually matters.

Sub ReadCSVasRecordSet()
Const adOpenStatic = 3
Const adLockOptimistic = 3
Const adCmdText = &H1
Dim FilePath As String, Filename As String
Dim Conn As ADODB.Connection
Dim RS As ADODB.Recordset
    FilePath = "C:\temp"
    Set Conn = New ADODB.Connection
    'Conn.Open "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" & FilePath & ";Extended Properties=""text;CharacterSet=utf-8;HDR=YES;FMT=Delimited"""
    Conn.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" & FilePath & ";Extended Properties=""text;HDR=YES;FMT=Delimited"""
    Filename = "CN43N-Projects.csv"
    Set RS = New ADODB.Recordset
    RS.Open "SELECT * FROM [" & Filename & "] WHERE [Status] = ""REL"" AND [Lev] = 1", Conn, adOpenStatic, adLockOptimistic, adCmdText
    'Checking the first field name
    Debug.Print RS.Fields(0).Name       ' Outputs: ?Lev
    Debug.Print RS.Fields("Lev").Name   ' Outputs: ?Lev
    'Debug.Print RS.Fields("?Lev").Name ' Errors out if I include ?
    Do Until RS.EOF
        Debug.Print RS.Fields.Item("Lev"),
        Debug.Print RS.Fields.Item("Proj# def#"),
        Debug.Print RS.Fields.Item("Name"),
        Debug.Print RS.Fields.Item("Status")
        RS.MoveNext
    Loop
    Set RS = Nothing
    If Not Conn Is Nothing Then
        Conn.Close
        Set Conn = Nothing
    End If
End Sub

Edit 1 - What??

Interestingly, if you want to clean the field name, you can't match the first character with "?" directly because it's still UTF-8. You can either check the ASCII code values

Asc(Left(Fields(0).Name, 1)) = Asc("?");

Or better yet use AscW. You'll notice when you use UTF-8 formatting, you'll end up with

AscW(Left(Fields(0).Name, 1)) = -257 (not 63).

Function CleanFieldName(Fields As ADODB.Fields, Item As Variant) As String
    CleanFieldName = Fields(Item).Name
    ' Comparing against "?" doesn't Work..
    'If Left(CleanFieldName, 1) = "?" And Fields(0).Name = Fields(Item).Name Then CleanFieldName = Mid(CleanFieldName, 2)
    If AscW(Left(CleanFieldName, 1)) = -257 And Fields(0).Name = Fields(Item).Name Then CleanFieldName = Mid(CleanFieldName, 2)
End Function

Unfortunately this does not work. The names of the fields are not detected (actually, the name of the first item in the fields is the BOM, aka ï»¿ or just ? if using CharacterSet=65001) — Lo Bellin, Oct 29 '20 at 08:46
If you don't have headers, then maybe use the Schema.ini, since you must know the columns then. You can always clean the first columns data by removing the first character if `AscW() = -257`. — Profex, Oct 29 '20 at 14:43

Lo Bellin · Accepted Answer · 2020-11-02T13:42:54.257

EDIT: I found that loading the CSV with a querytable object (see this good example) or through a WorkbookQuery object (introduced in Excel 2016) are the easiest and probably most reliable ways to proceed (see an example from the documentation here).

OLD ANSWER:

Talking with @Profex encouraged me to further investigate the issue. Turns out there are 2 problems: the BOM and the delimiter used for the CSV. The ADO connection string I need to use is :

strCon = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\test\;Extended Properties='text;HDR=YES;CharacterSet=65001;FMT=Delimited(;)'"

But FMT does not work with a semicolon (FMT=Delimited(;)), at least with Microsoft.ACE.OLEDB.12.0 on a x64 system (Excel x64). Thus, @Profex was quite right to state:

even though the first field name has a ? in front of it, it doesn't look like it actually matters

given that he was using FMT=Delimited on a CSV delimited by a simple comma (",").

Some people suggest to edit the registry so that the semicolon delimiter is accepted. I'd like to avoid that. Also, I'd rather not create a schema.ini file (even if that may be the best solution for complex CSVs). Thus, the only solutions remaining require to edit the CSV before creating the ADODB.Connection.

I know my CSV will always have the problematical BOM as well as the same basic structure (something like "date";"count"). Thus I decided to go with this code:

Dim arrByte() As Byte
Dim strFilename As String
Dim iFile As Integer
Dim strBuffer As String
strFilename = "C:\Users\test\t1.csv"
If Dir(strFilename) <> "" Then 'check if the file exists, because if not, it would be created when it is opened for Binary mode.
    iFile = FreeFile
    Open strFilename For Binary Access Read Write As #iFile
    strBuffer = String(3, " ") 'We know the BOM has a length of 3
    Get #iFile, , strBuffer
    If strBuffer = "ï»¿" 'Check if the BOM is there
        strBuffer = String(LOF(iFile) - 3, " ")
        Get #iFile, , strBuffer 'the current read position is ok because we already used a Get. We store the whole content of the file without the BOM in strBuffer
        arrByte = Replace(strBuffer, ";", ",") 'We replace every semicolon by a colon
        Put #iFile, 1, arrByte
    End If
    Close #iFile
End If

(note: one might use arrByte = StrConv(Replace(strBuffer, ";", ","), vbFromUnicode) because the bytes array is in ANSI format).

Fix BOM issues when reading UTF-8 encoded CSVs with VBA (ï»¿)

2 Answers2

Linked