1

I've got a project that has to do a bulk import to SQL from a CSV file. Creating the data columns has been a success, however, I'm running into a problem with the rows. A comma is used as the delimiter to separate the columns which work great in the column names, but not in the rows of data. Some data has a comma to split name and surnames. Together with this, only every second field(column) is enclosed in double quotes. Using all this is breaking the rows in many more columns than it should. I have suggested changing the delimiter to a semicolon which actually works great and everything works fine, except this is not accepted by the customer as they don't want to change anything.

This is what I've done:

    private static DataTable ImportFordEmailList(string csvFilePath)
    {
        DataTable csvData = new DataTable();

        DataTable dt = new DataTable();
        dt.Columns.Add("ColumnName");
        dt.Rows.Clear();

        try
        {
            using (TextFieldParser csvReader = new TextFieldParser(csvFilePath))
            {
                //  csvReader.TextFieldType = FieldType.Delimited;
                csvReader.SetDelimiters(new string[] { "," });
                csvReader.HasFieldsEnclosedInQuotes = false;
                csvReader.TrimWhiteSpace = true;

                string[] colFields = csvReader.ReadFields();

                foreach (string column in colFields)
                {
                    if (dt.Rows.Count > 0)
                    {
                        string newColumn = Regex.Replace(column, "[^A-Za-z0-9]", "");
                        string findColum = "ColumnName = '" + newColumn.Trim() + "'";
                        DataRow[] foundRows = dt.Select(findColum);

                        if (foundRows.Length == 0)
                        {
                            DataRow dr = dt.NewRow();
                            dr["ColumnName"] = newColumn.Trim();
                            dt.Rows.Add(dr);
                        }
                        else
                        {
                            DataRow dr = dt.NewRow();
                            dr["ColumnName"] = newColumn.Trim() + "1";
                            dt.Rows.Add(dr);
                        }                         
                    }
                    else
                    {
                        string newColumn = column.Replace("'", "");
                        newColumn = newColumn.Replace(" ", "");
                        string clean = Regex.Replace(newColumn, "[^A-Za-z0-9 ]", "");
                        DataRow dr = dt.NewRow();
                        dr["ColumnName"] = clean.Trim();                            
                        dt.Rows.Add(dr);
                    }
                }

                foreach (DataRow row in dt.Rows)
                {
                    string colName = Regex.Replace(row["ColumnName"].ToString().Trim(), "/^[ A-Za-z0-9]*$/", "");

                    DataColumn datecolumn = new DataColumn(colName);
                    datecolumn.AllowDBNull = true;
                    csvData.Columns.Add(datecolumn);
                }

                while (!csvReader.EndOfData)
                {
                    string[] fieldData = csvReader.ReadFields();

                    for (int i = 0; i < fieldData.Length; i++)
                    {
                        if (fieldData[i] == "")
                        {
                            fieldData[i] = null;
                        }
                    }

                    foreach (string s in fieldData)
                    {
                         s.Replace("\"","");
                         Regex.Replace(s, "/^[ A-Za-z0-9 '@.()]", "");
                         string a = s;
                    }

                    csvData.Rows.Add(fieldData);
                }
            }
        }
        catch (Exception ex)
        {
        }
        return csvData;
    }

This is an example of how the data looks like:

enter image description here

Is there a way that I can work around this and make this work?

----- EDIT, Add data sample as text --------

    Name,Name,Email,Manager Level1,Level 1 manager's email,Manager Level2,Level 2 manager's email
    Adams, D. (Deon)              ,"Adams, Deon. (D)              ",username@email.com,"Masete, Thabo (B.T.)",username@email.com,"Fraser, Mervyn (M.)",username@email.com
    Akaramunkongwanit,  S. (Sirapra) ,"Akaramunkongwanit, Sirapra  (S.)",username@email.com> ,"Naraphirom, Suphajitphat (Pin.)",username@email.com,"Jeeradeepalung, Jirawat (Jee.)",username@email.com
    Angel, L. (Dave)              ,"Angel, Dave (L.) ",username@email.com,"Causton, Keith (K.H.) ",username@email.com,"White, Chris- Manf Eng (C.F.) ",username@email.com
    Apairat, J. (Janjira),"Apairat, Janjira (J.) "username@email.com,"Choksiriwanna, Phatthar  (Patsy.)",username@email.com,"Phusitpoykai, Rachawan (R.) ",username@email.com
Kerieks
  • 1,042
  • 7
  • 25
  • 53
  • could you post the sample of data as text so that it becomes copy-pastable?=! – Mong Zhu May 24 '17 at 07:58
  • @Mong Zhu, I added a sample, thanks – Kerieks May 24 '17 at 08:02
  • Who supplies that csv? It's a mess. – VDWWD May 24 '17 at 08:03
  • @ VDWWD, the customer supplied this. This is the format which they will use to import frequently... They only thing I found that works with this mess is a semi colon, which they won't accept... – Kerieks May 24 '17 at 08:06
  • There is the answer if you search a little: https://stackoverflow.com/questions/769621/dealing-with-commas-in-a-csv-file – Ivan Ičin May 24 '17 at 08:08
  • The last row of the CSV is invalid (missing comma between *"Apairat, Janjira (J.) "* and *username@email*). Read [RFC4180](https://tools.ietf.org/html/rfc4180) before working, parsing CSV files – Sir Rufo May 24 '17 at 08:13
  • And there are some NuGet packages out there dealing with CSV files https://www.nuget.org/packages?q=Tags%3A%22csv%22 - choose one of them (I like Filehelpers) – Sir Rufo May 24 '17 at 08:23
  • Working with CSV files can be annoying, and if you have an inconsistent format, I would suspect you'll have to parse the file yourself instead of relying on other components, so you can apply the relevant business logic to each line, field and possible character. – Allan S. Hansen May 24 '17 at 08:25
  • @AllanS.Hansen If you have an inconsistent format then fix that format. Contracts are known in software development and normal business and one part of the contract is to process only valid CSV files described in RFC4180 – Sir Rufo May 24 '17 at 08:30
  • I don't think you'll ever get a correct import from that. It is bound to generate lot's of mistakes. And guess who's fault that is gonna be. – VDWWD May 24 '17 at 08:35
  • Thanks for all the input, I will go through the links provided... The only solution I found so far was using a semicolon then remove all quotes and characters that don't belong, but thanks I will have a look at all the above possible solutions... – Kerieks May 24 '17 at 08:43
  • 1
    @SirRufo In the real world - sometimes suboptimal solutions will have to be used. A proper, consistent, format is preferable - but it is not always possible. Problems will have to be solved and often in non-textbook solutions. – Allan S. Hansen May 24 '17 at 08:50
  • @Kerieks Ask the customer to quote the first field too (the missing comma in the last line is hopefully a copypaste error) and you have a valid CSV you can import with one of the NuGet packages. – Sir Rufo May 24 '17 at 09:05
  • @Sir Rufo, thanks I will do so, and thanks I'm busy looking into the NuGet packages to see which will work.... – Kerieks May 24 '17 at 09:27

0 Answers0