-1

My task is to check which of the elements of a column in one csv are not included in the elements of a column in the other csv. There is a country column in both csv and the task is to check which countries are not in the secong csv but are in the first csv.

I guess I have to solve it with Lists after I read the strings from the two csv. But I dont know how to check which items in the first list are not in the other list and then put it to a third list.

nvoigt
  • 75,013
  • 26
  • 93
  • 142
  • [List.IndexOf](https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.list-1.indexof?view=net-5.0) is your friend. – 500 - Internal Server Error Oct 12 '21 at 22:46
  • 1
    @500-InternalServerError heck no.. HashSet and/or Dictionary – Caius Jard Oct 12 '21 at 22:56
  • 2
    So what is the problem here? What code have you tried and where are you having trouble. We can only guess, as you have, your code attempt will clarify your problem for us and provide a context so we can respond in a way that will be useful to you. Ultimately you need record all the unique countries in the _second_ CSV, array or list is fine, we're not talking about thousands of entries. Then you can just use a reader across the _first_ CSV and only record the lines (or countries) that did not exist in the second one. – Chris Schaller Oct 13 '21 at 00:05
  • In your code example we are interested to see how you are parsing the csv, are you doing this from first principals with a reader or string split, are you using using a library like [CsvHelper](https://www.nuget.org/packages/CsvHelper)? – Chris Schaller Oct 13 '21 at 00:35
  • Please provide enough code so others can better understand or reproduce the problem. – Community Oct 14 '21 at 12:56

2 Answers2

0

There are many way to achieve this, for many real world CSV applications it is helpful to read the CSV input into a typed in-memory store there are standard libraries that can assist with this like CsvHelper as explained in this canonical post: Parsing CSV files in C#, with header

However for this simple requirement we only need to parse the values for Country form the master list, in this case the second csv. We don't need to manage, validate or parse any of the other fields in the CSVs

  1. Build a list of unique Country values from the second csv
  2. Iterate the first csv
    1. Get the Country value
    2. Check against the list of countries from the second csv
    3. Write to the third csv if the country was not found

You can test the following code on .NET Fiddle

NOTE: this code uses StringWriter and StringReader as their interfaces are the same as the file reader and writers in the System.IO namespace. but we can remove the complexity associated with file access for this simple requirement

string inputcsv = @"Id,Field1,Field2,Country,Field3
1,one,two,Australia,three
2,one,two,New Zealand,three
3,one,two,Indonesia,three
4,one,two,China,three
5,one,two,Japan,three";

string masterCsv = @"Field1,Country,Field2
one,Indonesia,...
one,China,...
one,Japan,...";

string errorCsv = "";

// For all in inputCsv where the country value is not listed in the masterCsv
// Write to errorCsv
 
// Step 1: Build a list of unique Country values
bool csvHasHeader = true;
int countryIndexInMaster = 1;
char delimiter = ',';
List<string> countries = new List<string>();
using (var masterReader = new System.IO.StringReader(masterCsv))
{
    string line = null;
    if (csvHasHeader)
    {
        line = masterReader.ReadLine();
        // an example of how to find the column index from first principals
        if(line != null)
            countryIndexInMaster = line.Split(delimiter).ToList().FindIndex(x => x.Trim('"').Equals("Country", StringComparison.OrdinalIgnoreCase));
    }
    while ((line = masterReader.ReadLine()) != null)
    {
        string country = line.Split(delimiter)[countryIndexInMaster].Trim('"');
        if (!countries.Contains(country))
            countries.Add(country);
    }
}

// Read the input CSV, if the country is not in the master list "countries", write it to the errorCsv
int countryIndexInInput = 3;
csvHasHeader = true;

var outputStringBuilder = new System.Text.StringBuilder();
using (var outputWriter = new System.IO.StringWriter(outputStringBuilder))
using (var inputReader = new System.IO.StringReader(inputcsv))
{
    string line = null;
    if (csvHasHeader)
    {
        line = inputReader.ReadLine();
        if (line != null)
        {
            countryIndexInInput = line.Split(delimiter).ToList().FindIndex(x => x.Trim('"').Equals("Country", StringComparison.OrdinalIgnoreCase));
            outputWriter.WriteLine(line);
        }
    }
    while ((line = inputReader.ReadLine()) != null)
    {
        string country = line.Split(delimiter)[countryIndexInInput].Trim('"');
        if(!countries.Contains(country))
        {
            outputWriter.WriteLine(line);
        }
    }
    outputWriter.Flush();
    errorCsv = outputWriter.ToString();
}

// dump output to the console
Console.WriteLine(errorCsv);
Chris Schaller
  • 13,704
  • 3
  • 43
  • 81
0

Since you write about solving it with lists, I assume you can load those values from the CSV to the lists, so let's start with:

List<string> countriesIn1st = LoadDataFrom1stCsv();
List<string> countriesIn2nd = LoadDataFrom2ndCsv();

Then you can easily solve it with linq:

List<string> countriesNotIn2nd = countriesIn1st.Where(country => !countriesIn2nd.Contains(country)).ToList();

Now you have your third list with countries that are in first, but not in the second list. You can save it.

JTO
  • 156
  • 2