
I have imported a very large CSV table via PowerShell into an array of objects. Each object looks like this:

$test = [PsCustomObject]@{id='1';name='a'}

The problem is that all values are of type 'string', and I need to work with the correct types later on in my code. For this I want to use embedded C# code that converts the values of all object properties into an array, so I can add them to a DataTable with correct type definitions.

I am now struggling with the C# part. Here is my code sample, which is not working. How can I change the C# part so that it converts the object values into an array?

# this code should convert the object-values into an array:
Add-Type -ReferencedAssemblies System.Data -TypeDefinition '
using System;
public class row {
    public object[] row(object test) {
        int id = Int32.Parse(test.id);
        string name = test.name;
        return (id, name);
    }
}'

# this is the test-object:
$test = [PsCustomObject]@{id='1';name='a'}

# this is the final table with correct types:
$table = [System.Data.DataTable]::new()
[void]$table.Columns.Add('id', [int])
[void]$table.Columns.Add('name', [string])

# at the end it should work like this:
$row = [row]::new($test)
$table.Rows.Add($row)

I did some tests without C#, but this is very slow. E.g. this simple loop (even without adding the data to a row) runs for over 20 seconds:

$test = [PsCustomObject]@{id='1';name='a'}
foreach($x in 1..1MB) {
    $row = foreach($i in $test.PsObject.Properties.Value){if ($i) {$i} else {$null}}
    #[void]$table.rows.Add($row)
}

So in theory I need to do the same as in the last code block, but via embedded C# code.

How can I get this done in an efficient way?

Update #1: Thanks to the input from Theo I could speed up the conversion. I did not expect that this would be five times faster than just querying PsObject properties. E.g. it turns out that an 'else' statement is slower than just assigning the variable first. Here is the code for comparison:

$test = [PsCustomObject]@{id='1';name='a'}
foreach($x in 1..1MB) {
    $id = $test.id
    if ([string]::IsNullOrEmpty($id)){$id = $null}
    $name = $test.name
    $row = @($id, $name)
}

But it is still the slowest part of my overall code, so I am still looking for a smart C# solution. My idea is that if the input object later has any other properties, I can just dynamically rebuild the C# code. That will not work with pure PS code.
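To illustrate the dynamic-rebuild idea (just a sketch of my plan, not working code from my script): the property names can be read once from the first object and spliced into the type-definition string before calling Add-Type.

```powershell
# sketch only: derive the C# member list from the object's property names
$test = [PsCustomObject]@{id='1';name='a'}
$props = $test.PsObject.Properties.Name
$fields = ($props | ForEach-Object { "    public string $_ {get;set;}" }) -join "`n"
$source = "public class record {`n$fields`n}"
$source   # this string would become the -TypeDefinition argument of Add-Type
```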

Update #2: Based on the input from BACON I was able to solve the challenge with the C# code. Here is my working implementation:

Add-Type -TypeDefinition '
using System;
public class test {
    public string id {get;set;}
    public string name {get;set;}
}
public static class DataParser {
    public static string[] ParseToArray(test data) {
        string id = data.id;
        if (String.IsNullOrEmpty(id)) {id = null;}
        string name = data.name;
        return new string[] {id,name};
    }
}'

# this is the test-object:
$test = [PsCustomObject]@{id='';name='a'}

$timer = [System.Diagnostics.Stopwatch]::StartNew()
foreach($x in 1..1MB) {
    $row = [DataParser]::ParseToArray($test)
}
$timer.Stop()
$timer.Elapsed.TotalSeconds

What I did not expect is the runtime of this solution: it is way slower than the pure PowerShell version I posted above. So my conclusion is "mission accomplished, but we missed the target", which means there is no really efficient way to convert object values into an array.

As a result, I will step away from importing CSV data as objects and will focus on importing large data as XML via 'dataSet.ReadXml'. I only wish there were a built-in option to directly import CSV data as arrays or DataRows.
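As a sketch of that ReadXml route (the table name 'demo' here is made up): writing a DataTable together with its schema and reading it back preserves the column types, so no per-row string conversion is needed at all.

```powershell
# build a small typed table, write it to XML with its schema, then read it back
$table = [System.Data.DataTable]::new('demo')
[void]$table.Columns.Add('id', [int])
[void]$table.Columns.Add('name', [string])
[void]$table.Rows.Add(1, 'a')

$path = Join-Path ([IO.Path]::GetTempPath()) 'demo.xml'
$table.WriteXml($path, [System.Data.XmlWriteMode]::WriteSchema)

$dataSet = [System.Data.DataSet]::new()
[void]$dataSet.ReadXml($path)
$dataSet.Tables['demo'].Rows[0].id.GetType().Name   # the 'id' column comes back typed as Int32
```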

Jason Aller
Carsten
  • Does this answer your question? [Convert from PSObject to Datatable](https://stackoverflow.com/questions/46352448/convert-from-psobject-to-datatable) and [How to set the default output of a PowerShell class](https://stackoverflow.com/q/60487695/1701026) – iRon Aug 08 '20 at 12:08
  • Not really. I need to insert some hundreds of thousands of objects of the above format into the DataTable. This is why I am looking for an implementation via C# to do the conversion from the attribute values into an array that can be used to create a DataRow. – Carsten Aug 08 '20 at 14:53
  • How can I get a negative rating for a question related to a real challenge I had? – Carsten Aug 29 '20 at 09:11
  • I guess that your question got (initially) downvoted as it is a bit of an [XY Problem](https://en.wikipedia.org/wiki/XY_problem), as also appears from your conclusion. Why convert it to a DataTable? What makes you think that your C# class will outperform native PowerShell? If you properly stream objects through the PowerShell pipeline you save a lot of memory and it might simply be faster, but you are not showing anything that is "slow". By the way, the actual problem originates from the csv file (not PowerShell), which doesn't support any other type than a string. You might simply use a calculated expression: – iRon Aug 30 '20 at 07:55
  • `.. | Select-Object @{n='ID'; e={ [Int]$_.ID }}, Name | ...` – iRon Aug 30 '20 at 07:55
  • Anyways, I have upvoted it as "***This question shows research effort, it is useful and is clear***". From my view, it might especially be useful for people underestimating PowerShell based on their C# knowledge... . What can still be improved in order to help you is a performance reference to a common PowerShell approach you tried, information about the input (as apparently you do have control over this) and output (why a DataTable?). This is especially important for PowerShell, as a well-written PowerShell end-to-end (pipeline) solution is supposed to be faster than the sum of its parts. – iRon Aug 30 '20 at 08:49
  • I pointed out twice — once in **bold** — why this usage of C# is not improving performance, and twice that you need to show more code. Instead, the question was edited to conclude "we missed the target" (we?) and that this approach can't work. In response to my comment suggesting you re-read parts of my answer, you asked a question that would have been answered by doing just that! When the author of a question clearly isn't reading the feedback they solicit, I'm not going to keep repeating myself; I'm going to warn future readers not to bother. @iRon is exactly right about the XY problem, too. – Lance U. Matthews Aug 31 '20 at 03:24
  • @BACON: I read your answer and voted for that plus I put it into the updates to demo the technique and to prove that you are right that this will NOT make the script faster. – Carsten Sep 04 '20 at 16:04

2 Answers


A pure PowerShell solution would be:

[int]$refInt = 0 # create an int as reference variable for TryParse()
foreach($item in $test) {
    # get the int value or $null for the id property
    $rowId = if ([int]::TryParse($item.id, [ref]$refInt)) { $refInt } else { $null }
    # get the string value or $null for the name property
    $rowName = $item.name.ToString()    # added ToString() for good measure
    if ([string]::IsNullOrWhiteSpace($rowName)) { $rowName = $null }
    # add a new row to the table
    $newRow = $table.NewRow()
    $newRow["id"]   = $rowId
    $newRow["Name"] = $rowName
    $null = $table.Rows.Add($newRow)
}

I'm not really into C#, but I think you need to use TryParse() there as well in order to get either an int or a $null value. As for the name property, you should also check it for NullOrWhiteSpace and use the ToString() method on it to make sure you get a valid string or $null.

Theo
  • If `TryParse()` fails it will set its second parameter to `0`, so if that's outside the range of valid IDs and one doesn't need `$null`, specifically, to represent "failed to parse" then `[int] $rowId = -1; [void] [int]::TryParse($item.id, [ref] $rowId)` would be a bit faster. The code in the question does not set the [`AllowDBNull` property](https://learn.microsoft.com/dotnet/api/system.data.datacolumn.allowdbnull) of the `DataColumn`s it creates, although it does default to `$true`. – Lance U. Matthews Aug 10 '20 at 23:37
  • Is there any way to dynamically create the PowerShell code for this conversion, to skip the very slow implementation around reading the PSObject.Properties of an object and instead use the properties in a hardcoded way later in the code? – Carsten Aug 11 '20 at 07:06

You don't say how your C#-based attempt is "not working", but I can see some problems with...

using System;
public class row {
    public object[] row(object test) {
        int id = Int32.Parse(test.id);
        string name = test.name;
        return (id, name);
    }
}

This is defining a class named row with an instance method (not a constructor) also named row. You don't define any constructors so the row class will have only a default, parameterless constructor. When you do...

$row = [row]::new($test)

...you are trying to invoke a row constructor overload that doesn't exist.

Further, the return type of row() is object[], yet (id, name) is a (value) tuple, not an array. Some conversion from the former to the latter is necessary for that to compile.

Invoking your Add-Type command I am reminded that...

Add-Type: (4,21): error CS0542: 'row': member names cannot be the same as their enclosing type
   public object[] row(object test) {
                   ^

...which explains itself, and...

Add-Type: (5,35): error CS1061: 'object' does not contain a definition for 'id' and no accessible extension method 'id' accepting a first argument of type 'object' could be found (are you missing a using directive or an assembly reference?)

...which means that since the compile-time type of the test parameter is object, unless you cast it to a more specific type you'll only be able to access the members of the object class, which has no properties or fields. Since the run-time type of test will be PSCustomObject — which is a bit of a "magical" type — the typical (and slow) C# way of dynamically accessing the id and name properties with reflection won't work.

So, the problem is, basically, that despite having some superficial similarities, C# is very different from PowerShell and cannot be written as such. You could fix the issues above by stuffing the values into a type that is friendlier for C# to access, like this...

using System;
public static class DataParser {
    public static object[] ParseToArray(Tuple<string, string> data) {
        int id = int.Parse(data.Item1);
        string name = data.Item2;
        return new object[] { id, name };
    }
}

Note that the Tuple<,> generic type is not the same kind of tuple as linked previously; that one requires C# 7.0, so, for better compatibility, I'm not using it here. You could then call the above method like this...

$testTuple = [Tuple]::Create($test.id, $test.name)
$testAsArray = [DataParser]::ParseToArray($testTuple)
$table.Rows.Add($testAsArray)

Even simpler would be to eliminate the intermediate object and just pass the properties via parameters...

using System;
public static class DataParser {
    public static object[] ParseToArray(string id, string name) {
        return new object[] { int.Parse(id), name };
    }
}

...and call it like this...

$testAsArray = [DataParser]::ParseToArray($test.id, $test.name)
$table.Rows.Add($testAsArray)

Seeing as how neither method implementation is doing much more than stuffing their inputs into an array, the next — and best — optimization is to recognize that the C# code isn't doing enough work to justify its usage and remove it entirely. Thus, we just create the array directly in PowerShell...

$testAsArray = [Int32]::Parse($test.id), $test.name
$table.Rows.Add($testAsArray)

Now, that simplifies your code, but it doesn't achieve the goal of making it faster. As I said, you need to do more work inside the C# method — like accepting all of the input records, parsing them as appropriate, and populating the DataTable — to make it worthwhile. For that, I think you'd need to show more of your code; specifically, how you go from CSV text to in-memory records, and if each record really is stored as a PSCustomObject (as returned by Import-Csv) or something else.

Lance U. Matthews
  • Thank you for your answer. That is very helpful and gives me a lot of insight. Unfortunately it does not work for my idea of dynamically creating the C# code, to make sure the PS part will not include any hardcoded properties like 'id' and 'name'. In the end I want to use that C# snippet to convert ANY type of object's values into an array, especially when the object has more than 8 properties, which makes it impossible to work with tuples without nesting them. – Carsten Aug 11 '20 at 06:06
  • Unless you want it to parse anything that looks like an `[Int32]` to an `[Int32]`, etc., you're going to have to hard-code either properties names or indices _somewhere_ so the code, whether C# or PowerShell, knows which properties need to be converted to which types. [Here's a recent PowerShell answer of mine](https://stackoverflow.com/a/61672548/150605) that implements both approaches. Without seeing more code I can't offer anything further, but as far as the questions you posed in your deleted answer I suggest you re-read the last two paragraphs of this answer. – Lance U. Matthews Aug 11 '20 at 07:06
  • Thank you for helping here. Please see my update#2 if this makes sense or not. Any visible reason why this is so slow? – Carsten Aug 11 '20 at 07:08