-1

I am retrieving two CSVs from an API, one called students.csv similar to:

StudentNo,PreferredFirstnames,PreferredSurname,UPN
111, john, smith, john@email.com
222, jane, doe, jane@email.com

one called rooms.csv:

roomName, roomNo, students
room1, 1, {@{StudentNo=111; StudentName=john smith; StartDate=2018-01-01T00:00:00; EndDate=2018-07-06T00:00:00},....
room2, 2,{@{StudentNo=222; StudentName=jane doe; StartDate=2018-01-01T00:00:00; EndDate=2018-07-06T00:00:00},...   

The third column in rooms.csv is an array as provided by the API

What is the best way to consolidate the two into

StudentNo,PreferredFirstnames,PreferredSurname,UPN, roomName
111, john, smith, john@email.com, room1
222, jane, doe, jane@email.com, room2

Im thinking something like...

$rooms = Import-Csv rooms.csv
$students  = Import-Csv students.csv
$combined = $students | select-object StudentNo,PreferredSurname,PreferredFirstnames,UPN,
@{Name="roomName";Expression={ ForEach ($r in $rooms) {
    if ($r.Students.StudentNo.Contains($_.StudentNo) -eq "True") 
{return $r.roomName}}}} 

This works, but is the foreach the right way to go am i mixing things up or is there a more efficient way???

--- Original Post ---

With all of this information I need to compare the student data and update AzureAD and then compile a list of data including first name, last name, upn, room and others that are retrieved from AzureAD.

My issue is "efficiency". I have code that mostly works but it takes hours to run. Currently I am looping through students.csv and then for each student looping through rooms.csv to find the room they're in, and obviously waiting for multiple api calls in-between all this.

What is the most efficient way to find the room for each student? Is importing the CSV as a custom PSObject comparable to using hash tables?

Ian
  • 3
  • 1
  • 4
  • make one call to active directory that gets all rooms and users assigned to the rooms, save this into an array/list. Then you can just lookup the rooms that have the student in the array/list without multiple calls back to active directory – Any Moose Sep 12 '18 at 01:25
  • Also remember you can call all of the power of .net inside powershell... https://stackoverflow.com/questions/26123/accessing-net-components-from-powershell – Any Moose Sep 12 '18 at 01:27
  • 1
    Can you post your active directory query, please. We can then alter it into an answer for you... – Any Moose Sep 12 '18 at 01:29
  • 1
    Can you give us an estimate of how many students are in one file? The volume of data could influence the choice of a strategy. – Walter Mitty Sep 12 '18 at 01:55
  • 1
    Can you provide your working-but-not-so-efficient code? That would be helpful to potential respondents whether they suggest improvements or a complete rewrite. – Lance U. Matthews Sep 12 '18 at 03:45
  • Thanks for the help, I've updated the original post to clear a few things up. @BACON the codes is currently over 300 lines, as I say hopefully the original post is better now. – Ian Sep 12 '18 at 09:27
  • I think your proposed code makes things less, not more, clear. What is `$students`? Is it used to create `students.csv` or is it created from `students.csv`? The properties specified in `Select-Object` don't exist in `students.csv`. Similarly, when iterating over `$rooms` you are accessing `Type` and `Groupname` members that don't exist in `rooms.csv`. Also, how are you storing an array in the `students` column? Please provide a [MCVE] of your actual code and sample data. There is too much information that is conflicting or missing and it is difficult to comment on code we can't see. – Lance U. Matthews Sep 12 '18 at 21:04
  • Im trying for third time lucky – Ian Sep 13 '18 at 01:22
  • @Ian Did my answer solve your problem or at least point you in the right direction? Some feedback would be appropriate and appreciated considering the amount of time I've put into it. – Lance U. Matthews Sep 23 '18 at 18:25

1 Answers1

0

I was able to get your proposed code to work but it requires some tweaks to the code and data:

  • There must be some additional step where you are deserializing the students column of rooms.csv to a collection of objects. It appears to be a ScriptBlock that evaluates to an array of HashTables, but some changes to the CSV input are still needed:
    • The StartDate and EndDate properties need to be quoted and cast to [DateTime].
    • At least for rooms that contain multiple students, the value must be quoted so Import-Csv doesn't interpret the , separating array elements as an additional column.
  • The downside of using CSV as an intermediate format is the original property types are lost; everything becomes a [String] upon import. Sometimes it's desirable to cast back to the original type for efficiency purposes, and sometimes it's absolutely necessary in order for certain operations to work. You could cast those properties every time you use them, but I prefer to cast them once immediately after import.

With those changes rooms.csv becomes...

roomName, roomNo, students
room1, 1, "{@{StudentNo=111; StudentName='john smith'; StartDate=[DateTime] '2018-01-01T00:00:00'; EndDate=[DateTime] '2018-07-06T00:00:00'}}"
room2, 2, "{@{StudentNo=222; StudentName='jane doe'; StartDate=[DateTime] '2018-01-01T00:00:00'; EndDate=[DateTime] '2018-07-06T00:00:00'}}"

...and the script becomes...

# Replace the [String] property "students" with an array of [HashTable] property "Students"
$rooms = Import-Csv rooms.csv `
    | Select-Object `
        -ExcludeProperty 'students' `
        -Property '*', @{
            Name = 'Students'
            Expression = {
                $studentsText = $_.students
                $studentsScriptBlock = Invoke-Expression -Command $studentsText
                $studentsArray = @(& $studentsScriptBlock)

                return $studentsArray
            }
        }
# Replace the [String] property "StudentNo" with an [Int32] property of the same name
$students = Import-Csv students.csv `
    | Select-Object `
        -ExcludeProperty 'StudentNo' `
        -Property '*', @{
            Name = 'StudentNo'
            Expression = { [Int32] $_.StudentNo }
        }
$combined = $students `
    | Select-Object -Property `
        'StudentNo', `
        'PreferredSurname', `
        'PreferredFirstnames', `
        'UPN', `
        @{
            Name = "roomName";
            Expression = {
                foreach ($r in $rooms)
                {
                    if ($r.Students.StudentNo -contains $_.StudentNo)
                    {
                        return $r.roomName
                    }
                }

                #TODO: Return text indicating room not found?
            }
        }

The reason this can be slow is because you are performing a linear search - two of them, in fact - for every student object; first through the collection of rooms (foreach), then through the collection of students in each room (-contains). This can quickly turn into a lot of iterations and equality comparisons because in every room to which the current student is not assigned you are iterating the entire collection of that room's students, on and on until you do find the room for that student.

One easy optimization you can make when performing a linear search is to sort the items you're searching (in this case, the Students property will be ordered by the StudentNo property of each student)...

# Replace the [String] property "students" with an array of [HashTable] property "Students"
$rooms = Import-Csv rooms.csv `
    | Select-Object `
        -ExcludeProperty 'students' `
        -Property '*', @{
            Name = 'Students'
            Expression = {
                $studentsText = $_.students
                $studentsScriptBlock = Invoke-Expression -Command $studentsText
                $studentsArray = @(& $studentsScriptBlock) `
                    | Sort-Object -Property @{ Expression = { $_.StudentNo } }

                return $studentsArray
            }
        }

...and then when you're searching that same collection if you come across an item that is greater than the one you're searching for you know the remainder of the collection can't possibly contain what you're searching for and you can immediately abort the search...

@{
    Name = "roomName";
    Expression = {
        foreach ($r in $rooms)
        {
            # Requires $room.Students to be sorted by StudentNo
            foreach ($roomStudentNo in $r.Students.StudentNo)
            {
                if ($roomStudentNo -eq $_.StudentNo)
                {
                    # Return the matched room name and stop searching this and further rooms
                    return $r.roomName
                }
                elseif ($roomStudentNo -gt $_.StudentNo)
                {
                    # Stop searching this room
                    break
                }

                # $roomStudentNo is less than $_.StudentNo; keep searching this room
            }
        }

        #TODO: Return text indicating room not found?
    }
}

Better yet, with a sorted collection you can also perform a binary search, which is faster than a linear search*. The Array class already provides a BinarySearch static method, so we can accomplish this in less code, too...

@{
    Name = "roomName";
    Expression = {
        foreach ($r in $rooms)
        {
            # Requires $room.Students to be sorted by StudentNo
            if ([Array]::BinarySearch($r.Students.StudentNo, $_.StudentNo) -ge 0)
            {
                return $r.roomName
            }
        }

        #TODO: Return text indicating room not found?
    }
}

The way I would approach this problem, however, is to use a [HashTable] mapping a StudentNo to a room. There is a little preprocessing required to build the [HashTable] but this will provide constant-time lookups when retrieving the room for a student.

function GetRoomsByStudentNoTable()
{
    $table = @{ }

    foreach ($room in $rooms)
    {
        foreach ($student in $room.Students)
        {
            #NOTE: It is assumed each student belongs to at most one room
            $table[$student.StudentNo] = $room
        }
    }

    return $table
}

# Replace the [String] property "students" with an array of [HashTable] property "Students"
$rooms = Import-Csv rooms.csv `
    | Select-Object `
        -ExcludeProperty 'students' `
        -Property '*', @{
            Name = 'Students'
            Expression = {
                $studentsText = $_.students
                $studentsScriptBlock = Invoke-Expression -Command $studentsText
                $studentsArray = @(& $studentsScriptBlock)

                return $studentsArray
            }
        }
# Replace the [String] property "StudentNo" with an [Int32] property of the same name
$students = Import-Csv students.csv `
    | Select-Object `
        -ExcludeProperty 'StudentNo' `
        -Property '*', @{
            Name = 'StudentNo'
            Expression = { [Int32] $_.StudentNo }
        }
$roomsByStudentNo = GetRoomsByStudentNoTable
$combined = $students `
    | Select-Object -Property `
        'StudentNo', `
        'PreferredSurname', `
        'PreferredFirstnames', `
        'UPN', `
        @{
            Name = "roomName";
            Expression = {
                $room = $roomsByStudentNo[$_.StudentNo]
                if ($room -ne $null)
                {
                    return $room.roomName
                }

                #TODO: Return text indicating room not found?
            }
        }

You can ameliorate the hit of building $roomsByStudentNo by doing so at the same time as importing rooms.csv...

# Replace the [String] property "students" with an array of [HashTable] property "Students"
$rooms = Import-Csv rooms.csv `
    | Select-Object `
        -ExcludeProperty 'students' `
        -Property '*', @{
            Name = 'Students'
            Expression = {
                $studentsText = $_.students
                $studentsScriptBlock = Invoke-Expression -Command $studentsText
                $studentsArray = @(& $studentsScriptBlock)

                return $studentsArray
            }
        } `
    | ForEach-Object -Begin {
        $roomsByStudentNo = @{ }
    } -Process {
        foreach ($student in $_.Students)
        {
            #NOTE: It is assumed each student belongs to at most one room
            $roomsByStudentNo[$student.StudentNo] = $_
        }

        return $_
    }

*Except for on small arrays

Lance U. Matthews
  • 15,725
  • 6
  • 48
  • 68
  • Thanks for your efforts with this, it got me going in the right direction and the script now runs in under a minute. – Ian Oct 14 '18 at 09:58
  • Remember that on any question you can [upvote or downvote](https://stackoverflow.com/help/why-vote) answers based on whether you feel they are helpful or not, and on your own questions you can [accept an answer](https://stackoverflow.com/help/someone-answers) that you feel best solves your problem. – Lance U. Matthews Oct 18 '18 at 02:27