
I have a 1.3 GB TXT file (huge thing). I want to build a script that does two things:

  1. Every line starts with a unique ID. For all lines sharing the same ID, I want to check whether the conditions are met for that "group". (This answers: for how many lines with unique ID X have all conditions been met?)
  2. When the script is finished, I want to remove all lines from the TXT file where the condition was met (see 1), so I can rerun the script with another condition set to "narrow down" the whole document.

After a few cycles I finally have a set of conditions that applies to all lines in the document. My current approach is very slow (one cycle takes hours). If you see an easier way to do this, feel free to recommend it. Help is welcome :)

Code so far (does not yet fulfill everything from 1 & 2):

foreach ($item in $liste)
{
    # Check conditions
    if ( ($item -like "*XXX*") -and ($item -like "*YYY*") -and ($item -notlike "*ZZZ*") ) {

        # Add a line to a document to see which lines match the condition
        Add-Content "C:\Desktop\it_seems_to_match.txt" "$item"

        # Retrieve the unique ID from the line and feed the array
        $array += $item.Split("/")[1]

        # Remove the line from the final document
        $liste = $liste -replace $item, ""
    }
}
# Pipe the "new cleaned" list somewhere
$liste | Set-Content -Path "C:\NewListToWorkWith.txt"
# Show me the counts
$array | group | % { $h = @{} } { $h[$_.Name] = $_.Count } { $h } | Out-File "C:\Desktop\count.txt"

Demo Lines:

images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg

Julian
  • If you're dealing with very large numbers of items in ```$liste``` then ```$array += $item.Split("/")[1]``` is going to get exponentially slower because it appends by *copying* the entire array and putting the new item at the end of the copy, and as ```$array``` gets bigger that takes longer and longer to do. Since you're only using ```$array``` to summarise the counts, consider tracking the counts inside your ```foreach``` loop instead - e.g. above the ```foreach``` put ```$counts = @{}``` and then instead of ```$array = ...``` do ```$name = $item.Split("/")[1]; $counts[$name] += 1```... – mclayton Feb 09 '23 at 10:02
  • Did you try Select-Object with -Unique? See: https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/select-object?view=powershell-7.3. You do not have to split items for the first pass, which will get rid of duplicates. – jdweng Feb 09 '23 at 10:46
  • Hi mclayton, I did exactly as you mentioned. The script still seems slow. When I Write-Host the $counts it's like 2 counts per second. My document has like xMillion lines. It seems my array is doing slow stuff. – Julian Feb 09 '23 at 10:56
  • @Julian - you've got a number of performance issues per @iRon's answer. The ```+=``` vs ```$counts``` optimisation will only really be evident after a large number of iterations since *that*'s when ```+=``` starts becoming progressively slower. If you want to measure the effect of any single change you should run the script to completion and see how long it takes as some optimisations won't be evident if you only use small datasets for input... – mclayton Feb 09 '23 at 11:23
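mclayton's suggestion above can be put together as a complete loop. A minimal sketch, assuming the condition strings from the question and an illustrative output path; the ID is taken as the second `/`-separated segment, matching the demo lines:

```powershell
# Track per-ID match counts in a hashtable instead of appending to an array.
$counts = @{}

foreach ($item in $liste) {
    if (($item -like "*XXX*") -and ($item -like "*YYY*") -and ($item -notlike "*ZZZ*")) {
        # e.g. "STRINGA" in "images/STRINGA/2XXXXXXXX_...jpg"
        $name = $item.Split("/")[1]
        # $null + 1 evaluates to 1 in PowerShell, so missing keys initialise themselves.
        $counts[$name] += 1
    }
}

# Write "ID: count" pairs to the summary file.
$counts.GetEnumerator() |
    ForEach-Object { '{0}: {1}' -f $_.Key, $_.Value } |
    Set-Content "C:\Desktop\count.txt"
```

Hashtable lookups and increments are O(1) per line, so this stays fast regardless of how many lines the file has, unlike `$array +=`, which copies the whole array on every append.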

2 Answers


Performance considerations: group the matching lines by ID in a hashtable of lists instead of appending to an array:

$HashTable = @{}   # define once, at the top of the script

# inside the filtering loop:
$Name = $item.Split("/")[1]
if (!$HashTable.Contains($Name)) { $HashTable[$Name] = [Collections.Generic.List[String]]::new() }
$HashTable[$Name].Add($Item)
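With `$HashTable` filled this way inside the filtering loop, the per-ID counts fall out of the list lengths, with no second pass over the data. A sketch; the output path is illustrative:

```powershell
# Each value is a List[string] of the matching lines for that ID,
# so .Count is the number of matches per ID.
$HashTable.GetEnumerator() |
    ForEach-Object { '{0}: {1}' -f $_.Key, $_.Value.Count } |
    Out-File "C:\Desktop\count.txt"
```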
iRon
    Ok, let me process this. Plenty of stuff, I will try it out and come back – Julian Feb 09 '23 at 11:00
  • Seems not to work properly. I stripped out everything I don't need, but it says "You cannot call a method on a null-valued expression." – Julian Feb 09 '23 at 13:48
  • The error points at line 38, ```if (!$HashTable.Contains($name)) { $HashTable[$name] = [C…```: "You cannot call a method on a null-valued expression." This is the loop: ```foreach ($item in $liste) { # Check Conditions if ( ($item -like "*_XXX_*") ) { $name = $item.Split("/")[1]; if (!$HashTable.Contains($name)) { $HashTable[$name] = [Collections.Generic.List[String]]::new() }; $HashTable[$name].Add($Item) } }``` – Julian Feb 09 '23 at 13:49
  • You need to define the hash table (`$HashTable = @{}`) somewhere at the beginning of your script – iRon Feb 09 '23 at 15:42

To minimize memory usage it may be better to read one line at a time and check if it already exists. The code below uses a StringReader; replace it with a StreamReader to read from a file. I'm checking whether the entire string exists, but you may want to split the line first. Notice the input contains duplicates but the dictionary does not. See the code below:

$rows= @"
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/2XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGA/3XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/4XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGB/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
images/STRINGC/5XXXXXXXX_rTTTTw_GGGG1_Top_MMM1_YY02_ZZZ30_AAAA5.jpg
"@

$dict = [System.Collections.Generic.Dictionary[int, System.Collections.Generic.List[string]]]::new();
$reader = [System.IO.StringReader]::new($rows)
while(($row = $reader.ReadLine()) -ne $null)
{
   $hash = $row.GetHashCode()
   if($dict.ContainsKey($hash))
   {
      #check if list contains the string
      if($dict[$hash].Contains($row))
      {
         #string is a duplicate
      }
      else
      {
         #add string to the dictionary's list if it is not in the list yet
         $list = $dict[$hash]
         $list.Add($row)
      }
   }
   else
   {
      #add new hash value to dictionary
      $list = [System.Collections.Generic.List[string]]::new();
      $list.Add($row)
      $dict.Add($hash, $list)
   }
}
$dict
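For pure line-level de-duplication, the manual hash-bucketing above can also be replaced by a HashSet, which handles hashing and collision checks internally. A sketch over the same `$rows` input; `Add()` is the standard .NET `HashSet[T]` method, which returns `$false` when the element is already present:

```powershell
$seen   = [System.Collections.Generic.HashSet[string]]::new()
$unique = [System.Collections.Generic.List[string]]::new()

$reader = [System.IO.StringReader]::new($rows)   # swap for StreamReader on a real file
while ($null -ne ($row = $reader.ReadLine())) {
    # Add() returns $false if the line was already seen; keep first occurrences only.
    if ($seen.Add($row)) { $unique.Add($row) }
}
$unique
```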
jdweng
  • Here is a similar issue where StreamReader solved it: https://stackoverflow.com/questions/75379716/trouble-splitting-a-9-gb-csv-file-via-powershell/75381425#comment133051033_75381425 – jdweng Feb 10 '23 at 11:08