According to this post, the maximum number of elements an array (a string[] in my example) can hold is 2 146 435 071. I want to split a string into a list, but the number of elements could be well over 10 000 000 000, so this may mean not using the Split method; if it can still be used, that is also fine.
How can I do this with the best performance?

  • I tried recursively removing the first occurrence (described here) and adding it to the list until no delimiter remains, but this is very slow
  • The resulting elements of the list may span several lines

Here is my code before any changes:

var allTokens = allText.Split(Delimiters).ToList();

Example of allText value:

fgfg,ghgh,"gjhj
hghdg,hjhgj",ghg
ghgh,kiwj,fhgfg,
hsk,,jw,"address line1
adrress line 2
zip code
country"

Problem: a large file causes an OutOfMemoryException

Bonron
  • Dear lord that's a lot of text. Is it not an option to split your original string into multiple strings, and then split them into arrays which are under the maximum size? – Sach Aug 18 '17 at 18:44
  • I recommend you read one line at a time from the file and do your processing from there rather than reading the entire file into memory. Also whatever you need that list for I suggest dealing with it as you read from the file instead of having it all loaded in a list if possible. Also based on your example I don't think `Split` is what you want as I'm guessing you don't want to split on commas inside of double quotes anyway. – juharr Aug 18 '17 at 18:44
  • You cannot have a `System.String` (before the splitting) that is so long, either. With usual large object limits, a string can be no longer than approx `2 ** 30` (two to the 30th power) `char` values of 2 bytes each. – Jeppe Stig Nielsen Aug 18 '17 at 19:14
  • Thanks @Sach, I used your suggestion. – Bonron Aug 19 '17 at 15:30
  • I avoided reading line by line because gjhj\nhghdg in the example above would end up not being one token. I also noticed that my maximum on the array was just above 85 900 000. – Bonron Aug 19 '17 at 15:31
  • @juharr your guess is right, but I may rejoin after I split. In some cases I rejoin based on a different char, so I split everything. – Bonron Aug 20 '17 at 16:41

1 Answer

To solve this, I used the approach in the pseudo-code below:

[1] check whether allText is large (length greater than a threshold that depends on the expected frequency of delimiters in the string)
[2] repeat: if a string is still too large, split it into two strings
[3] split each of the resulting strings using the Split method (or split allText directly if it is small)
[4] initialise the list
[5] AddRange each string[] from [3] to the list

This follows the comment by @Sach on the question.
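
The steps above can be sketched roughly as follows. This is only an illustration, not the exact code I ran: the class name ChunkedSplitter, the method SplitInChunks and the maxChunkLength parameter are hypothetical names, and instead of halving the string repeatedly the sketch walks through it in windows of at most maxChunkLength characters. It assumes every cut must land on a delimiter so that no token is broken in two, and like the original Split call it does not treat quoted fields specially.

```csharp
using System;
using System.Collections.Generic;

public static class ChunkedSplitter
{
    // Hypothetical helper: splits `text` piece by piece so that no single
    // Split call has to produce one enormous string[].
    public static List<string> SplitInChunks(string text, char[] delimiters, int maxChunkLength)
    {
        if (maxChunkLength <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxChunkLength));

        var tokens = new List<string>();              // step [4]
        int start = 0;

        // Steps [1] and [2]: keep cutting while the remaining text is "large".
        while (text.Length - start > maxChunkLength)
        {
            // Look backwards inside the current window for the last delimiter,
            // so the cut never lands in the middle of a token.
            int cut = text.LastIndexOfAny(delimiters, start + maxChunkLength - 1, maxChunkLength);
            if (cut < 0)
            {
                // No delimiter in the window: extend to the next one, if any.
                cut = text.IndexOfAny(delimiters, start + maxChunkLength);
                if (cut < 0) break;                   // the rest is a single token
            }

            // Steps [3] and [5]: split the chunk and append its tokens.
            tokens.AddRange(text.Substring(start, cut - start).Split(delimiters));
            start = cut + 1;                          // skip the delimiter used as the cut point
        }

        // Remaining tail (or the whole text if it was small to begin with).
        tokens.AddRange(text.Substring(start).Split(delimiters));
        return tokens;
    }
}
```

Called as ChunkedSplitter.SplitInChunks(allText, Delimiters, someThreshold), this should give the same tokens as allText.Split(Delimiters).ToList(), only built up in smaller pieces. Note that a List<string> is itself backed by an array, so the element limit mentioned in the question still applies to the final list, and allText still has to fit in a single string.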

Bonron