
I am trying to perform some analysis on a text file containing approximately ten million passwords, one per line. I was doing this by reading each line of the file, constructing an object with the line's value as a constructor argument, and then adding that object to a list. Around line 4,000,000 I get an out-of-memory exception. Short of storing everything in a SQL database, is there anything else that could be done?

Edit: What I am trying to do is take each password, wrap it in a Credential object, and then add that object to a list.

public class Credential
{
    public string Password { get; set; }

    public static readonly List<string> specialCharacters = new List<string> { "@", "!", "~", "*", "^", "&", "\\", "/", "#", "$", "%", "<", ">", ".", ",", "?", ")", "(", "'", "\"", "+", "=", "_", "-", ";", ":", "{", "}", "]", "[" };

    public Credential(string password)
    {
        this.Password = password;
        this.Mapping = new Dictionary<int, CredentialValueType>();
        for (var i = 0; i < this.Length; i++)
        {
            this.Mapping.Add(i, new CredentialValueType(this.Password[i]));
        }
    }

    public Dictionary<int, CredentialValueType> Mapping { get; private set; }

    public int Length
    {
        get
        {
            return this.Password.Length;
        }
    }

    public bool HasUppercase
    {
        get
        {
            return this.Password.Any(c => char.IsUpper(c));
        }
    }

    public bool HasLowercase
    {
        get
        {
            return this.Password.Any(c => char.IsLower(c));
        }
    }

    public bool HasNumber
    {
        get
        {
            return this.Password.Any(c => char.IsNumber(c));
        }
    }
    public bool HasSpecialCharacter
    {
        get
        {
            return this.Password.Any(c => specialCharacters.Contains(c.ToString()));
        }
    }
}

public struct CredentialValueType
{
    public char Value { get; set; }
    public ValueType ValueType { get; set; }

    public CredentialValueType(char val)
    {
        // Assigning to 'this' first allows the auto-properties to be set
        // in a struct constructor (required before C# 6).
        this = new CredentialValueType();
        this.Value = val;
        if (char.IsUpper(val)) this.ValueType = PasswordStats.ValueType.UpperCase;
        else if (char.IsLower(val)) this.ValueType = PasswordStats.ValueType.LowerCase;
        else if (char.IsNumber(val)) this.ValueType = PasswordStats.ValueType.Number;
        else this.ValueType = PasswordStats.ValueType.SpecialCharacter;
    }
}
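
(The ValueType enum referenced above is not shown in the post; from its usage it is presumably something like the following, defined in the PasswordStats namespace.)

public enum ValueType
{
    UpperCase,
    LowerCase,
    Number,
    SpecialCharacter
}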

My function is as follows:

public class PasswordAnalyzer
{
    public IList<Credential> Credentials { get; private set; }

    public PasswordAnalyzer(string file, int passwordField = 0, Delimiter delim = Delimiter.Comma)
    {
        this.Credentials = new List<Credential>();
        using (var fileReader = File.OpenText(file)) //Verify UTF-8
        {
            using (var csvReader = new CsvHelper.CsvReader(fileReader))
            {
                csvReader.Configuration.Delimiter = "\t";
                while (csvReader.Read())
                {
                    var record = csvReader.GetField<string>(passwordField);
                    this.Credentials.Add(new Credential(record));
                    System.Diagnostics.Debug.WriteLine(this.Credentials.Count);
                }
            }
        }
    }
}
appsecguy
  • What's your actual code? Are you using File.ReadLines()? – Нет войне Feb 16 '15 at 17:59
  • This seems to address a similar issue: http://stackoverflow.com/questions/27561324/what-is-the-fast-process-to-find-the-duplicate-row-from-a-csv-file/27561351#27561351 – Krumelur Feb 16 '15 at 17:59
  • Buy more RAM. Or do your analysis in increments (like 1M at a time). – Pierre-Luc Pineault Feb 16 '15 at 18:00
  • 1. Get more memory. 2. Can you process in batches? 3. If doing any kind of aggregation, aggregate as you process (store sums and counts in separate variables and increment as you go to avoid loading everything into memory; see the sketch after these comments). 4. More detail on the type of analysis you are trying to do would be helpful. We are just stabbing in the dark. – Jeremy Feb 16 '15 at 18:01
  • @Pierre-LucPineault Getting more RAM is not going to magically make more address space available for a 32-bit process... – Alexei Levenkov Feb 16 '15 at 18:01
  • I would expect this to fit easily in even a 32-bit memory space. So unless this is on a Phone, you're doing something wrong. For a serious answer, show the code and/or detail how much data is on a 'line'. – H H Feb 16 '15 at 18:02
  • After the Edit: Nothing obviously wrong; do check what `GetField(passwordField)` actually returns. – H H Feb 16 '15 at 18:53
  • @HenkHolterman Keep in mind the object needs not just to fit in the 32-bit space, but to have a contiguous block of free memory. Also note that in creating the list, many intermediate backing arrays will have been created and discarded, both consuming memory and fragmenting it. This can result in errors even when there is more than enough actual free memory. – Servy Feb 16 '15 at 19:19
  • @Servy: I know all that. But you normally don't get enough fragmentation on the LOH with a 10M List. Not even close. – H H Feb 16 '15 at 19:21
  • The `IList` in your example is clearly the bottleneck that is limited by physical RAM and address space. Your example shows you are reading the credentials into a list, but doesn't show what you are *doing* with that list. What is compelling you to put the *entire* list into RAM in the first place? – NightOwl888 Feb 16 '15 at 19:30
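
A minimal sketch of the aggregate-as-you-go idea from the comments (the Analyze method and its counters are hypothetical, not from the question; assumes using System, System.IO, and System.Linq):

// Hypothetical single-pass analysis: nothing is retained per line,
// so memory use stays flat regardless of how many passwords are read.
public static void Analyze(string file)
{
    long total = 0, withUpper = 0, withDigit = 0, totalLength = 0;

    // File.ReadLines streams the file one line at a time
    foreach (var line in File.ReadLines(file))
    {
        total++;
        totalLength += line.Length;
        if (line.Any(char.IsUpper)) withUpper++;
        if (line.Any(char.IsDigit)) withDigit++;
    }

    if (total == 0) return; // avoid dividing by zero on an empty file

    Console.WriteLine("Passwords: {0}", total);
    Console.WriteLine("Average length: {0:F2}", (double)totalLength / total);
    Console.WriteLine("With uppercase: {0}", withUpper);
    Console.WriteLine("With digit: {0}", withDigit);
}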

2 Answers


Rather than creating 4 million dictionaries, you could store your Mapping in an array. The dictionary keys are just the character indexes, so an array holds the same information with much less overhead per Credential. That should save a lot of room, but without more information about how much memory is actually being consumed, it's hard to tell whether this alone will resolve your problem.
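
A minimal sketch of that change, reusing the question's types (only the Mapping property and the constructor differ):

// The index-keyed Dictionary is replaced with a plain array;
// the Mapping[i] lookup is unchanged for callers.
public CredentialValueType[] Mapping { get; private set; }

public Credential(string password)
{
    this.Password = password;
    this.Mapping = new CredentialValueType[password.Length];
    for (var i = 0; i < password.Length; i++)
    {
        this.Mapping[i] = new CredentialValueType(password[i]);
    }
}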

I'm assuming the code shown is not your actual analysis code, but if you just need to iterate through the lines, expose an IEnumerable and yield each result. That is much easier on memory, since only one "line" is in memory at a time.
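
A sketch of what that could look like for the question's reader (same CsvHelper usage as in the question; the ReadCredentials name is made up here):

// Streams Credential objects one at a time instead of building a List;
// nothing is retained unless the caller materializes the sequence.
public static IEnumerable<Credential> ReadCredentials(string file, int passwordField = 0)
{
    using (var fileReader = File.OpenText(file))
    using (var csvReader = new CsvHelper.CsvReader(fileReader))
    {
        csvReader.Configuration.Delimiter = "\t";
        while (csvReader.Read())
        {
            // Deferred execution: each Credential is produced on demand
            yield return new Credential(csvReader.GetField<string>(passwordField));
        }
    }
}

Aggregating queries can then run over the whole file without ever holding it in memory, e.g. ReadCredentials(file).Count(c => c.HasUppercase).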

Daryl
  • Not only do we have the overhead of 4 million dictionaries here but we have a dictionary entry for every character in every password. If we have an average of 8 chars/password that's 32 million entries. Each entry at a minimum consists of 4 bytes for the key and another byte for the type, no doubt padded out to 4 bytes. I don't know the internals of a dictionary but I rather suspect it points to the data--another 4 bytes. We are already up to 384 megabytes at a minimum. I'm sure there's more used by the allocated block tracking but I don't know the details. – Loren Pechtel Feb 16 '15 at 21:25

If you add the [Serializable] attribute to your Credential class and CredentialValueType struct, you can stream their state out to a file rather than holding them in an in-memory list.

[Serializable]
public class Credential
{
    //code omitted
}

[Serializable]
public struct CredentialValueType
{
    //code omitted
}

Store your credential objects as they are created:

var binFormatter = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();

// Open a file stream to write the objects into
using (var fs = new FileStream(@"C:\temp.dat", FileMode.Create))
{
    // Loop through your source file one line at a time
    // ('file' is the source path, as in the question's PasswordAnalyzer)
    foreach (var line in File.ReadLines(file))
    {
        // Convert the line to an object
        var credential = new Credential(line);

        // Serialize the Credential object onto the stream
        binFormatter.Serialize(fs, credential);
    }

    // Ensure the buffer is flushed before closing the stream.
    fs.Flush();
}

Now your credential objects can be processed one at a time by reading them back and deserializing them.

var binFormatter = new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();

using (var fs = new FileStream(@"C:\temp.dat", FileMode.Open, FileAccess.Read))
{
    // Loop until the end of the file
    while (fs.Position < fs.Length)
    {
        // Deserialize the next credential from the file stream
        var credential = (Credential)binFormatter.Deserialize(fs);

        // Process the credential here
    }
}

I have been using this to create feed files that are several GB in size on a machine that has less memory than the size of the file.

The downside is that there is no in-memory list of Credential objects to run LINQ over. But if you know what you are scanning for, you can arrange your process so the CSV file is parsed only once, and then loop through the serialized Credential objects as many times as needed to find the data you are looking for.

NightOwl888