I am reading the IMDb movie listing from a text file on my hard drive (originally available from the IMDb site at ftp://ftp.fu-berlin.de/pub/misc/movies/database/movies.list.gz).

It takes around 5 minutes on my machine (basic info: Win7 x64, 16 GB RAM, 500 GB SATA hard disk, 7200 RPM) to read this file line by line using the code below.

I have two questions:

  1. Is there any way I can optimize the code to improve the read time?

  2. Data access doesn't need to be sequential: I don't mind reading the data top to bottom, bottom to top, or in any other order, as long as it is read one line at a time. Is there a way to read in multiple directions to improve the read time?

The application is a Windows Console Application.

Update: Many responses correctly pointed out that writing to the console takes substantial time. Given that, displaying the data on the Windows console is now desirable but not mandatory.

//Code Block

string file = @"D:\movies.list";

FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.None, 8, FileOptions.None);

using (StreamReader sr = new StreamReader(fs))
{
  while (sr.Peek() >= 0)
  {
    Console.WriteLine(sr.ReadLine());
  }
}
Dave
  • Console.WriteLine will take a while if you do that for every line. What is the sense in listing **all** movies? – juergen d Jul 07 '12 at 04:51
  • Removing WriteLine is not an issue, but listing all the movies is a requirement and cannot be changed. Also, as I mentioned in my original question, the sequence is not important. – Dave Jul 07 '12 at 05:46
  • I would be concerned that reading from multiple directions at once is going to dramatically slow things down because of the back-and-forth movement of the read heads. – Michael Jul 07 '12 at 06:14
  • It is the WriteLine to the console that takes all the time. I doubt that Console.WriteLine is a requirement; no one wants to read 2.2E6 rows from the console. Writing the relevant data to a file is much faster. My computer read the file in 1.8 seconds using File.ReadAllLines, and it took 148 seconds to write it to the console (a very useless accomplishment in my opinion) – Casperah Jul 07 '12 at 07:36
  • any final solution with full source code? – Kiquenet Aug 14 '12 at 09:22

5 Answers

0

I'm not certain whether this is more efficient or not, but an alternate method would be to use File.ReadAllLines:

var movieFile = File.ReadAllLines(file);
foreach (var movie in movieFile)
    Console.WriteLine(movie);
Grant Winney
  • I tried it, but this actually takes more time to iterate through the lines: 18 seconds to load all the lines into memory, and the iteration part is still going on (already past the 20-minute mark) – Dave Jul 07 '12 at 04:48
  • Ouch! Interesting though. I downloaded the file to see how big it was. 100 MB file containing over 2.26 million lines. – Grant Winney Jul 07 '12 at 05:03
  • Correct (movieFile.Length was 2265140). It finally finished execution after 22 mins. One advantage (sort of) of loading all lines into memory is that you get the total line count, but I am still looking for a solution where you can read the file in multiple directions and improve performance. – Dave Jul 07 '12 at 05:08
0

In .NET 4 you can use File.ReadLines, which evaluates lazily and therefore uses less RAM when working on large files.

You can run LINQ operations directly on the file's lines, and combined with File.ReadLines this should improve load time.
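To illustrate the lazy evaluation described above, here is a minimal sketch. The class name `LazyLineReader` and its two helper methods are made up for this example; only `File.ReadLines` and the LINQ operators are real framework calls.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper class, for illustration only.
public static class LazyLineReader
{
    // Counts the lines containing `term` (case-insensitive).
    // File.ReadLines yields one line at a time, so the full file
    // is never held in memory as a string[].
    public static int CountMatches(string path, string term)
    {
        return File.ReadLines(path)
                   .Count(line => line.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0);
    }

    // Returns the first `count` non-empty lines. Enumeration stops
    // as soon as enough lines are found instead of reading to the end.
    public static List<string> FirstLines(string path, int count)
    {
        return File.ReadLines(path)
                   .Where(line => line.Length > 0)
                   .Take(count)
                   .ToList();
    }
}
```

For example, `LazyLineReader.CountMatches(@"D:\movies.list", "alien")` would scan the file once without building a 2.2-million-element array first. Whether this is actually faster than the StreamReader loop is something you would need to measure.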

For a better understanding, see "Read text file word-by-word using LINQ".

You can also compare the approaches by timing each one.

However, if you are making a web app, you can read the whole file in the application start event and cache the lines in the application pool for better performance.

Jigar Pandya
  • Interesting. Curious about how to iterate through bytes as compared to strings? Ideas? – Dave Jul 07 '12 at 05:12
0

I am not a C# developer, but how about doing a one-time bulk insert of the file into a database? Then you can reuse the data and export it as well.

0

The answer to this question really depends on what you will be doing with the data. If your intention truly is to just read in the file and dump the contents to the console, then it would be better to use the StringBuilder class to build up a string of, say, 1000 lines, dump it to the screen, reset the builder, then read in another 1000 lines, dump them, and so on.
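A rough sketch of that batching idea, assuming nothing beyond the description above: the class name `BatchedWriter` and the `batchSize` parameter are invented for illustration, and the method writes to any `TextWriter` so it can be pointed at `Console.Out` or at a file.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class BatchedWriter
{
    // Copies every line of `path` to `output`, but issues only one
    // Write call per `batchSize` lines. Returns the number of lines copied.
    public static int Dump(string path, TextWriter output, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");

        var sb = new StringBuilder();
        int total = 0;
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                sb.AppendLine(line);
                total++;
                if (total % batchSize == 0)
                {
                    output.Write(sb.ToString()); // one write for the whole batch
                    sb.Clear();                  // reuse the same builder
                }
            }
        }
        // Write whatever is left over from the last partial batch.
        if (sb.Length > 0) output.Write(sb.ToString());
        return total;
    }
}
```

Calling `BatchedWriter.Dump(@"D:\movies.list", Console.Out, 1000)` would print the file in 1000-line chunks. Note that this only reduces the number of write calls; the console still has to render every line.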

However, if you are trying to build something that is part of a larger project and you are using .NET 4.0, you can use the MemoryMappedFile class to map the file and call CreateViewAccessor to create a "window" that operates on just a portion of the data instead of reading in the entire file.
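Here is a minimal sketch of that windowing approach, under stated assumptions: the class name `MappedLineCounter` and the `windowSize` parameter are hypothetical, and counting `\n` bytes simply stands in for whatever per-window processing you actually need.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Hypothetical helper class, for illustration only.
public static class MappedLineCounter
{
    // Counts '\n' bytes by mapping the file and walking it one window at a time.
    public static long CountNewlines(string path, int windowSize)
    {
        if (windowSize <= 0) throw new ArgumentOutOfRangeException("windowSize");

        long fileLength = new FileInfo(path).Length;
        if (fileLength == 0) return 0; // an empty file cannot be mapped

        long count = 0;
        var buffer = new byte[windowSize];
        // Map the file read-only; capacity 0 means "use the file's size".
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            for (long offset = 0; offset < fileLength; offset += windowSize)
            {
                // The last window may be shorter than windowSize.
                int size = (int)Math.Min(windowSize, fileLength - offset);
                using (var view = mmf.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
                {
                    int read = view.ReadArray(0, buffer, 0, size);
                    for (int i = 0; i < read; i++)
                    {
                        if (buffer[i] == (byte)'\n') count++;
                    }
                }
            }
        }
        return count;
    }
}
```

In practice a window of a few megabytes (for example `MappedLineCounter.CountNewlines(path, 4 * 1024 * 1024)`) is a reasonable starting point; the tiny windows are only useful for testing. Lines can straddle a window boundary, so anything that needs whole lines has to carry the partial line over to the next window.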

Another option would be to create threads that read different parts of the file at the same time, then put it all together at the end.

If you can be more specific as to what you plan to do with this data, I can help you more. Hope this helps!

EDIT:

Try this code out, man. I was able to read the whole list in literally 3 seconds using threads:

using System;
using System.IO;
using System.Text;
using System.Threading;

namespace ConsoleApplication36
{
    class Program
    {
        private const string FileName = @"C:\Users\Public\movies.list";
        private const long ThreadReadBlockSize = 50000;
        private const int NumberOfThreads = 4;
        private static byte[] _inputString;

        static void Main(string[] args)
        {
            var fi = new FileInfo(FileName);
            long totalBytesRead = 0;
            long fileLength = fi.Length;
            long readPosition = 0L;
            Console.WriteLine("Reading Lines From {0}", FileName);
            var threads = new Thread[NumberOfThreads];
            var instances = new ReadThread[NumberOfThreads];
            _inputString = new byte[fileLength];

            while (totalBytesRead < fileLength)
            {
                for (int i = 0; i < NumberOfThreads; i++)
                {
                    var rt = new ReadThread { StartPosition = readPosition, BlockSize = ThreadReadBlockSize };
                    instances[i] = rt;
                    threads[i] = new Thread(rt.Read);
                    threads[i].Start();
                    readPosition += ThreadReadBlockSize;
                }
                for (int i = 0; i < NumberOfThreads; i++)
                {
                    threads[i].Join();
                }
                for (int i = 0; i < NumberOfThreads; i++)
                {
                    if (instances[i].BlockSize > 0)
                    {
                        Array.Copy(instances[i].Output, 0L, _inputString, instances[i].StartPosition,
                                   instances[i].BlockSize);
                        totalBytesRead += instances[i].BlockSize;
                    }
                }
            }

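            string finalString = Encoding.ASCII.GetString(_inputString);
            // Print a 50,000-character sample from near the end of the file.
            // Substring throws if the file is shorter than offset + length, so adjust these for your file.
            Console.WriteLine(finalString.Substring(104250000, 50000));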
        }

        private class ReadThread
        {
            public long StartPosition { get; set; }
            public long BlockSize { get; set; }
            public byte[] Output { get; private set; }

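            public void Read()
            {
                Output = new byte[BlockSize];
                // The using block disposes the stream even if Seek or Read throws.
                using (var inStream = new FileStream(FileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                {
                    inStream.Seek(StartPosition, SeekOrigin.Begin);
                    // Read returns 0 once StartPosition is at or past the end of the file.
                    BlockSize = inStream.Read(Output, 0, (int)BlockSize);
                }
            }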
        }
    }
}

You will need to change FileName to match the location of your movies.list file. You can also adjust the total number of threads. I used 4, but you can decrease or increase this at will. You can also change the block size, which is how much data each thread reads in one pass. Also, I'm assuming it's an ASCII text file. If it's not, you need to change the encoding to UTF8 or whatever encoding the file is in. Good luck!

Icemanind
  • I will try StringBuilder approach and post the findings. But I am more inclined to explore MemoryMappedFile and Multithreading approach. For a moment you can assume that you don't have to display data on the screen. Also reading order is not an issue at this point in time. – Dave Jul 07 '12 at 15:40
0

First of all, if you don't care about printing out the list to console, please edit your question.

Second, I created a timing program to test the speeds of the different methods suggested:

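using System;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

class Program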
{
    private static readonly string file = @"movies.list";

    private static readonly int testStart = 1;
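    // Must match the number of entries in testNames and actionArray;
    // a smaller value makes the actionArray assignments in Main throw IndexOutOfRangeException.
    private static readonly int numOfTests = 7;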
    private static readonly int MinTimingVal = 1000;

    private static string[] testNames = new string[] {            
        "Naive",
        "OneCallToWrite",
        "SomeCallsToWrite",
        "InParallel",
        "InParallelBlocks",
        "IceManMinds",
        "TestTiming"
        };

    private static double[] avgSecs = new double[numOfTests];

    private static int[] testIterations = new int[numOfTests];

    public static void Main(string[] args)
    {
        Console.WriteLine("Starting tests...");
        Debug.WriteLine("Starting tests...");

        Console.WriteLine("");
        Debug.WriteLine("");

        //*****************************
        //The console is the bottle-neck, so we can
        //speed-up redrawing it by only showing 1 line at a time.
        Console.WindowHeight = 1;
        Console.WindowWidth = 50;

        Console.BufferHeight = 100;
        Console.BufferWidth = 50;
        //******************************

        Action[] actionArray = new Action[numOfTests];

        actionArray[0] = naive;
        actionArray[1] = oneCallToWrite;
        actionArray[2] = someCallsToWrite;
        actionArray[3] = inParallel;
        actionArray[4] = inParallelBlocks;
        actionArray[5] = iceManMinds;
        actionArray[6] = testTiming;


        for (int i = testStart; i < actionArray.Length; i++)
        {
            Action a = actionArray[i];
            DoTiming(a, i);
        }

        printResults();

        Console.WriteLine("");
        Debug.WriteLine("");

        Console.WriteLine("Tests complete.");
        Debug.WriteLine("Tests complete.");

        Console.WriteLine("Press Enter to Close Console...");
        Debug.WriteLine("Press Enter to Close Console...");

        Console.ReadLine();
    }

    private static void DoTiming(Action a, int num)
    {
        a.Invoke();

        Stopwatch watch = new Stopwatch();
        Stopwatch loopWatch = new Stopwatch();

        bool shouldRetry = false;

        int numOfIterations = 2;

        do
        {
            watch.Start();

            for (int i = 0; i < numOfIterations; i++)
            {
                a.Invoke();
            }

            watch.Stop();

            shouldRetry = false;

            if (watch.ElapsedMilliseconds < MinTimingVal) //if the time was less than the minimum, increase load and re-time.
            {
                shouldRetry = true;
                numOfIterations *= 2;
                watch.Reset();
            }

        } while (shouldRetry);

        long totalTime = watch.ElapsedMilliseconds;

        double avgTime = ((double)totalTime) / (double)numOfIterations;

        avgSecs[num] = avgTime / 1000.00;
        testIterations[num] = numOfIterations;
    }

    private static void printResults()
    {
        Console.WriteLine("");
        Debug.WriteLine("");

        for (int i = testStart; i < numOfTests; i++)
        {
            TimeSpan t = TimeSpan.FromSeconds(avgSecs[i]);

            // Format once: Debug.WriteLine(string, string) would treat the second argument as a category.
            string line = string.Format("ElapsedTime: {0}, test: {1}", t, testNames[i]);

            Console.WriteLine(line);
            Debug.WriteLine(line);
        }
    }

    public static void naive()
    {
        FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.None, 8, FileOptions.None);

        using (StreamReader sr = new StreamReader(fs))
        {
            while (sr.Peek() >= 0)
            {
                 Console.WriteLine( sr.ReadLine() );

            }
        }
    }

    public static void oneCallToWrite()
    {
        FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.None, 8, FileOptions.None);

        using (StreamReader sr = new StreamReader(fs))
        {
            StringBuilder sb = new StringBuilder();

            while (sr.Peek() >= 0)
            {
                string s = sr.ReadLine();

                sb.Append("\n" + s);
            }

            Console.Write(sb);
        }
    }

    public static void someCallsToWrite()
    {
        FileStream fs = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.None, 8, FileOptions.None);

        using (StreamReader sr = new StreamReader(fs))
        {
            StringBuilder sb = new StringBuilder();
            int count = 0;
            int mod = 10000;

            while (sr.Peek() >= 0)
            {
                count++;

                string s = sr.ReadLine();

                sb.Append("\n" + s);

                if (count % mod == 0)
                {
                    Console.Write(sb);
                    sb = new StringBuilder();
                }
            }

            Console.Write( sb );
        }
    }

    public static void inParallel()
    {
        string[] wordsFromFile = File.ReadAllLines( file );

        int length = wordsFromFile.Length;

        Parallel.For( 0, length, i => {

            Console.WriteLine( wordsFromFile[i] );

        });

    }

    public static void inParallelBlocks()
    {
        string[] wordsFromFile = File.ReadAllLines(file);

        int length = wordsFromFile.Length;

        Parallel.For<StringBuilder>(0, length,
            () => { return new StringBuilder(); },
            (i, loopState, sb) =>
            {
                sb.Append("\n" + wordsFromFile[i]);
                return sb;
            },
            (x) => { Console.Write(x); }
        );

    }

    #region iceManMinds

    public static void iceManMinds()
    {
        string FileName = file;
        long ThreadReadBlockSize = 50000;
        int NumberOfThreads = 4;
        byte[] _inputString;


        var fi = new FileInfo(FileName);
        long totalBytesRead = 0;
        long fileLength = fi.Length;
        long readPosition = 0L;
        Console.WriteLine("Reading Lines From {0}", FileName);
        var threads = new Thread[NumberOfThreads];
        var instances = new ReadThread[NumberOfThreads];
        _inputString = new byte[fileLength];

        while (totalBytesRead < fileLength)
        {
            for (int i = 0; i < NumberOfThreads; i++)
            {
                var rt = new ReadThread { StartPosition = readPosition, BlockSize = ThreadReadBlockSize };
                instances[i] = rt;
                threads[i] = new Thread(rt.Read);
                threads[i].Start();
                readPosition += ThreadReadBlockSize;
            }
            for (int i = 0; i < NumberOfThreads; i++)
            {
                threads[i].Join();
            }
            for (int i = 0; i < NumberOfThreads; i++)
            {
                if (instances[i].BlockSize > 0)
                {
                    Array.Copy(instances[i].Output, 0L, _inputString, instances[i].StartPosition,
                               instances[i].BlockSize);
                    totalBytesRead += instances[i].BlockSize;
                }
            }
        }

        string finalString = Encoding.ASCII.GetString(_inputString);
        Console.WriteLine(finalString);//.Substring(104250000, 50000));
    }

    private class ReadThread
    {
        public long StartPosition { get; set; }
        public long BlockSize { get; set; }
        public byte[] Output { get; private set; }

        public void Read()
        {
            Output = new byte[BlockSize];
            var inStream = new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
            inStream.Seek(StartPosition, SeekOrigin.Begin);
            BlockSize = inStream.Read(Output, 0, (int)BlockSize);
            inStream.Close();
        }
    }

    #endregion

    public static void testTiming()
    {
        Thread.Sleep(500);
    }
}

Each of these tests prints the file out to the console.

When run under the default console settings, each test took between 5:30 and 6:10 (Min:Sec).

After adjusting the console properties by setting Console.WindowHeight = 1, so that only 1 line is shown at a time (you can scroll up and down to see the most recent 100 lines), I achieved a speed-up.

Currently, the task completes in just a little over 2:40 (Min:Sec) for most methods.

Try it out on your computer and see how it works for you.

Interestingly enough, the different methods were basically equivalent, with the OP's code being basically the fastest.

The timing code warms up each method, then runs it twice (more if the runs finish in under a second) and averages the time taken.

Feel free to try out your own methods and time them.

Xantix
  • Thanks for the great response. I am a bit tied up these days but will share the results after executing your sample code. Meanwhile I have updated the original question and made writing to the Windows console desirable but not mandatory. – Dave Jul 20 '12 at 12:53