Sunday, March 24, 2013

Reading large files in chunks with proper encapsulation

I've been doing some work lately that involves sequentially reading large files (upwards of 2 to 5 GB). At that size, loading the whole structure into memory is not an option; it's more reliable to process the file in chunks. I occasionally come across legacy code that solves exactly this problem, but in a procedural way, resulting in tangled spaghetti. To be honest, the first piece of software I ever wrote in a professional setting went at it the wrong way too.

There is no reason to let it come to this though; you can use the often overlooked yield return keyword to improve encapsulation.
When you use the yield keyword in a statement, you indicate that the method, operator, or get accessor in which it appears is an iterator. You consume an iterator method by using a foreach statement or LINQ query. Each iteration of the foreach loop calls the iterator method. When a yield return statement is reached in the iterator method, the expression is returned, and the current location in code is retained. Execution is restarted from that location the next time the iterator function is called.
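A tiny example makes that concrete (the Numbers method below exists purely for illustration): the body of an iterator only runs up to the next yield return each time the loop asks for another value.

// Execution pauses at each yield return and resumes from that exact
// spot the next time the foreach asks for a value.
static IEnumerable<int> Numbers()
{
    Console.WriteLine("producing 1");
    yield return 1;

    Console.WriteLine("producing 2");
    yield return 2;
}

// Prints: producing 1, 1, producing 2, 2. The values are produced
// lazily, one per loop iteration, not all up front.
foreach (var number in Numbers())
    Console.WriteLine(number);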
Have a look at the following Reader class, which takes advantage of yield return. It reads the file line by line, building up a chunk, and returns that chunk once the desired chunk size is reached. On the next iteration, the call continues right after the yield: it clears the lines, thereby releasing memory, and starts building the next chunk.
public class Reader
{
    private readonly int _chunkSize;

    public Reader(int chunkSize)
    {
        _chunkSize = chunkSize;
    }

    public IEnumerable<Chunk> Read(string path)
    {
        if (string.IsNullOrEmpty(path))
            throw new ArgumentException("A path is required.", "path");

        var lines = new List<string>();

        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);

                // Hand over a chunk as soon as enough lines are buffered.
                if (lines.Count == _chunkSize)
                {
                    yield return new Chunk(lines);

                    // The same list backs the chunk that was just yielded,
                    // so consume it before the next iteration clears it.
                    lines.Clear();
                }
            }
        }

        // Don't forget the last, possibly smaller, chunk.
        if (lines.Count > 0)
            yield return new Chunk(lines);
    }
}

public class Chunk
{
    public Chunk(List<string> lines) 
    {
        Lines = lines;
    }

    public List<string> Lines { get; private set; }
}
Consuming the reader is then just a matter of iterating over the chunks, and that's one way to achieve clean encapsulation without starving your machine of memory.
var reader = new Reader(chunkSize: 1000);
var chunks = reader.Read(@"C:\big_file.txt");

foreach (var chunk in chunks)            
    Console.WriteLine(chunk.Lines.Count);

8 comments:

  1. Initialize your list with the chunksize (capacity). If you really want to get your hands dirty, allocate an array :-)

  2. On a different note ... using yield, IEnumerable and a custom IEnumerator you can weave multiple enumerables or do paged/chunked reads on another resource while still keeping an endless streaming experience for the consumer. Adapter-style so to speak.

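    To illustrate that adapter idea, here is a possible sketch (the InChunksOf extension method is made up, not from the post) that lazily groups any IEnumerable<T> into chunks while the consumer keeps a plain streaming foreach:

public static class EnumerableExtensions
{
    // Lazily groups any sequence into lists of at most chunkSize items;
    // the source is only pulled as far as the consumer iterates.
    public static IEnumerable<List<T>> InChunksOf<T>(this IEnumerable<T> source, int chunkSize)
    {
        var buffer = new List<T>(chunkSize);

        foreach (var item in source)
        {
            buffer.Add(item);

            if (buffer.Count == chunkSize)
            {
                yield return buffer;
                buffer = new List<T>(chunkSize);
            }
        }

        // Whatever is left over becomes the final, smaller chunk.
        if (buffer.Count > 0)
            yield return buffer;
    }
}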
  3. Ok, and now extend this to read fast from any place in the file, for example to quickly read the end of the file.

    Reply: Looking for memory mapped files? http://msdn.microsoft.com/en-us/library/dd997372.aspx
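    A rough sketch of what that could look like, assuming a memory-mapped view over the tail of the file (the ReadTail method and its parameters are made up for illustration):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class Tail
{
    // Reads roughly the last tailSize bytes of a (non-empty) file through
    // a read-only memory-mapped view; the view may start mid-line.
    public static string ReadTail(string path, long tailSize)
    {
        var fileLength = new FileInfo(path).Length;
        var offset = Math.Max(0, fileLength - tailSize);

        using (var file = MemoryMappedFile.CreateFromFile(path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var view = file.CreateViewStream(offset, fileLength - offset, MemoryMappedFileAccess.Read))
        using (var reader = new StreamReader(view))
        {
            return reader.ReadToEnd();
        }
    }
}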
  4. Great article!
    It has provided me with inspiration! I have a little app that reads large CSV files; going to check if I can implement something like the above in it.

    Keep up the good work!

  5. No idea if it can be useful for you, but I liked the Spring Batch framework documentation while I was doing some batch processing (of larger files):

    http://static.springsource.org/spring-batch/reference/html/
