3

I am learning LINQ, and I want to read a text file (let's say an e-book) word by word using LINQ.

This is what I could come up with:

static void Main()
{
    string[] content = File.ReadAllLines("text.txt");

    // "select c" yields each line; "select content" would yield the whole array once per line
    var query = from c in content
                select c;

    foreach (var line in query)
    {
        Console.Write(line + "\n");
    }
}

This reads the file line by line. If I change ReadAllLines to ReadAllText, the file is read character by character.

Any ideas?

Deepak

6 Answers

3
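string[] content = File.ReadAllLines("text.txt");
// Split every line on spaces and flatten the results into a single sequence of words
var words = content.SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
foreach (string word in words)
{
    // process each word here
}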

You'll need to add whatever other whitespace characters your text contains (tabs, for example) to the separators passed to Split. Using StringSplitOptions to deal with consecutive whitespace is cleaner than the Where clause I originally used.

In .NET 4 and later you can use File.ReadLines instead of File.ReadAllLines for lazy evaluation, and thus lower memory usage, when working on large files.
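
As a minimal sketch of the lazy variant, assuming C# 9 top-level statements and a hypothetical sample file written to the temp directory so the snippet is self-contained (the file name and its contents are made up for illustration):

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical sample file, created only so the sketch can run on its own.
string path = Path.Combine(Path.GetTempPath(), "linq-words-sample.txt");
File.WriteAllLines(path, new[] { "the quick  brown", "fox jumps" });

// File.ReadLines streams the file; no line is read until enumeration starts.
var words = File.ReadLines(path)
    .SelectMany(line => line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries));

// Materialize once so the file is enumerated a single time.
string[] result = words.ToArray();
Console.WriteLine(string.Join("|", result)); // prints "the|quick|brown|fox|jumps"
```

Because the query is deferred, enumerating `words` twice would read the file twice; call ToArray or ToList only if you need to reuse the results.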

CodesInChaos
  • the problem with this is that words in the next line are appended without space to the last word of the previous line. – Deepak Oct 14 '10 at 11:40
  • Why is that a problem? The ReadAllLines function should already split these apart. And then the SelectMany splits each line even further. And the Where clause deals with consecutive whitespaces. – CodesInChaos Oct 14 '10 at 11:43
  • I think I'd prefer to split on `new Regex(@"[^\w'-]")` to catch most non-word chars but keep ' and - within words intact. If you aren't in .NET 4, you can also write your own lazy-evaluated ReadLines from a TextReader as `for(string line = rdr.ReadLine(); line != null; line = rdr.ReadLine())yield return line;` – Jon Hanna Oct 14 '10 at 12:53
1
string str = File.ReadAllText("text.txt");
char[] separators = { '\n', '\r', '\t', ',', '.', ' ', '"' };    // add your own
var words = str.Split(separators, StringSplitOptions.RemoveEmptyEntries);
Grozz
0

The following uses an iterator block, so execution is deferred and the file is read only as the words are consumed. Solutions based on ReadAllLines or ReadAllText load the entire file into memory before you can iterate over the words.

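static IEnumerable<string> GetWords(string path)
{
    foreach (var line in File.ReadLines(path))
    {
        // A null separator array means "split on any whitespace character"
        foreach (var word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            yield return word;
        }
    }
}

(Passing a null separator array makes Split use whitespace as the delimiter. It does not drop the empty strings produced by consecutive whitespace on its own, which is why StringSplitOptions.RemoveEmptyEntries is passed as well.)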

Use it like this:

foreach (var word in GetWords(@"text.txt")){
    Console.WriteLine(word);
}

It also composes with the standard LINQ operators:

var firstWords = GetWords(@"text.txt").Take(25);
var longWords = GetWords(@"text.txt").Where(w => w.Length > 3);

Error handling is left out to keep the example short.

TinyTimZamboni
0
string content = File.ReadAllText("Text.txt");

var words = from word in content.Split(WhiteSpace, StringSplitOptions.RemoveEmptyEntries)
            select word;

You will need to define the array of whitespace chars with your own values, like so:

char[] WhiteSpace = { '\r', '\n', ' ', '\t' };

This code treats punctuation (such as a trailing comma) as part of the word.

Neowizard
0

It's probably better to read all the text using ReadAllText() and then use a regular expression to extract the words. Splitting on the space character alone can cause trouble, because punctuation (commas, periods, etc.) stays attached to the words. For example:

string text = File.ReadAllText("text.txt");
// Adjust the pattern to fit your needs
Regex re = new Regex("[a-zA-Z0-9_-]+", RegexOptions.Compiled);
Match m = re.Match(text);
while (m.Success)
{
    string word = m.Value; // the pattern has no capture groups, so read the whole match

    // do your processing here

    m = m.NextMatch();
}
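
Since the question asks for LINQ, the same idea can be sketched with Regex.Matches feeding a query. This assumes C# 9 top-level statements; the sample text and the pattern (word characters plus apostrophes and hyphens) are illustrative choices, not part of the answer above:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up sample text standing in for the contents of the file.
string text = "Hello, world! It's a well-known e-book.";

// Assumed pattern: runs of word characters, apostrophes, and hyphens.
var wordPattern = new Regex(@"[\w'-]+");

string[] words = wordPattern.Matches(text)
    .Cast<Match>()          // MatchCollection is non-generic on older frameworks
    .Select(m => m.Value)   // the full matched text of each match
    .ToArray();

Console.WriteLine(string.Join("|", words)); // prints "Hello|world|It's|a|well-known|e-book"
```

Cast<Match>() is needed on .NET Framework, where MatchCollection only implements the non-generic IEnumerable; it is harmless on newer runtimes.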
Waleed Eissa
-1

You could write content.ToList().ForEach(p => p.Split(' ').ToList().ForEach(Console.WriteLine)); but that isn't really LINQ, since List<T>.ForEach is not a LINQ operator.