How to read text word by word

Question

I'm working with a .txt or .htm file. Currently I go through the document character by character with a for loop, but I need to go through the text word by word, and then through each word character by character. How can I do this?

for (int i = 0; i < text.Length; i++)
{}

You need a way of delimiting words within your file. Whitespace would potentially work, but I can see issues with punctuation etc. — DGibbs, Mar 05 '13 at 17:05
Use regular expressions to match on a pattern which presents a word. Then search the match char by char — Alan, Mar 05 '13 at 17:06
What do you class as a word? Specifically when looking at an HTML file? — Ash Burlaczenko, Mar 05 '13 at 17:06
Yes, that's what I was thinking about whitespace, but it gets harder when working with HTML files. So I thought maybe somebody has a better solution. — Hurrem, Mar 05 '13 at 17:07
define "word", what about hyphens, apostrophes and other non letters? — Jodrell, Mar 05 '13 at 17:08
@Alan I wouldn't recommend [parsing a .htm file with a regular expression](http://stackoverflow.com/a/1732454/1895201). — DGibbs, Mar 05 '13 at 17:08
read all the file then use split with rules you choose, you may need some regular expressions. — Alaa Jabre, Mar 05 '13 at 17:08
This is called Tokenization: http://en.wikipedia.org/wiki/Tokenization. It can be a complex subject, depending on your source input. — Polyfun, Mar 05 '13 at 17:08
A word is a regular word that we use to express our mind :) So I want to get the text word by word without those tags. I'm not searching, I just convert the text in that file to something else. — Hurrem, Mar 05 '13 at 17:09
If you want to get the text content of an html stream you will need an HTML parser, don't try and do this with a regex. — Jodrell, Mar 05 '13 at 17:10
http://htmlagilitypack.codeplex.com/ is a good HTML parser to use with .Net — Jodrell, Mar 05 '13 at 17:13
@DGibbs Why not? He also said he wants to parse a text file, which a regular expression should be fine for. Depending on how he defines a word, it may work for both. He's not trying to parse an HTM file, he is trying to search for characters in a "word". These are different things. — Alan, Mar 05 '13 at 17:24
@Alan It would probably work fine for a text file, but I think it's safe to assume that his .htm file contains HTML markup, which would become very tricky to parse with a regular expression. — DGibbs, Mar 05 '13 at 17:25

score 5 · Answer 1 · edited Nov 28 '17 at 17:35

A simple approach is to use string.Split without arguments (it splits on white-space characters):

using (StreamReader sr = new StreamReader(path)) 
{
    while (sr.Peek() >= 0) 
    {
        string line = sr.ReadLine();
        string[] words = line.Split();
        foreach(string word in words)
        {
            foreach(Char c in word)
            {
                // ...
            }
        }
    }
}

I've used StreamReader.ReadLine to read the entire line.

To parse HTML I would use a robust library like HtmlAgilityPack.
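One caveat with the parameterless Split() in the plain-text case: consecutive white-space characters produce empty strings in the result, so "a  b" yields three entries. A minimal sketch of one way around that, assuming you want those empty entries dropped (the names WordSplitting and SplitWords are made up for illustration):

```csharp
using System;

public static class WordSplitting
{
    // Hypothetical helper: splits a line on any white-space character and
    // drops the empty strings that consecutive separators would produce.
    public static string[] SplitWords(string line)
    {
        // A null separator array means "split on white-space characters".
        return line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Inside the reading loop you would then write something like `foreach (string word in WordSplitting.SplitWords(line))` instead of iterating over `line.Split()`.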

score 2 · Answer 2 · answered Mar 05 '13 at 17:08

You can split the string on whitespace, but you will have to deal with punctuation and HTML markup (you said you were working with txt and htm files).

string[] tokens = text.Split(); // Split() with no arguments splits on white space
foreach(string tok in tokens)
{
    // process tok string here
}
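For the punctuation part, here is a rough sketch that strips leading and trailing punctuation from each token while leaving inner characters (such as the apostrophe in "don't") alone. It assumes char.IsPunctuation is a good enough definition of punctuation for your text, and it does nothing about HTML markup. PunctuationTrimming, CleanTokens and TrimPunctuation are illustrative names, not part of any library:

```csharp
using System;
using System.Collections.Generic;

public static class PunctuationTrimming
{
    // Hypothetical helper: splits on white space, trims punctuation from
    // both ends of every token, and skips tokens that end up empty.
    public static List<string> CleanTokens(string text)
    {
        var result = new List<string>();
        foreach (string tok in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string word = TrimPunctuation(tok);
            if (word.Length > 0)
                result.Add(word);
        }
        return result;
    }

    // Removes punctuation characters from the start and end of a token only.
    public static string TrimPunctuation(string token)
    {
        int start = 0, end = token.Length;
        while (start < end && char.IsPunctuation(token[start])) start++;
        while (end > start && char.IsPunctuation(token[end - 1])) end--;
        return token.Substring(start, end - start);
    }
}
```

Note that char.IsPunctuation does not cover symbol characters such as '$' or '+', so those would stay attached to the word.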

K. R. · Answer 3 · 2014-05-12T15:41:18.783

Here's my implementation of a lazy extension method for StreamReader. The idea is to avoid loading the entire file into memory, especially if your file is a single long line.

// requires: using System; using System.IO; using System.Text;
public static class StreamReaderExtensions
{
    public static string ReadWord(this StreamReader stream)
    {
        var word = new StringBuilder();
        // read a single character at a time, building a word
        // until reaching whitespace or end of stream (-1)
        while (stream.Read()
           .With(c => { // with each character . . .
                // StreamReader.Read() already returns a decoded character,
                // so a cast is enough once -1 has been ruled out
                if (c == -1 || Char.IsWhiteSpace((char)c))
                    return -1; // signal end of word

                word.Append((char)c); // append the char to our word
                return c;
           }) > -1) { } // loop ends when the lambda returns -1
        return word.ToString(); // empty if two whitespace characters were adjacent
    }

    public static T With<T>(this T obj, Func<T, T> f)
    {
        return f(obj);
    }
}

To use it, simply:

using (var s = File.OpenText(file))
{
    while (!s.EndOfStream)
        s.ReadWord().ToCharArray().DoSomething(); // DoSomething is a placeholder for your own processing
}
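A variation on the same idea, as a sketch rather than a drop-in replacement: an iterator that yields one word at a time from any TextReader and skips runs of whitespace, so no empty words are produced. WordReader and ReadWords are illustrative names; "word" here simply means a maximal run of non-whitespace characters.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordReader
{
    // Hypothetical helper: lazily yields whitespace-delimited words,
    // reading one character at a time so the whole input is never buffered.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var word = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1) // -1 signals end of input
        {
            if (char.IsWhiteSpace((char)c))
            {
                // Whitespace ends the current word, if there is one.
                if (word.Length > 0)
                {
                    yield return word.ToString();
                    word.Clear();
                }
            }
            else
            {
                word.Append((char)c);
            }
        }
        // Emit the last word when the input does not end in whitespace.
        if (word.Length > 0)
            yield return word.ToString();
    }
}
```

With a file you could call it as `foreach (string word in WordReader.ReadWords(File.OpenText(path)))`, ideally with the reader wrapped in a using block, and then loop over the characters of each word.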

score 0 · Answer 4 · answered Mar 05 '13 at 17:07

Use text.Split(' ') to split the text on spaces into an array of words, then iterate through that.

So

foreach(String word in text.Split(' '))
   foreach(Char c in word)
      Console.WriteLine(c);

mdubez


score 0 · Answer 5 · answered Mar 05 '13 at 17:07

You could split on spaces:

string[] words = text.Split(' ');

This will give you an array of words, which you can then iterate over:

foreach (string word in words)
{
    // do something with each word
}

Steve's a D


score 0 · Answer 6 · answered Mar 05 '13 at 17:07

I think you can use Split:

var words = reader.ReadToEnd().Split(' ');

or use

foreach (string word in text.Split(' '))
    foreach (char c in word)
    {
        // process each character here
    }

score 0 · Answer 7 · edited May 23 '17 at 11:51

You can get all the text from some HTML with the HtmlAgilityPack. If you think this is overkill look here.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);

foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    var nodeText = node.InnerText;
}

Then you can split each node's text content into words, once you define what a word is.

Maybe like this:

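using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static IEnumerable<string> WordsInHtml(string text)
{
    // Split on a separator character plus any non-letters around it.
    var splitter = new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(text);

    // SelectNodes can return null when no text nodes are found.
    var textNodes = doc.DocumentNode.SelectNodes("//text()");
    if (textNodes == null)
        yield break;

    foreach (HtmlNode node in textNodes)
    {
        foreach (var word in splitter.Split(node.InnerText))
        {
            if (word.Length > 0) // Regex.Split can yield empty strings
                yield return word;
        }
    }
}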

Then, to examine the chars in each word:

foreach(var word in WordsInHtml(text))
{
    foreach(var c in word)
    {
        // an enumeration by word, then by char.
    }
}

score 0 · Answer 8 · answered Mar 05 '13 at 17:39

What about regular expressions?

using System;
using System.Linq;
using System.Text.RegularExpressions;

namespace ConsoleApplication58
{
    class Program
    {
        static void Main()
        {
            string input =
                @"I'm working with a txt or htm file. And currently I'm looking up the document char by char, using for loop, but I need to look up the text word by word, and then inside the word char by char. How can I do this?";
            var list = from Match match in Regex.Matches(input, @"\b\S+\b")
                       select match.Value; //Get IEnumerable of words
            foreach (string s in list) 
                Console.WriteLine(s); //doing something with it
            Console.ReadKey();
        }
    }
}

It works with any delimiters, and it's the fastest way to do it, as far as I know.
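If you would rather spell out what counts as a word than rely on \b and \S, a pattern built from Unicode letter and digit classes is one option. The sketch below assumes a word is a run of letters or digits, optionally joined by single apostrophes or hyphens; that definition, and the names RegexWords and Words, are my own choices for illustration and are meant for plain text, not HTML markup.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexWords
{
    // Letters/digits, optionally joined by an inner apostrophe or hyphen
    // (so "I'm" and "well-known" each count as one word).
    private static readonly Regex WordPattern =
        new Regex(@"[\p{L}\p{Nd}]+(?:['-][\p{L}\p{Nd}]+)*");

    // Hypothetical helper: yields every match of the pattern in order.
    public static IEnumerable<string> Words(string input)
    {
        foreach (Match m in WordPattern.Matches(input))
            yield return m.Value;
    }
}
```

Each returned word can then be walked character by character with a nested `foreach (char c in word)`.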
