I have a list of names and a text, and I want to:

  1. Find all list items that also appear in the text (matched words) and store them in a list or array
  2. Replace every matched word with "Names"
  3. Count the matched words

Code:

string[] Names = new string[] { "SNOW","Jhon Snow","ADEMS","RONALDO",
"AABY", "AADLAND", "ANGE", "GEEN", "KHA", "AN", "ANG", "EE", "GEE", "HA", "HAN", "KHAN", 
"LA", "LAN", "LAND", "NG", "SA", "SAN", "SANG", "LAN","HAN", "LAN", "SANG", "SANG",
"Sangeen Khan"};

string Text = "I am Sangeen Khan and i am friend AABY. Jhon is friend of AABY. " +
    "AADLAND is good boy and he never speak lie. AABY is also good. SANGEEN KHAN is my name.";

List<string> matchedWords = Names.Where(Text.Contains).ToList();  
matchedWords.ForEach(w => Text = Regex.Replace(Text, "\\b" + w + "\\b", 
"Names", RegexOptions.IgnoreCase));
int numMatchedWords = matchedWords.Count;

Console.WriteLine($"Matched Words: {string.Join(",", matchedWords.ToArray())}");
Console.WriteLine($"Count: {numMatchedWords}");
Console.WriteLine($"Replaced Text: {Text}");

Output:

Matched Words: AABY, AADLAND, ANGE, GEEN, KHA, AN, ANG, EE, GEE, HA, HAN, KHAN, LA, LAN, LAND, NG, SA, SAN, SANG, LAN, HAN, LAN, SANG, SANG, Sangeen Khan

Replaced Text: I am Sangeen Names and i am friend Names. Jhon is friend of Names. Names is good boy and he never speak lie. Names is also good. SANGEEN Names is my name.

Count: 25

Problem: the code produces the wrong "Matched Words" list and the wrong number of replacements (Count). The replacement itself became correct after I read "String compare C# - whole word match".

My desired output would be:

Matched Words: Sangeen Khan, AABY, KHAN, AADLAND.

Replaced Text: I am Names and i am friend Names. jhon is friend of Names. Names is good boy and he never speak lie. Names is also good. Names KHAN is my name.

Count: 7

Patrick
  • Why do you include "LAND","LAND","SANG", "jh", "han", "ngee" in the names list, if you don't want to search for them? – enkryptor Aug 23 '17 at 16:39
  • Possible duplicate of [String compare C# - whole word match](https://stackoverflow.com/questions/3904645/string-compare-c-sharp-whole-word-match) –  Aug 23 '17 at 16:50
  • Those are already in the list; I only gave an example. In reality the names come from a database with about a million items. – Sangeen Khan Aug 23 '17 at 16:50
  • How about `matchedWords.ForEach(w => Text = Regex.Replace(Text, "\\b" + w + "\\b", "Names", RegexOptions.IgnoreCase));` – Alex K. Aug 23 '17 at 16:52
  • If your list of hits to find is corrupted with bad entries `"LAND","LAND","SANG", "jh", "han", "ngee"`, then you have to expect bad results. Or you need to cleanse the list before you start using it. Garbage in -> Garbage out – blaze_125 Aug 23 '17 at 16:52
  • I am trying to replace all names in the text. LAND is a surname, but in my case the text does not contain the word LAND; the code mistakenly matches substrings. – Sangeen Khan Aug 23 '17 at 16:57
  • There are problems with your samples. You said the output names include "SANGEEN", but this string does not appear in the Names list. (Remember that C# string comparison is CASE SENSITIVE.) The other problem is that your output string contains "... SNamesEN NamesN...". For that to be possible, the strings "ANGE" and "KHA" would have to be in the Names list, but they are not. Please check your question for these problems. – Jonny Piazzi Aug 23 '17 at 16:59
  • @JonnyPiazzi Thanks for correcting me, you are right. Let me correct it. – Sangeen Khan Aug 23 '17 at 17:10
  • @AlexK. I think that fixed the replacement, but there is still a problem with the matched list and the Count. – Sangeen Khan Aug 23 '17 at 17:15
  • I suggest you place your code in a C# runner and update in your question the correct output. – Jonny Piazzi Aug 23 '17 at 17:28

3 Answers

The problem is that you replace step by step. Let me explain. Say you have these values:

string[] Names = { "Khan", "se" };
string Text = "Senator Khane";

If you replace each name in turn with these inputs, you get:

"Senator NameNames"

Let's analyze the problem step by step. First, case sensitivity. String comparison in C# is case sensitive by default, which means that "Se" is different from "se". This is why the word "Senator" was never replaced.

The other problem is the "NameNames" part. Let's decompose the execution:

First:

Text = Text.Replace("Khan", "Names");

This sets Text to "Senator Namese". The next ForEach step is:

Text = Text.Replace("se", "Names");

So the 's' of "Names" plus the 'e' from "Khane" form a new match for "se", which gets replaced and produces the unwanted "NameNames".
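To see the effect in isolation, here is a minimal sketch of the chained replacement described above. The class and method names (`ChainedReplaceDemo`, `ReplaceSequentially`) are made up for illustration and are not from the question.

```csharp
using System;

public static class ChainedReplaceDemo
{
    // Hypothetical helper: replaces each name in turn, so the output of one
    // Replace call becomes the input of the next one.
    public static string ReplaceSequentially(string text, string[] names, string replacement)
    {
        foreach (var name in names)
        {
            // string.Replace is ordinal and case sensitive, so "se" never matches "Se".
            text = text.Replace(name, replacement);
        }
        return text;
    }

    public static void Main()
    {
        string[] names = { "Khan", "se" };
        // After "Khan" is replaced, the text is "Senator Namese", which now contains "se".
        Console.WriteLine(ReplaceSequentially("Senator Khane", names, "Names"));
        // prints "Senator NameNames"
    }
}
```

The second replacement only finds "se" because the first one created it, which is why doing all replacements in a single pass avoids the problem.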

Now that we understand the problem with your code, let's fix it.

The .NET Framework already has a class that does this kind of replacement for us. It is called:

System.Text.RegularExpressions.Regex

To use it, you first need to build a regex pattern. I won't go deep into how regex patterns are constructed; search for it if you need to, as it is a widely discussed subject on many forums.

var names = new string[] { "SNOW","Jhon Snow","ADEMS","RONALDO",
    "AABY", "AADLAND", "ANGE", "GEEN", "KHA", "AN", "ANG", "EE", "GEE", "HA", "HAN",
    "KHAN", "LA", "LAN", "LAND", "NG", "SA", "SAN", "SANG", "LAN",
    "HAN", "LAN", "SANG", "SANG", "Sangeen Khan" };

var text = "I am Sangeen Khan and i am friend AABY. Jhon is friend of AABY. " +
    "AADLAND is good boy and he never speak lie. " +
    "AABY is also good. SANGEEN KHAN is my name.";

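// Matches any single non-word character (e.g. the space in "Jhon Snow").
var letter = new Regex(@"(?<letter>\W)");

// One alternation over all names. Each name must be preceded and followed by
// a non-word character or the start/end of the text, and every non-word
// character inside a name is wrapped in [...] so it is matched literally.
var pattern = string.Join("|", names
    .Select(n => $@"((?<=(^|\W)){letter.Replace(n, "[${letter}]")}(?=($|\W)))"));

// The regex is case sensitive because no RegexOptions are passed.
var regex = new Regex(pattern);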

var matchedWords = regex
    .Matches(text)
    .Cast<Match>()
    .Select(m => m.Value)
    //.Distinct()
    .ToList();

text = regex.Replace(text, "Names");

Console.WriteLine($"Matched Words: {string.Join(", ", matchedWords.Distinct())}");
Console.WriteLine($"Count: {matchedWords.Count}");
Console.WriteLine($"Replaced Text: {text}");

I wrote this code without Visual Studio, VS Code, or LINQPad, so if it has a problem please let me know. (I will check it myself later tonight.)

Jonny Piazzi

It's a good idea to prioritize longer matches. Also, definitely sanitize/standardize your names.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

namespace Rextester
{
    public class Program
    {
        public static void Main(string[] args)
        {
            string[] Names = new string[] { "Sangeen Khan", "AABY","AADLAND","LAND","LAND","SANG",
            "jh", "han", "ngee","SNOW","Jhon Snow","ADEMS","RONALDO"};
            //Names = Standardize(Names);

            string Text = @"I am Sangeen Khan and I am friend of AABY. Jhon is also friend of AABY.
            AADLAND is good boy and he never speak lie. AABY is also good. SANGEEN KHAN is my name.";
            //Text = Standardize(Text);

            List<string> matchedWords = Names.Where(Text.Contains).OrderBy(x => x.Length).Reverse().ToList(); //Prioritize longer matches... 
            matchedWords.ForEach(w => Text = Text.Replace(w, "Names")); //By replacing longer matched names first
            //listBox2.DataSource = matchedWords;
            int numMatchedWords = matchedWords.Count;

            Console.WriteLine("Matched Words: " + matchedWords.Aggregate((i, j) => i + " " + j));
            Console.WriteLine("Count: " + numMatchedWords);
            Console.WriteLine("Replaced Text: " + Text);
        }
    }
}
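As a rough sketch of how "longer matches first" could be combined with whole-word matching, the names can be sorted by length and joined into one regex that is applied in a single pass. This is an assumption about one possible approach, not the code above; `SinglePassNames` and `BuildPattern` are hypothetical names, and the shortened name list is only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SinglePassNames
{
    // Hypothetical helper: builds one case-insensitive alternation of
    // whole-word names, with longer names listed before shorter ones.
    public static Regex BuildPattern(IEnumerable<string> names)
    {
        var alternatives = names
            .Distinct(StringComparer.OrdinalIgnoreCase)   // drop repeated names
            .OrderByDescending(n => n.Length)             // longest first
            .Select(n => Regex.Escape(n));                // treat names as literal text
        return new Regex(@"\b(?:" + string.Join("|", alternatives) + @")\b",
                         RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string[] names = { "Sangeen Khan", "AABY", "AADLAND", "KHAN", "LAND", "SANG", "HAN", "AN" };
        string text = "I am Sangeen Khan and i am friend AABY. Jhon is friend of AABY. " +
                      "AADLAND is good boy and he never speak lie. " +
                      "AABY is also good. SANGEEN KHAN is my name.";

        Regex regex = BuildPattern(names);

        // Collect every match before replacing, so the count reflects the original text.
        List<string> matches = regex.Matches(text).Cast<Match>().Select(m => m.Value).ToList();
        string replaced = regex.Replace(text, "Names");

        Console.WriteLine("Matched Words: " + string.Join(", ", matches.Distinct()));
        Console.WriteLine("Count: " + matches.Count);   // prints "Count: 6" for this text
        Console.WriteLine("Replaced Text: " + replaced);
    }
}
```

Because the text is scanned once, a replacement can never create a new match, and substrings such as "AN" inside "and" are skipped by the `\b` word boundaries. With a list of about a million names a single alternation may be too large to be practical, so treat this as a sketch for small lists.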
C. McCoy IV

This would only work on "whole" words:

string[] Names = new string[] { "Sangeen Khan", "AABY","AADLAND","LAND","LAND","SANG",
"jh", "han", "ngee","SNOW","Jhon Snow","ADEMS","RONALDO"};

string Text = "I am Sangeen Khan and I am friend of AABY. Jhon is also friend of AABY. AADLAND is good boy and he never speak lie.AABY is also good. SANGEEN KHAN is my name.";

string replace = "Names";
foreach(var name in Names)
{
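    // Regex.Escape keeps names containing regex metacharacters from breaking the pattern.
    string pattern = @"\b" + Regex.Escape(name) + @"\b";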
    Text = Regex.Replace(Text, pattern, replace);
}
Console.WriteLine(Text);

Output:

I am Names and I am friend of Names. Jhon is also friend of Names. Names is good boy and he never speak lie.Names is also good. SANGEEN KHAN is my name.

Keep in mind that this is case-sensitive. To make it case-insensitive, the pattern should be:

string pattern = @"(?i)\b" + Regex.Escape(name) + @"\b";

Output for case insensitive:

I am Names and I am friend of Names. Jhon is also friend of Names. Names is good boy and he never speak lie.Names is also good. Names is my name.

Hubbs
  • The problem is in matchedList and Count – Sangeen Khan Aug 23 '17 at 18:06
  • @SangeenKhan When you're iterating (looping) to Replace, make a copy of the text: `string tempCopy = Text;` then do the Replace and after that simply check `if(tempCopy != Text) { if(!matchedList.Contains(name)) matchedList.Add(name); }` – Hubbs Aug 23 '17 at 18:18
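The copy-and-compare idea from the last comment could look roughly like the sketch below. `PerNameReplace` and `ReplaceNames` are hypothetical names, and two things are additions the comment does not mention: counting hits with `Regex.Matches`, and trying longer names first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PerNameReplace
{
    // Hypothetical helper: replaces whole-word occurrences of each name,
    // records which names were found, and totals the number of replacements.
    public static string ReplaceNames(string text, IEnumerable<string> names,
                                      List<string> matchedList, out int count)
    {
        count = 0;
        // Assumption: longer names go first so "Sangeen Khan" is tried before "KHAN".
        foreach (var name in names.OrderByDescending(n => n.Length))
        {
            string pattern = @"\b" + Regex.Escape(name) + @"\b";
            string tempCopy = text;   // copy to compare against after the Replace

            count += Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
            text = Regex.Replace(text, pattern, "Names", RegexOptions.IgnoreCase);

            // The text changed, so this name was present at least once.
            if (tempCopy != text && !matchedList.Contains(name))
            {
                matchedList.Add(name);
            }
        }
        return text;
    }

    public static void Main()
    {
        string[] names = { "KHAN", "Sangeen Khan", "AABY", "AADLAND", "LAND", "SANG", "LAN" };
        string text = "I am Sangeen Khan and i am friend AABY. Jhon is friend of AABY. " +
                      "AADLAND is good boy and he never speak lie. " +
                      "AABY is also good. SANGEEN KHAN is my name.";

        var matchedList = new List<string>();
        int count;
        string replaced = ReplaceNames(text, names, matchedList, out count);

        Console.WriteLine("Matched Words: " + string.Join(", ", matchedList));
        // prints "Matched Words: Sangeen Khan, AADLAND, AABY"
        Console.WriteLine("Count: " + count);   // prints "Count: 6"
        Console.WriteLine("Replaced Text: " + replaced);
    }
}
```

Note that the comparison cannot detect a hit when a name is itself "Names", and that looping over names one at a time will be slow for a very large list.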