0

I'm trying to extract only the text content of a web page and display it. I use HtmlAgilityPack for the extraction, but the returned text also includes JavaScript and CSS, which I don't want. To delete it, I'm trying to detect the { } delimiters and remove everything between them. I use a regex for that, but it doesn't work because the braces are nested. This is the regex I'm trying:

string regex = "\t|\n|<.*?>|(\\[.*\\])|(\".*\")|('.*')|(\\(.*\\))|{\\[.*\\]}|{\".*\"}|{'.*'}|{\\(.*\\)}";
TextArea1.Value = Regex.Replace(s, regex, "");

Input Text:

Los Angeles Times - California, national and world news - Los Angeles Times;},svginImg:function;a.onload=function{var a=navigator.userAgent||navigator.vendor||window.opera;return/;},isIE9:function==9;}},notmobileCalccheck:function;a.style.cssText=;return !!a.style.length;},isAndroidBrowser:function{var a=navigator.userAgent||navigator.vendor;return/android/i.test&&!window.opera;},isSupportedBrowser:function&&!window.opera;},getScreenWidth:function;},isSupported:function isSupported{a=sessionStorage==;}else{try{a=this.supportsSvg{a=false;}}if<=8;}};trb.utils.redirect=function;b.name=;document.body.appendChild;b.submit;if{localStorage=d;}else{for{var c={};for{c;}return null;},remove:function remove;localStorage.removeItem{var b=localStorage;if;a=),f;for;}}},remove:function remove{a.trb=a.trb||{};trb.data=trb.data||{};trb.data.isMobile=trb.browsersupport.isMobile;trb.data.isIE9=trb.browsersupport.isIE9;trb.data.facebookAppId=;trb.data.parentSectionPath=);}if;}trb.data.isSectionFront=true;if;}trb.data.videos={};trb.data.videos.ndnFallbackJsURL=;trb.data.initialpathname=;trb.data.pages=trb.data.pages||{};trb.data.pages={};trb.data.pages.unsupportedBrowserPath=;trb.svg={};trb.svg.data={};trb.svg.data.svgStrings={};trb.svg.data.svgStrings.logoShort=;trb.svg.data.svgStrings.logo=;trb.svg.data.svgStrings.loadingCircle=;trb.svg.data.map={mastheadLogo:{colors:{PRIMARY_COLOR:},string:trb.svg.data.svgStrings.loadingCircle}}; { background: #404040; } .trb_allContentWrapper { background: #333; }

Fadi
  • 1
    Can you show the desired output as well – Arijit Mukherjee Jun 09 '14 at 05:15
  • 2
    This looks like **JavaScript** and **CSS**, and you also have **unbalanced braces** (`;}`), *and* **nested braces** (`{mastheadLogo:{colors:{PRIMARY_COLOR:}`). You should find a dedicated parser for that language. – Kobi Jun 09 '14 at 05:19
  • 1
    Showing something other than HTML would be less controversial... You need to be much more specific to avoid the danger of this being closed as a duplicate of the all-time best [regex match open tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?rq=1)... – Alexei Levenkov Jun 09 '14 at 05:20
  • 1
    There are [many questions about matching balanced delimiters](http://stackoverflow.com/search?q=match+balanced+delimiters) here on SO. It’s harder than it appears, at least if there can be escaped or quoted delimiters, nested ones, or unbalanced ones. Although [it is not impossible](http://stackoverflow.com/a/4843579), one is forced to observe that [neither is it easy](http://stackoverflow.com/a/4234491) — nor perhaps even worthwhile. A non-HTML solution is not that hard: just keep deleting the innermost paired bits until you run out: `while (s/\{[^{}]*\}//g) { continue }` maybe; ask again. – tchrist Jun 09 '14 at 18:32
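The "keep deleting the innermost pair" idea from the last comment is written in Perl; a rough C# equivalent might look like the sketch below. The class and method names (`InnermostBraceStripper`, `Strip`) are made up for illustration, and the sketch treats every brace as a delimiter, so braces inside strings or comments are not handled.

```csharp
using System.Text.RegularExpressions;

public static class InnermostBraceStripper
{
    // Matches one innermost pair: "{", then anything except braces, then "}".
    private static readonly Regex Innermost = new Regex(@"\{[^{}]*\}");

    // Hypothetical helper: keeps deleting innermost pairs
    // until a full pass changes nothing.
    public static string Strip(string input)
    {
        string previous;
        string current = input;
        do
        {
            previous = current;
            current = Innermost.Replace(previous, "");
        } while (current != previous);
        return current;
    }
}
```

Calling `InnermostBraceStripper.Strip("a{b{c}d}e")` removes `{c}` on the first pass and `{bd}` on the second, leaving `ae`. An unmatched `}` is left in place.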

5 Answers

2

I have been using HtmlAgilityPack to load a web page and extract only its text content. When I extracted the text, the CSS and JavaScript came along with it, so I tried a regex that detects the { } delimiters to strip them from the output, but that proved hard. I then tried another way that works and is much simpler: using Descendants() from HtmlAgilityPack to remove the script, style, and comment nodes before reading the text. My code is:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
    .ToList()
    .ForEach(n => n.Remove());

string s = doc.DocumentNode.InnerText;
TextArea1.Value = Regex.Replace(s, @"\t|\n|<.*?>", "");

I found this approach here: THIS LINK

Everything works now.

Fadi
1

Why don't you simply try:

/\{.*?\}/g

and replace with nothing.
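Since the question mentions nested braces, it may help to see what a lazy pattern does on such input. The sketch below is only an illustration: the helper name `StripLazy` and the sample strings are made up, and the JavaScript-style `/.../g` form is rewritten as a plain `Regex.Replace` call, which already replaces every match.

```csharp
using System.Text.RegularExpressions;

public static class LazyBraceDemo
{
    // Applies the lazy pattern from this answer and deletes each match.
    public static string StripLazy(string input)
    {
        return Regex.Replace(input, @"\{.*?\}", "");
    }
}
```

For flat input such as `{ab} c` this leaves ` c`. For nested input such as `{a{b}c}d`, the lazy match stops at the first `}`, so only `{a{b}` is removed and `c}d` remains. Note also that `.` does not match a newline by default, so a block spanning several lines is not matched at all.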

aelor
1

You have nested braces.

In Perl, PHP, Ruby, you could match the nested braces using (?R) (recursion syntax). But .NET does not have recursion. Does this mean we are lost? Luckily, no.

Balancing Groups to the Rescue

C# regex cannot use recursion, but it has an awesome feature called balancing groups.

This regex will match complete nested braces.

(?<counter>{)(?>(?<counter>{)|(?<-counter>})|[^{}]+)+?(?(counter)(?!))

For instance, it will match

  1. {sdfs{sdfs}sd{d{ab}}fs}
  2. {ab}
  3. But not {aa
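To try the pattern from C#, something like the following sketch could be used. The class and method names (`BalancedBraceStripper`, `Strip`) are hypothetical, and the literal braces are written as `\{` and `\}` purely for readability; the pattern is otherwise the one shown above.

```csharp
using System.Text.RegularExpressions;

public static class BalancedBraceStripper
{
    // The balancing-group pattern from this answer, with braces escaped.
    // Each "{" pushes onto the "counter" stack, each "}" pops it, and the
    // final conditional only succeeds once the stack is empty again.
    private const string Pattern =
        @"(?<counter>\{)(?>(?<counter>\{)|(?<-counter>\})|[^{}]+)+?(?(counter)(?!))";

    // Hypothetical helper: deletes every balanced {...} block from the input.
    public static string Strip(string input)
    {
        return Regex.Replace(input, Pattern, "");
    }
}
```

With this, `BalancedBraceStripper.Strip("a{b{c}d}e")` gives `ae`, while an unbalanced input such as `{aa` is returned unchanged because the counter never gets back to zero.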
zx81
  • This question really isn't worth your time: it is unclear what the OP is trying to do (maybe just `^(.*?);}` ?), and the sample text is a mess. Besides, JavaScript and CSS can contain strings and comments with more braces. It's a fun exercise, but it is utterly wasted on this poor question. – Kobi Jun 09 '14 at 06:01
  • Also - A simple way to match nested structures is `(?>(?<counter>{)|(?<-counter>})|[^{}])+(?(counter)(?!))`. It may not perform as well, but it works. – Kobi Jun 09 '14 at 06:03
  • 1
    @Kobi Thanks, GREAT hearing from you, I love your regex-awesome-blog and hugely respect your skills. That being said, sorry but I don't think your regex works. Try it against `{a{bc}d{e{fg}}h} {klm} {nop} keep1=>} keep2` First off, there is no rule that it HAS TO have braces, so it will happily match ` keep2`, leaving the counter at zero. Second, it will overshoot the match, and return a single match that causes us to delete some chars we do want: `{a{bc}d{e{fg}}h} {klm} {nop} keep1=>` In contrast, my regex forces a `{` and matches correctly... until shown otherwise. :) – zx81 Jun 09 '14 at 06:09
  • 1
    Good point! The pattern should be lazy: `(?>(?<counter>{)|(?<-counter>})|[^{}])+?(?(counter)(?!))`. If we want at least one pair of outer braces, we can do `\{(?>(?<counter>{)|(?<-counter>})|[^{}])+?(?(counter)(?!))\}`. You are right - the question (probably) asks for that specifically, and I ignored that. Thanks! – Kobi Jun 09 '14 at 07:00
0

To match every span from '{' to '}', including every character between the pair that isn't '}', use the following:

/\{[^\}]+\}/g
whoisj
-3
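// Repeatedly delete the innermost "{...}" pair from s (the question's input string).
int from = 0;
while (true)
{
    int close = s.IndexOf('}', from);
    if (close < 0) break;                          // no closing brace left
    int open = close > 0 ? s.LastIndexOf('{', close - 1) : -1;
    if (open < 0) { from = close + 1; continue; }  // unmatched '}': skip it
    s = s.Remove(open, close - open + 1);          // strings are immutable, so reassign
    from = open;                                   // resume where the pair was removed
}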
user2508754