I'm trying to find a way to clear out links in a .txt document loaded into the project as a string via StreamReader.

Firstly I need to identify that there is a link (it could be inside of tags, or it could just be out by itself in the middle of a sentence, like http://www.somesite.com )

I found a neat class online called GetStringInBetween, which lets me find all the links in the document. However, I'm struggling to use the same class to match both the found link(s) and a second anchor point. I was aiming for a line break, so that I could replace everything between the line break and the end of the URL, effectively erasing the chunk of text surrounding the URL; these chunks typically say something like "you can visit our site at http:/", etc.

What is the best way to a) identify links in an extremely long string and b) erase them AND some of the text around them?

I'd also like to note that unless I specify Encoding.UTF7, the text comes out all garbled when it's read from the text files. I don't know whether this might be a source of the matching issues.

Thanks ladies and gents :)

dsp_099

1 Answer

First of all, how big is the file that you're trying to parse? If it's at most a few hundred MB, you can load it entirely into RAM, which simplifies things.

The UTF-7 encoding should not bother you: all .NET strings are internally UTF-16, and .NET converts from UTF-7 to UTF-16 when reading the file, so once the text is in a string you don't have to worry about encodings anymore.

After you have it in one big string, your best bet is to use regexps on it. They can replace text as well as find it, so you might be able to "clean" your file in one line of code! Of course, regexps for matching URLs will never be perfect (and even less so for parsing HTML), so expect that parts of some more exotic URLs will slip through now and then. If you want perfection, it might get REALLY tricky.
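
As a rough illustration of the regexp approach, here is a minimal sketch that removes everything from the start of a line through the end of a URL on that line. The class name `LinkCleaner`, the method name, and the URL pattern itself are assumptions made for this example, not something from the question; the pattern only recognizes `http://` and `https://` links and will swallow punctuation that directly follows a URL.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LinkCleaner
{
    // Assumed pattern: any run of non-newline characters, followed by
    // "http://" or "https://", followed by the URL body up to the next
    // whitespace, angle bracket, or double quote.
    private static readonly Regex LineStartToUrl = new Regex(
        @"[^\r\n]*https?://[^\s<>""]+",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the text with each "start of line ... end of URL" span removed.
    // Line breaks themselves are never matched, so they stay in place.
    public static string RemoveLinksWithLeadingText(string text)
    {
        return LineStartToUrl.Replace(text, string.Empty);
    }
}
```

Because `[^\r\n]*` cannot cross a line break, there is no need to match `\r\n` explicitly: the match simply starts at the beginning of the line that contains the URL. For the input `"Hello there.\r\nYou can visit our site at http://www.somesite.com today.\r\nBye."` the call returns `"Hello there.\r\n today.\r\nBye."`.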

Alternatively, if the file is large and you only care about removing one line at a time, you could read the file line by line and process each line separately. If you find a URL in a line, discard the line. If there is no URL, write the line to the target file. That's also very simple to write. You'd still depend on regexps for finding URLs, though.
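
The line-by-line variant could look roughly like the sketch below. The names `LineFilter` and `CopyLinesWithoutUrls` and the simple URL pattern are assumptions for illustration; the method takes a `TextReader` and a `TextWriter` so it works with files as well as in-memory strings.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class LineFilter
{
    // Assumed pattern: "http://" or "https://" followed by non-whitespace.
    private static readonly Regex Url =
        new Regex(@"https?://\S+", RegexOptions.IgnoreCase);

    // Copies every line that contains no URL from reader to writer.
    // Returns the number of lines that were dropped.
    public static int CopyLinesWithoutUrls(TextReader reader, TextWriter writer)
    {
        int dropped = 0;
        string line;
        // ReadLine splits on "\r\n", "\n", or "\r" and strips the terminator.
        while ((line = reader.ReadLine()) != null)
        {
            if (Url.IsMatch(line))
            {
                dropped++;      // the line has a URL: discard it
                continue;
            }
            writer.WriteLine(line);
        }
        return dropped;
    }
}
```

For files, one option is to pass `new StreamReader(inputPath, encoding)` and `new StreamWriter(outputPath)`, where `inputPath`, `outputPath`, and `encoding` are placeholders for your own values. Since `ReadLine` handles the line terminators, the newline characters never have to appear in the pattern.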

Vilx-
  • It's usually under ~100 KB, so I just load 'er up in memory – dsp_099 Dec 09 '11 at 20:39
  • What is the character for a newline? I've tried matching between "\r\n" and the URL, as well as between "\r" or "\n" and the URL separately, and it didn't seem to work. – dsp_099 Dec 09 '11 at 20:39