0

I am trying to clean up some files I get on a quarterly basis. They have a bunch of repeating headers and I would like to replace multiple string values at a single time. I can remove one string at a time, but I don't understand how to stream the file, look at each line, and remove it if it matches String 1 or String 2.

Each file has at least 100,000 to 300,000 lines, and I get between 10 and 50 files each time the data is dumped to me, about once a quarter. It would be easier if they didn't add these lines, but that is not an option.

Sorry for the newbie question, but I don't get to code very often. Any help is appreciated...

static void Main(string[] args)
{
    string tempFile = Path.GetTempFileName();
    string t1 = "-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------";

    string fName = "C:\\PoC\\test\\test.txt";
    using (var sr = new StreamReader(fName))
    using (var sw = new StreamWriter(tempFile))
    {
        string line;

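        while ((line = sr.ReadLine()) != null)
        {
            // Copy the line only if it does not contain the header string.
            if (!line.Contains(t1))
            {
                sw.WriteLine(line);
            }
        }
        // No explicit Close() calls are needed: the using blocks dispose both streams.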
    }

    File.Delete(fName);
    File.Move(tempFile, fName);
}
johnny 5
Stu Ryan
  • Remember that each read advances the stream position to just past the last byte read, and the WriteLine method will write from that position. My advice: keep track of the start position and the new position after reading a line, and overwrite that range with spaces based on your if condition. That is if you want to do it in place; otherwise you can build a new file in memory and overwrite the entire file. – mijail Aug 27 '15 at 16:17

3 Answers

1

Calling string.Contains() is almost as expensive as calling string.Replace(), because in either case the entire line must be scanned for your substring. When Replace() finds a match, it creates and returns a new string containing the result of the replacement; otherwise it returns the original string. Change

if (line.Contains(t1) == false)
    sw.WriteLine(line);

to

sw.WriteLine(line.Replace(t1, whatYouWantToReplaceWith));

If you are replacing multiple values in a single line, you can write

sw.WriteLine(
    line
    .Replace(t1, whatYouWantToReplaceWith1)
    .Replace(t2, whatYouWantToReplaceWith2)
    .Replace(t3, whatYouWantToReplaceWith3)
);

Note that using multiple .Replace() will cause the line to be scanned for matches multiple times. Though this reduces performance slightly, most of your processing time will probably still be file IO.

If you know that the replacement will only ever happen e.g. in the first line, you can add a counter to track what line number you are on and only apply the Replace() code to appropriate line(s).
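As a rough sketch of the line-counter idea, assuming the replacement only applies to the first few lines: the class name `HeaderCleaner`, the method name, and the `linesToFix` parameter are all made up for illustration and do not come from the question.

```csharp
using System;
using System.IO;

static class HeaderCleaner
{
    // Copies every line from input to output, but only runs Replace()
    // on the first `linesToFix` lines (hypothetical parameter).
    public static void ReplaceInLeadingLines(
        TextReader input, TextWriter output,
        string target, string replacement, int linesToFix)
    {
        int lineNumber = 0;
        string line;
        while ((line = input.ReadLine()) != null)
        {
            lineNumber++;
            // Lines past the cutoff are written unchanged, so they are
            // never scanned for the target string.
            output.WriteLine(lineNumber <= linesToFix
                ? line.Replace(target, replacement)
                : line);
        }
    }
}
```

With a `StreamReader` and `StreamWriter` passed in, this drops into the same using blocks as the original loop.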

Note that you might get some additional improvement on a large file by using a BufferedStream.
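A minimal sketch of what wrapping the file streams in a BufferedStream could look like. The 1 MB buffer size and the names `BufferedCopy` and `RemoveLines` are arbitrary choices for illustration. FileStream and StreamReader already buffer internally, so any gain is worth measuring on your own files before relying on it.

```csharp
using System.IO;

static class BufferedCopy
{
    // 1 MB buffer; an arbitrary size picked for illustration only.
    private const int BufferSize = 1 << 20;

    // Copies sourcePath to destPath, skipping lines that contain `unwanted`.
    public static void RemoveLines(string sourcePath, string destPath, string unwanted)
    {
        using (var inStream = new BufferedStream(File.OpenRead(sourcePath), BufferSize))
        using (var outStream = new BufferedStream(File.Create(destPath), BufferSize))
        using (var reader = new StreamReader(inStream))
        using (var writer = new StreamWriter(outStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (!line.Contains(unwanted))
                {
                    writer.WriteLine(line);
                }
            }
        }
    }
}
```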

UPDATE

Based on the statement that you just want to remove the line, I suggest you go with @Eser's answer.

Eric J.
  • Ok this is a really simple way of doing this... Thank you. One question, the strings just need to be removed. So what should I use to replace the line with? I tried "line" which is what worked before (honestly I don't know why). I also tried "". But that left a blank line, where I want to remove the line entirely. – Stu Ryan Aug 27 '15 at 17:00
  • Use `string.Replace(t1, "")` – Eric J. Aug 27 '15 at 17:07
  • Eric, lost me there. Are you saying line.replace to string.replace? Sorry for the dumb question. – Stu Ryan Aug 27 '15 at 17:28
  • If you want to just remove whatever is in `t1`, use `string.Replace(t1, "")`. Are you trying to remove *the entire line* when there is a match or just remove what is in `t1` from the line but keep the rest of the line? – Eric J. Aug 27 '15 at 17:32
  • Just remove the line. Sorry if I did not make that clear. – Stu Ryan Aug 27 '15 at 17:48
1

I would like to replace multiple string values at a single time.

Using LINQ can make your code simpler:

string[] stringsToRemove = new[] { "str1", "str2", "str3" };

var query = File.ReadLines(fName)
                .Where(line => !stringsToRemove.Any(s => line.Contains(s)));

File.WriteAllLines(tempFile, query);
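For completeness, here is a sketch of how the LINQ version might sit inside the temp-file swap from the question. The class and method names (`LineFilter`, `RemoveMatchingLines`) are made up for illustration. One assumption worth flagging: File.WriteAllLines opens the temp file itself, so it should not run inside a using block that still holds a StreamWriter on the same `tempFile`. That could be one cause of a "being used by another process" IOException, though this is a guess rather than a confirmed diagnosis.

```csharp
using System;
using System.IO;
using System.Linq;

static class LineFilter
{
    // Rewrites fName in place, dropping every line that contains
    // any of the given strings.
    public static void RemoveMatchingLines(string fName, string[] stringsToRemove)
    {
        string tempFile = Path.GetTempFileName();

        // ReadLines streams the file lazily, so the whole file is
        // never held in memory at once.
        var query = File.ReadLines(fName)
                        .Where(line => !stringsToRemove.Any(s => line.Contains(s)));

        // WriteAllLines enumerates the query, then closes both files.
        File.WriteAllLines(tempFile, query);

        // Swap the cleaned copy in for the original.
        File.Delete(fName);
        File.Move(tempFile, fName);
    }
}
```

Calling it once per file, for example in a loop over `Directory.GetFiles(folder, "*.txt")`, would cover the multi-file case.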
Eser
  • He clarified in a comment that he does want to remove the entire line rather than replace tokens, so your take on the question is correct. – Eric J. Aug 27 '15 at 17:53
  • I am getting an unhandled exception error. Any thoughts? System.IO.IOException was unhandled HResult=-2147024864 Message=The process cannot access the file 'C:\Users\xxxx\AppData\Local\Temp\tmpB784.tmp' because it is being used by another process. Source=mscorlib – Stu Ryan Aug 27 '15 at 18:27
  • @StuRyan Close/Open your VS and try again.... Seems, It is not related with the code you write – Eser Aug 27 '15 at 18:30
  • @Eser Tried that... still the same exception. – Stu Ryan Aug 27 '15 at 21:04
1

I know you are working on a C# program, but if the purpose is simply to remove the lines that match a pattern, I'd use something like the Unix stream editor, sed. See sed for Windows as a standalone command, or Cygwin. You can use a single command to remove all lines that match the pattern, and the lines following them as well if needed, and you can write a .bat script to copy, rename, and remove lines that match more than one pattern. It is really fast as well.

sed -i '/^---------/d' filepath
Ash
  • This is really interesting... I will have to look into it. Thank you. – Stu Ryan Aug 27 '15 at 17:01
  • sed is a nice option if the C# program would be a stand alone thing. If it is part of a software package that is already being deployed, I would not install something extra. – Eric J. Aug 27 '15 at 17:10
  • This is just a way to simplify my life. I download the files to my computer (1-5 GB), clean them up, and then load them into a database. It's ugly, but it is what it is... The files are log file dumps and my management wants a report on them. The person before me used find in Notepad and spent a day or two doing this... just trying to be smart here. So it is interesting to create a bat file to take care of it... I also want to learn more coding, so both are great right now. This is step one; next is multiple files, and then automating the load to the DB... – Stu Ryan Aug 27 '15 at 17:26
  • Sure for individual environments or small business. For a large enterprise, the more components you have, the more complicated it gets. Deploying them, ensuring you have latest security patches for them, etc. I'm a fan of sed, just suggesting to be mindful of whether it's the best solution for the OP. – Eric J. Aug 27 '15 at 17:30