1

I am trying to clean up a text file so that it can be imported into Excel but the text file contains line breaks within several of the double quoted fields. The file is tab delimited.

Example would be:

"12313"\t"1234"\t"123

5679"
"test"\t"test"\t"test"
"test"\t"test"\t"test"
"12313"\t"1234"\t"123

5679"

I need to remove the line breaks so that it will ultimately display like:

"12313"\t"1234"\t"1235679"
"test"\t"test"\t"test"
"test"\t"test"\t"test"
"12313"\t"1234"\t"1235679"

The "\t" is the tab delimiter.

I've looked at several other solutions on SO but they don't seem to deal with multiple lines. We've tried using several CSV parser solutions but can't seem to get them to work for this scenario. The goal is to pass the entire string into a REGEX expression and have it return with all line breaks between quotes removed while the line breaks outside of the quotes remain.

John Saunders
  • 160,644
  • 26
  • 247
  • 397
Chason Arthur
  • 519
  • 1
  • 11
  • 22
  • 1
    Please show us the regexes you have tried and that have failed. – Kirk Woll Apr 21 '14 at 18:08
  • [Here's someone who had a similar problem](http://stackoverflow.com/q/22947602/1578604) and it turned out that the newlines within the quotes were different from those between the real lines. If you can find it out somehow using a sample within a text editor capable of showing newlines symbol, you might be able to use a very simple regex. If they are the same newline characters, then you'll have to do it in a way a bit more complicated, like anubhava's regex for example. – Jerry Apr 21 '14 at 18:41
  • @Jerry - Do you know of any text editors that will show the actual line breaks in a text file? That would be a huge help in determining if the line breaks are simply \n or something else. Thanks for your comment. – Chason Arthur Apr 21 '14 at 21:41
  • 1
    @ChasonArthur I know that Notepad++ does, apparently there's a package for [SublimeText3](http://stackoverflow.com/q/12125645/1578604) that can show them, and apparently Ctrl+`.` in [Eclipse](http://www.raymondcamden.com/index.cfm/2010/4/30/Eclipse-Tip--Whitespace-characters-visible-when-they-are-turned-off) to show them. I don't know about many editors though, but the regular notepad, wordpad and such don't support it AFAIK. – Jerry Apr 22 '14 at 05:11

5 Answers5

7

You can use this regex:

(?!(([^"]*"){2})*[^"]*$)\n+

Working Demo

This one matches one or more newline character that are not followed by even number of quotes (It assumes there is no escaping exceptions in the data).

anubhava
  • 761,203
  • 64
  • 569
  • 643
1

This worked for me:

var fixedCsvFileContent = Regex.Replace(csvFileContent, @"(?!(([^""]*""){2})*[^""]*$)\n+", string.Empty);

This didnt work:

var fixedCsvFileContent = Regex.Replace(csvFileContent, @"(?!(([^""]*""){2})*[^""]*$)\n+", string.Empty, RegexOptions.Multiline);

Thus one must not add RegexOptions.Multiline when doing the check on the input string.

0

If just removing blank lines works:

string text = Regex.Replace( inputString, @"\n\n", "" , RegexOptions.None | RegexOptions.Multiline );
Derek
  • 7,615
  • 5
  • 33
  • 58
0

I have been running into a similar problem, but also some of the files might be really large. So using a RegEx on everything would be a heavy solution, and instead I wanted to try to make something a bit like ReadLine except that it would ignore breaklines within quotes. This is the solution I am using.

It is an extension to the StreamReader class, used to reading the CSV files and like some of the RegEx solutions here, it ensures there is an even number of quotes. So it uses ReadLine, checks if there is an odd number of quotes and if there is it does another ReadLine until the number of quotes is even:

    public static class Extensions
{
    public static string ReadEntry(this StreamReader sr)
    {
        string strReturn = "";
        //get first bit
        strReturn += sr.ReadLine();

        //And get more lines until the number of quotes is even
        while (strReturn.GetNumberOf("\"").IsOdd())
        {
            string strNow = sr.ReadLine();
            strReturn += strNow;
        }

        //Then return what we've gotten
        if (strReturn == "")
        {
            return null;
        }
        else
        {
            return strReturn;
        }
    }

    public static int GetNumberOf(this string s, string strSearchString)
    {
        return s.Length - s.Replace(strSearchString, "").Length;
    }

    public static Boolean IsOdd(this int i)
    {
        return i % 2 != 0;
    }
}
Kasper Olesen
  • 126
  • 1
  • 11
-1
string output = Regex.Replace(input, @"(?<=[^""])\r\n", string.Empty);

Demo with the input provided

Ulugbek Umirov
  • 12,719
  • 3
  • 23
  • 31