6

I have a CSV file where some of the cells contain a newline character. For example:

id,name 
01,"this is
with newline"
02,no newline 

I want to remove all the newline characters inside cells.

How can I do this with a regex or other terminal tools, generically, without knowing the number of columns in advance?

Mert Nuhoglu
  • Is it possible to tweak the process creating the file to strip newlines from the columns *before* the file is actually created/written? – code_dredd Nov 30 '15 at 08:36
  • Sure, @ray if it works – Mert Nuhoglu Nov 30 '15 at 08:40
  • What do you mean "if it works"? It would work if you update the process creating the files. You'd want to remove newline chars from each column in a row before the row gets written to the file by the program. – code_dredd Nov 30 '15 at 08:51
  • I just meant "sure, if it works, that's fine". Please write your answer as code. That will be easier to test. – Mert Nuhoglu Nov 30 '15 at 08:54
  • Mind posting the code you've written so far, meaning the code responsible for creating the files in the first place? That would be a good starting point. – code_dredd Nov 30 '15 at 09:09
  • The code responsible for creating the files is not really related to the solution. It uses `xlsx2csv` tool to convert xlsx files to csv. – Mert Nuhoglu Nov 30 '15 at 09:29
  • Mert - @ray is suggesting you fix this `xlsx2csv` tool to not include newlines in the first place. Can your CSV file include escaped double quotes inside a cell? If so, are they escaped by doubling them `"foo""bar"` or by putting a backslash in front of them `"foo\"bar"`? – Ed Morton Nov 30 '15 at 16:06
  • @EdMorton is correct. That's what I had in mind. However, I've since posted a working solution to this problem, so it's probably a moot point now. – code_dredd Nov 30 '15 at 20:11
  • @MertNuhoglu: I wrote the answer in code. You should look at it. – code_dredd Dec 01 '15 at 06:34

6 Answers

6

This is actually a harder problem than it looks and, in my opinion, one where regex isn't the right tool. Because you're dealing with quoted/escaped strings that span multiple 'lines', you end up with a complicated, difficult-to-read regex. (It's not impossible, just messy.)
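To illustrate why line-oriented matching struggles here, the short sketch below compares naive line splitting with a real CSV parser on the sample data from the question. Python is used purely for illustration, and the variable names are made up for this example.

```python
import csv
import io

# Sample data from the question: the second record spans two physical lines.
data = 'id,name\n01,"this is\nwith newline"\n02,no newline\n'

# Naive approach: treat every physical line as a record.
naive_records = data.splitlines()

# Parser approach: csv.reader keeps reading until the quoted field is closed.
parsed_records = list(csv.reader(io.StringIO(data, newline='')))

print(len(naive_records))   # 4 physical lines
print(len(parsed_records))  # 3 logical records
print(parsed_records[1])    # the cell still contains its embedded newline
```

Any line-based tool sees four records where the parser sees three, so it has to track quote state across lines to get the right answer.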

I would suggest using a parser instead. Perl has one in Text::CSV, and it goes a bit like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Text::CSV;

my $csv = Text::CSV->new( { binary => 1, eol => "\n" } );

while ( my $row = $csv->getline( \*ARGV ) ) {
    s/\n/ /g for @$row;
    $csv->print( \*STDOUT, $row );
}

This takes input either piped in or from files named on the command line. That's what \*ARGV does: it's a special filehandle that behaves basically the way sed does:

somecommand.sh | myscript.pl
myscript.pl filename_to_process

The ARGV filehandle does either automagically. (You could explicitly open a file or use \*STDIN if you prefer.)

Sobrique
3

I suspect that instead of removing the newline you actually want to replace it with a space. If your input file is as simple as it looks this should do it for you:

$ awk '{ORS=( (c+=gsub(/"/,"&"))%2 ? FS : RS )} 1' file
id,name
01,"this is with newline"
02,no newline
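
One way to read the one-liner above: it keeps a running count of double quotes and ends a record only when that count is even. The Python sketch below spells out the same quote-parity idea. The function name and the `joiner` parameter are hypothetical, and like the awk version it assumes quotes inside cells are escaped by doubling them (`""`), not with a backslash.

```python
def join_quoted_lines(lines, joiner=' '):
    """Join physical lines until the running count of double quotes is even."""
    records = []
    pending = []
    quote_count = 0
    for line in lines:
        pending.append(line)
        quote_count += line.count('"')
        # An even running total means every opened quote has been closed.
        if quote_count % 2 == 0:
            records.append(joiner.join(pending))
            pending = []
    if pending:
        # Unterminated quote at end of input: emit what is left.
        records.append(joiner.join(pending))
    return records


sample = ['id,name', '01,"this is', 'with newline"', '02,no newline']
for record in join_quoted_lines(sample):
    print(record)
```

Doubled quotes add two to the count, so they do not disturb the parity. A backslash-escaped quote would, which is why this approach only suits simple input.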
Ed Morton
2

If you are using this xlsx2csv tool, it has this option:

-e, --escape          Escape \r\n\t characters

Use it, then replace the escaped \n as needed. For example, if \n should be replaced by the empty string:

sed 's/\\n//g' filein.csv > fileout.csv

In one pass:

PATH/TO/xlsx2csv.py -e filein.xlsx | sed 's/\\n//g' > fileout.csv
Walter Tross
1

How to do it with regex or with other terminal tools generically without knowing number of columns in advance?

I don't think a regex is the most appropriate approach; it might end up being quite complicated. Instead, I think a separate program to process the files would be easier to maintain in the long term.

Since you're OK with any terminal tools, I've chosen python, and the code's below:

#!/usr/bin/python3 -B

import csv
import sys

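with open(sys.argv[1]) as csvfile:
    reader = csv.reader(csvfile)
    # csv.writer re-quotes cells that contain commas or double quotes
    writer = csv.writer(sys.stdout, lineterminator='\n')
    for row in reader:
        # replace newlines embedded in a cell with a space
        writer.writerow([col.replace('\n', ' ') for col in row])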

I think the code above is very straightforward and easy to understand, without a need for complicated regular expressions.

The input file here has the following contents:

id,name
01,"this is
with newline"
02,no newline

To prove it works, its output is reproduced below:

➜  ~  ./test.py input.csv
id,name
01,this is with newline
02,no newline

You could call the python script from some other program and feed filenames to it. You just need to add a minor update for the python program to write out files, if that's what you really need.

I've replaced the newlines with spaces to avoid a potentially unwanted concatenation (e.g. this iswith newline), but you can replace the newline with whatever you want, including the empty string ''.
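
As a rough sketch of that "minor update", the same logic can be wrapped in a function that copies from any readable stream to any writable one. The function name `strip_cell_newlines` and its `replacement` parameter are hypothetical, and the sample adds a comma inside the cell to show that `csv.writer` puts the quotes back where they are needed.

```python
import csv
import io


def strip_cell_newlines(src, dst, replacement=' '):
    """Copy CSV rows from src to dst, replacing newlines inside cells."""
    reader = csv.reader(src)
    # csv.writer re-quotes cells that contain commas or double quotes.
    writer = csv.writer(dst, lineterminator='\n')
    for row in reader:
        writer.writerow(
            [cell.replace('\r\n', replacement).replace('\n', replacement)
             for cell in row]
        )


# In-memory demonstration; real use would pass open file objects instead.
sample = io.StringIO('id,name\n01,"this is\nwith, newline"\n02,no newline\n',
                     newline='')
out = io.StringIO()
strip_cell_newlines(sample, out)
print(out.getvalue(), end='')
```

To process real files, pass file objects opened with `newline=''` as `src` and `dst`; `sys.stdin` and `sys.stdout` should work as well.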

code_dredd
0

I have written a method to remove the embedded newlines inside cells. The method below returns a java.util.List containing all rows in the CSV file:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Returns one entry per logical CSV row. Physical lines that belong to a
// quoted cell with an embedded newline are joined into a single row.
List<String> getAllRowsInCSVFileAsList(File selectedCSVFile){
  List<String> values = new ArrayList<String>();
  // try-with-resources closes the reader (and the wrapped FileReader)
  try(BufferedReader reader = new BufferedReader(new FileReader(selectedCSVFile))){
      StringBuilder previousLine = new StringBuilder();
      boolean insideQuotedCell = false;
      String line;
      while((line = reader.readLine()) != null){
          // readLine() drops the line terminator, so appending removes the newline
          previousLine.append(line);
          // an odd number of quotes on a line opens or closes a quoted cell
          if(getTotalNumberOfCharacterSequenceOccurrenceInString("\"", line) % 2 != 0){
              insideQuotedCell = !insideQuotedCell;
          }
          if(!insideQuotedCell){
              values.add(previousLine.toString());
              previousLine.setLength(0);
          }
      }
  }catch(IOException ie){
      ie.printStackTrace();
  }
  return values;
}

// Counts non-overlapping occurrences of characterSequence in text.
int getTotalNumberOfCharacterSequenceOccurrenceInString(String characterSequence, String text){
  int count = 0;
  int index = text.indexOf(characterSequence);
  while(index != -1){
      count++;
      index = text.indexOf(characterSequence, index + characterSequence.length());
  }
  return count;
}

Imagine you are creating a CSV file with one row and five columns, and the 4th cell contains an embedded newline (Enter pressed inside the cell).

Your data will look like this (there is really only one row in the CSV, but opened in Notepad it looks like two rows):

dinesh,kumar,24,"23 
tambaram india",green

A cell with an embedded newline looks like this:

"23
tambaram india"

That cell starts with a double quote (") and ends with a double quote (").

So while reading a line, an unmatched double quote (") tells us that the cell contains an embedded newline.

The code concatenates the next line to that line and checks whether the closing double quote (") is present. If it is, it adds a new row to the java.util.List; otherwise it concatenates the following line, checks again for the closing double quote ("), and so on. I have explained this for one cell, but the method also works if the row has many cells with embedded newlines.

Robert Longson
-2

Open the CSV file with Notepad++ and press Ctrl+H. On the Replace tab, enter the newline in the search box, then enter the text you want to replace it with, or leave that box empty if you just want to remove it.

Lefteris Bab
  • I want to automate the process. Therefore I want to use only terminal tools. – Mert Nuhoglu Nov 30 '15 at 08:41
  • How do you create the CSV file? With Java or by hand? – Lefteris Bab Nov 30 '15 at 08:42
  • 1
    @LefterisBab: He wants to automate the process. If the files were being created "by hand" he would presumably be removing the unwanted newline chars manually before saving the files. He also mentioned he wanted to use terminal tools only, so your suggestion to startup Notepad++ or some other non-shell program is not useful. – code_dredd Nov 30 '15 at 08:52
  • So @ray This is already in stackoverflow. http://stackoverflow.com/questions/1251999/how-can-i-replace-a-newline-n-using-sed – Lefteris Bab Nov 30 '15 at 08:58
  • 1
    No @LefterisBab that solution will replace all newline characters whether it is inside a cell or not. – Mert Nuhoglu Nov 30 '15 at 09:02