1

I am trying to figure out the regex expression and using PHP to fix a csv file. The format of the csv file should be like this:

id, text field1, text field2
1,some text,another text
2,some text,another text
3,some text,another text

Problem is, sometimes, the "text field1" has a new line character in it, and the content of the csv files becomes like this:

id, text field1, text field2
1,some text,another text
2,some 
text,another text
3,some text,another text

As a result, when reading the file line by line, or exploding using new line to fill an array line by line, I get invalid records. I fixed the above problem using this code:

<?php
$c= file_get_contents($myFile);
$c= preg_replace( '/\n([^0-9])/is', "\n~~$1", $c );
$c= str_replace( "\n~~", " ", $c );
?>

The above regex checks if character after a newline is not numeric, then it adds a ~~ sign there, and I replace the newline and ~~ sign with a white space.

Problem is, sometimes, the "text field1" has numbers in it, and newline appears just before them, like this:

id, text field1, text field2
1,some text,another text
2,some 
1999-06-21 text,another text
3,some text,another text

My regex is not putting a ~~ sign in this case and I get a broken csv file.

How can I modify this above regex to fix this problem? I was hoping for a regex where it can check for "non numeric value (any length) followed by comma sign". Please note that I got over 100000 records in csv file, so if using regex method, then regex must check for non-numeric value of any length.

Something like this:

$c= preg_replace( '/\n([^0-9]*\,)/is', "\n~~$1", $c);

But this above regex does not work, and I don't know that much regex to get this working.

If there is any other way of fixing this csv file, even that is ok. It need not be a regex solution.

Thanks

bobble bubble
  • 16,888
  • 3
  • 27
  • 46
John
  • 33
  • 5
  • Even checking for a numeric value won't necessarily work, what if the value is `4, i am number\n4, some other text`? – Nick May 19 '23 at 08:01
  • Can you be certain the text values don't contain commas? – Nick May 19 '23 at 08:14
  • Hi Nick, some records have a comma sign inside the text field also. These text fields are written inside quotes, but that is not my worry. I coded a function to read the file character by character, made sure only those comma are seen as delimeter field which are not inside quotes. My bigger problem is when the line itself breaks into two separate lines. – John May 19 '23 at 12:08
  • Weird. Normally if escaping is working properly, newlines would also end up being inside quotes as well as commas. – Nick May 19 '23 at 13:58

2 Answers2

2

You can use negative lookahead for this occasion: \n(?!\d+,) (with replacement , to skip to substitutions).

It matches any newline, that is not immediately followed by number,.

In result your code would look like this:

<?php
$c= file_get_contents($myFile);
$c= preg_replace( '/\n(?!\d+,)/g', " ", $c );
?>

Demo here.

It is important to notice, that this solution is still susceptible to error, in case if newline in your cell is immediately followed by number and comma, but to correct this problem in depth, will need to know exact expected format of line and check it, instead of just newline.

markalex
  • 8,623
  • 2
  • 7
  • 32
  • Hi, yes, I got about 9 records in the entire 100000 record file which have a digit followed by comma in the text field, like: 18951,"some \n205, text here","last field of text" I have upvoted your answer. Your regex is simple and good. I wanted to accept both the answers given in this question, but can't do that. I deeply appreciate your help. – John May 19 '23 at 12:15
2

I wouldn't read line by line, as it is totally right that a string can have some new lines in it. But what's missing in this CSV file is some double quotes around the strings. What happens if text field1 or text field2 has a comma inside it? It will break your CSV even more.

As you cannot correct the CSV source to output correctly, we could split the content based on the commas (if we consider you haven't got some in the text). As you know that we only have 3 fields, we know that it should ideally match with something like this:

(\d+),([^,]*),([^,]*?)(?:\r?\n|$)

You can test it here: https://regex101.com/r/YLnNOY/2

In PHP:

<?php

$regex = '/(\d+),([^,]*),([^,]*?)(?:\r?\n|$)/';

$wrong_csv = 'id, text field1, text field2
1,some text,another text
2,some text,another text
3,some text,another text
1,some text,another text
2,some 
text,another text
3,some text,another text
1,some text,another text
2,some 
1999-06-21 text,another text
3,some text,another text';

$replacement = '$1, "$2", "$3"' . "\n";

$corrected_csv = preg_replace(
    $regex,
    $replacement,
    $wrong_csv
);

print $corrected_csv;

You can run the PHP demo here: https://onlinephp.io/c/8aa48

Patrick Janser
  • 3,318
  • 1
  • 16
  • 18
  • Hi, I have accepted your answer. Thankyou so much for helping with this. Deeply appreciated. – John May 19 '23 at 12:10