How to clean a dirty CSV string using PHP regex

Question

my string may be like this:

@ *lorem.jpg,,, ip sum.jpg,dolor ..jpg,-/ ?

in fact - it is a dirty csv string - having names of jpg images

I need to remove any non-alphanum chars - from both sides of the string
then - inside the resulting string - remove the same - except commas and dots
then - remove duplicate commas and dots - if any - replacing them with single ones

so the final result should be:
lorem.jpg,ipsum.jpg,dolor.jpg

I first tried to remove any white space - anywhere

$str = str_replace(" ", "", $str);

then I used various forms of trim functions - but it is tedious and a lot of code

the additional problem is - duplicate commas and dots may have two or more instances - for example - .. or ,,,,

is there a way to solve this using regex, pls ?

Is this helpful : https://stackoverflow.com/questions/659025/how-to-remove-non-alphanumeric-characters — SelVazi, Jan 17 '23 at 10:16
Once you removed the spaces, the regular expression `(\w+\.\w+)` should be enough to extract all the file names using preg_match_all. You can then use implode to join those results with a comma between them. — CBroe, Jan 17 '23 at 10:16
@CBroe - interesting, thanks, I will try. But I suppose duplicates commas and dots are still the problem — provance, Jan 17 '23 at 10:20
Can you try this $result = preg_replace("/[^A-Za-z0-9,.]/", '', $str); — SelVazi, Jan 17 '23 at 10:22
@SelVazi - it works except last comma - but I can remove it by `rtrim`. But it does not remove duplicates commas and dots — provance, Jan 17 '23 at 10:28
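
The extraction idea from the comments above can be sketched as follows. This is only an illustration, not the commenter's exact code: the function name `extractFileNames` is hypothetical, and the extra step that collapses repeated dots is an assumption added here, because `\w+\.\w+` alone would skip a token such as `dolor..jpg`. Note also that `\w` accepts underscores, which is slightly wider than "alphanumeric".

```php
<?php
// Hypothetical helper: pull out the file names instead of cleaning the separators.
function extractFileNames(string $dirty): string
{
    // Drop all whitespace so "ip sum.jpg" becomes "ipsum.jpg".
    $compact = preg_replace('/\s+/', '', $dirty) ?? '';

    // Assumption: collapse runs of dots first so "dolor..jpg" still reads as name.ext.
    $compact = preg_replace('/\.{2,}/', '.', $compact) ?? '';

    // Collect every name.ext token; \w also matches underscores.
    preg_match_all('/\w+\.\w+/', $compact, $matches);

    // Join the tokens with single commas.
    return implode(',', $matches[0]);
}

echo extractFileNames('@ *lorem.jpg,,, ip sum.jpg,dolor ..jpg,-/ ?');
// lorem.jpg,ipsum.jpg,dolor.jpg
```

Because the names are extracted rather than cleaned, duplicate commas never need separate handling: `implode` writes exactly one comma between matches.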

Diego D · Accepted Answer · 2023-01-17T13:10:18.523

Steps modeled on your own description:

Step 1

"remove any non-alphanum chars from both sides of the string"
translated: remove leading and trailing consecutive [^a-zA-Z0-9] characters
regex: replace ^[^a-zA-Z0-9]*(.*?)[^a-zA-Z0-9]*$ with $1

Step 2

"inside the resulting string - remove the same - except commas and dots"
translated: remove any [^a-zA-Z0-9.,]
regex: replace [^a-zA-Z0-9.,] with empty string

Step 3

"remove duplicates commas and dots - if any - replace them with single ones"
translated: replace each run of consecutive dots (or consecutive commas) with a single instance
regex: replace (\.{2,}) with .
regex: replace (,{2,}) with ,

PHP Demo:

https://onlinephp.io/c/512e1

<?php

$subject = " @ *lorem.jpg,,, ip sum.jpg,dolor ..jpg,-/ ?";

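// Step 1: strip non-alphanumeric characters from both ends
$firstStep = preg_replace('/^[^a-zA-Z0-9]*(.*?)[^a-zA-Z0-9]*$/', '$1', $subject);
// Step 2: remove everything except letters, digits, dots and commas
$secondStep = preg_replace('/[^a-zA-Z0-9.,]/', '', $firstStep);
// Step 3: collapse repeated dots, then repeated commas
$thirdStepA = preg_replace('/\.{2,}/', '.', $secondStep);
$thirdStepB = preg_replace('/,{2,}/', ',', $thirdStepA);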

echo $thirdStepB; //lorem.jpg,ipsum.jpg,dolor.jpg
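
As a possible variation, the steps above could be folded into one helper, since `preg_replace` accepts arrays of patterns and replacements and applies them in order. This is a minimal sketch, not part of the original demo: the name `cleanCsvNames` is hypothetical, and using `trim` to strip leftover commas and dots from the ends is an assumption that replaces the step 1 regex.

```php
<?php
// Hypothetical helper name; the character classes are the ones used in the steps above.
function cleanCsvNames(string $dirty): string
{
    $patterns = [
        '/[^a-zA-Z0-9.,]+/', // drop everything except letters, digits, dots, commas
        '/\.{2,}/',          // collapse repeated dots
        '/,{2,}/',           // collapse repeated commas
    ];
    $replacements = ['', '.', ','];

    // Patterns are applied one after another, each on the previous result.
    $clean = preg_replace($patterns, $replacements, $dirty) ?? '';

    // Assumption: once the junk is gone, only commas and dots can remain at the ends.
    return trim($clean, ',.');
}

echo cleanCsvNames(" @ *lorem.jpg,,, ip sum.jpg,dolor ..jpg,-/ ?");
// lorem.jpg,ipsum.jpg,dolor.jpg
```

Trimming at the end rather than at the start gives the same result here, because the first and last alphanumeric characters are the same either way.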

I like to take care of those details. It also helps me improving. Glad it helped and thanks for pointing out the "decoration" aspect — Diego D, Jan 17 '23 at 13:11

SelVazi · Answer 2 · 2023-01-17T10:53:31.293

1

Can you try this:

$string = ' @ *lorem.jpg,,,,  ip sum.jpg,dolor .jpg,-/ ?';
// keep only alphanumerics, commas and dots
$result = preg_replace("/[^A-Za-z0-9,.]/", '', $string);

// collapse repeated commas and repeated dots into single ones
$result = preg_replace('/,+/', ',', $result);
$result = preg_replace('/\.+/', '.', $result);

// strip any commas, semicolons, dots and spaces left at the end
$result = preg_replace("/[ ,;.]*$/", '', $result);
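
The last replacement above only looks at the end of the string. If the input could also start with commas or dots, both ends might be handled with one alternation. A small sketch under that assumption (the sample value of `$result` is made up for illustration):

```php
<?php
// Assumed intermediate value: separators left over on both sides.
$result = ',.lorem.jpg,ipsum.jpg,';

// Remove runs of commas/dots anchored at the start or at the end.
$result = preg_replace('/^[,.]+|[,.]+$/', '', $result);

echo $result; // lorem.jpg,ipsum.jpg
```

`trim($result, ',.')` would do the same job without a regex.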

edited Jan 17 '23 at 10:53

answered Jan 17 '23 at 10:35


tried, works - except duplicate commas and dots – provance Jan 17 '23 at 10:42
I made a small update to remove duplicated commas and dots, can you try it? – SelVazi Jan 17 '23 at 10:53

score 1 · Answer 3 · answered Jan 17 '23 at 10:36

Look at

https://www.php.net/manual/en/function.preg-replace.php

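It replaces anything inside a string based on a pattern. \s represents any whitespace character, but take care with NBSP (non-breaking space); \h matches it.

Adapted from Example 4 on that page:

// \s+ removes every run of whitespace; \s\s+ would leave single spaces in place
$str = preg_replace('/\s+/', '', $str);

It would look something like that.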
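
To illustrate the NBSP remark, here is a minimal sketch that removes ordinary whitespace and non-breaking spaces together. It assumes the input is UTF-8 and PHP 7 or later (for the `\u{...}` escape); the sample string is made up.

```php
<?php
// Sample input: a non-breaking space (U+00A0) inside the name, then two plain spaces.
$str = "ip\u{00A0}sum  .jpg";

// The u modifier treats pattern and subject as UTF-8, so \h can match U+00A0
// as one character instead of a stray byte.
$str = preg_replace('/[\s\h]+/u', '', $str);

echo $str; // ipsum.jpg
```

Without the `u` modifier the pattern works on raw bytes, which is risky for multi-byte characters such as NBSP.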