11

The app basically works like this:

1) The user uploads a CSV file.

2) PHP receives the file via POST.

3) I open the file with fopen() and read the file with fgetcsv().

The first column always starts with the \ufeff character. I know that it is called the UTF-8 BOM and that Microsoft Excel generates it, but when I try to remove it, I can't.

I've tried: str_replace('\ufeff', '', $columns[0]);

Angel Luis
  • `'\ufeff'` is not a valid C escape, and wouldn't work in single quotes anyway. – mario Jan 11 '19 at 10:57
  • `0xFEFF` is the UTF-16 big-endian byte order marker ([reference](https://en.wikipedia.org/wiki/Byte_order_mark#Byte_order_marks_by_encoding)). It doesn't make much sense to just strip it... :-? – Álvaro González Jan 11 '19 at 10:59
  • @ÁlvaroGonzález I don't understand at all why I can't strip that. – Angel Luis Jan 11 '19 at 11:03
  • 1
    Well... If your application uses UTF-8 (as your question implies) and you read a UTF-16 file assuming it's UTF-8 you'll either end up with corrupt data (if you merge UTF-8 chars with UTF-16 chars) or you'll have a correct UTF-16 stream that cannot be decoded because it misses the byte-order information. – Álvaro González Jan 11 '19 at 12:04
  • This was closed as a duplicate, but the linked question didn't apply here because we're talking about a (mandatory) UTF-16 BOM (the UTF-8 BOM is `0xEFBBBF` and it's optional). – Álvaro González Jan 11 '19 at 12:07
  • 1
    this character appearing at the beginning of a file as BOM is sort of a separate issue compared to when we may encounter this character in other situations. For example when we are programmatically dealing with strings coming out of a database sometimes this character randomly appears at the beginning or end. Besides, usually what's happening is that whatever software we're trying to use cannot interpret this character so it's quite valid to ask how can we zap it out of existence in order to get on with our lives. – Steven Lu Dec 19 '21 at 21:38
  • @ÁlvaroGonzález https://softwareengineering.stackexchange.com/q/370088/36174 contests your point about how this codepoint is mandatory as the BOM in UTF-16. BOM does not seem to be mandatory. – Steven Lu Dec 19 '21 at 21:41
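
To illustrate the point about escapes raised in the comments, here is a minimal sketch. It assumes PHP 7.0 or later (for the `"\u{...}"` syntax) and a file that really is UTF-8; the `$cell` value is a made-up stand-in for what `fgetcsv()` might return for the first column.

```php
<?php
// In single quotes PHP does no escape processing: this is the six
// literal characters \ u f e f f, so str_replace() never finds a match.
$literal = '\ufeff';

// In double quotes, "\u{FEFF}" is encoded as UTF-8, i.e. the bytes EF BB BF.
$bom = "\u{FEFF}";

// Hypothetical first cell of a CSV saved with a UTF-8 BOM.
$cell = "\xEF\xBB\xBFname";

echo strlen($literal), "\n";              // 6
echo bin2hex($bom), "\n";                 // efbbbf
echo str_replace($bom, '', $cell), "\n";  // name
```

Writing the bytes explicitly as `"\xEF\xBB\xBF"` is equivalent and also works on PHP versions older than 7.0.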

4 Answers

17
$columns[0] = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $columns[0]);

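The above code removes ASCII control characters (bytes 0x00-0x1F) and every non-ASCII byte (0x80-0xFF) from the first column, which includes the BOM bytes you mentioned. Because accented letters and other multibyte UTF-8 characters are made of bytes in that range, they are removed as well.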

pr1nc3
  • Possible duplicate of [How do I remove ï»¿ from the beginning of a file?](//stackoverflow.com/a/20692384). And this removes quite a bit more. – mario Jan 11 '19 at 11:00
  • It works, thank you. Can I know what exactly I'm removing with that regex? – Angel Luis Jan 11 '19 at 11:02
  • 1
    You are removing special characters. – pr1nc3 Jan 11 '19 at 11:03
  • @pr1nc3 On each page I see a different regex for doing the same task. – Angel Luis Jan 11 '19 at 11:05
  • We are talking about a BOM, so why not just test the first 2 bytes and discard those bytes? No replace or search. A BOM can only be the first code point. – Giacomo Catenazzi Jan 11 '19 at 13:18
  • While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Yunnosch Oct 25 '21 at 09:32
  • 2
    Be aware, this also removes accents – Cyprien Jan 09 '22 at 19:19
  • 1
    this is not right, it will remove things like: ’ – HelpNeeder Sep 01 '22 at 02:33
  • @Cyprien thanks for noting. I believe https://stackoverflow.com/a/68841510/720323 should be the right answer. – HelpNeeder Sep 01 '22 at 02:38
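
The side effect mentioned in the comments can be reproduced with a short sketch. The `$cell` value is a hypothetical header cell, and the anchored pattern is shown only as one possible narrower alternative, assuming the input is UTF-8.

```php
<?php
// Hypothetical header cell: UTF-8 BOM followed by "café" ("é" is the bytes C3 A9).
$cell = "\xEF\xBB\xBFcaf\xC3\xA9";

// The broad character class drops every non-ASCII byte, so "é" is lost too.
$broad = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $cell);

// Matching the exact three-byte sequence at the start leaves the rest intact.
$anchored = preg_replace('/^\xEF\xBB\xBF/', '', $cell);

var_dump($broad === 'caf');              // bool(true)
var_dump($anchored === "caf\xC3\xA9");   // bool(true)
```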
10
$result = trim($result, "\xEF\xBB\xBF");

This is the simplest way to solve it.

Terry Lin
  • While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Yunnosch Oct 25 '21 at 09:31
  • Thank you! This is the solution I was looking for! @Yunnosch oh please... – HelpNeeder Sep 01 '22 at 02:35
  • Good one. Thanks. No side effects like removing accents in strings. – Qrzysio Oct 26 '22 at 09:57
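
One caveat worth knowing: the second argument of `trim()` is a list of individual bytes, not a sequence, and it is applied to both ends of the string. A value whose last character happens to end in `\xBB` or `\xBF` (for example "»" or "ÿ" in UTF-8) can therefore lose a byte. The sketch below shows this with a made-up value and a hypothetical helper, `strip_utf8_bom()`, that compares the BOM as a whole prefix instead.

```php
<?php
// Hypothetical helper: remove one leading UTF-8 BOM, compared as a whole prefix.
function strip_utf8_bom(string $s): string
{
    $bom = "\xEF\xBB\xBF";
    return strncmp($s, $bom, 3) === 0 ? substr($s, 3) : $s;
}

// Made-up cell: BOM, then «quoted» ("«" is C2 AB, "»" is C2 BB in UTF-8).
$cell = "\xEF\xBB\xBF\xC2\xABquoted\xC2\xBB";

// trim() also strips the trailing BB byte, leaving a dangling C2 (invalid UTF-8).
$trimmed = trim($cell, "\xEF\xBB\xBF");

// The prefix check removes only the three leading BOM bytes.
$stripped = strip_utf8_bom($cell);

var_dump($trimmed === "\xC2\xABquoted\xC2");       // bool(true)
var_dump($stripped === "\xC2\xABquoted\xC2\xBB");  // bool(true)
```

Using `ltrim()` instead of `trim()` would avoid touching the end of the string, although it still treats the three bytes as a set rather than a sequence.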
1
$headings = array();
$handle = fopen($_FILES["contacts_file"]["tmp_name"], "r");
$heading_data = fgetcsv($handle);
foreach ($heading_data as $heading) {
    // Remove control characters and all non-ASCII bytes
    // (this strips the BOM, but also accented letters)
    $heading = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $heading);

    array_push($headings, $heading);
}
James Ikubi
  • While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Yunnosch Oct 25 '21 at 09:32
  • This was helpful to me, due to the context - I'm processing the CSV file upload in Codeigniter and I needed a way to strip the BOM from the first line of the file. – WojCup Jan 27 '22 at 14:06
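
Another option, along the lines of the earlier comment about testing the first bytes, is to skip the BOM on the file handle before calling `fgetcsv()`, so no cell needs cleaning afterwards. This is only a sketch: `open_csv_without_bom()` is a hypothetical helper, the temporary file stands in for the uploaded `tmp_name`, and it assumes the file is UTF-8 (a UTF-16 file would need converting, not stripping).

```php
<?php
// Hypothetical helper: open a CSV and leave the handle positioned
// just past a UTF-8 BOM, if the file starts with one.
function open_csv_without_bom(string $path)
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return false;
    }
    // Read the first three bytes; rewind when they are not a BOM.
    if (fread($handle, 3) !== "\xEF\xBB\xBF") {
        rewind($handle);
    }
    return $handle;
}

// Demo: a temporary file stands in for $_FILES[...]["tmp_name"].
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "\xEF\xBB\xBFname,email\nAna,ana@example.com\n");

$handle = open_csv_without_bom($path);
// All parameters are passed explicitly; newer PHP versions warn
// when the escape argument is left at its default.
$headings = fgetcsv($handle, 0, ',', '"', '\\');
fclose($handle);
unlink($path);

print_r($headings);
```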
1

$columns[0] = preg_replace('/\xEF\xBB\xBF/', '', $columns[0]);

or

$columns[0] = trim($columns[0], "\xEF\xBB\xBF");

  • 1
    Welcome to Stack Overflow! While this code may solve the question, [including an explanation](//meta.stackexchange.com/q/114762) of how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. Remember that you are answering the question for readers in the future, not just the person asking now. Please [edit] your answer to add explanations and give an indication of what limitations and assumptions apply. – Yunnosch Oct 25 '21 at 09:29
  • Doing so please especially elaborate the functional difference to the two existing answers https://stackoverflow.com/a/68841510/7733418 and https://stackoverflow.com/a/54145162/7733418 Because this post does not seem to add anything. – Yunnosch Oct 25 '21 at 09:30