
When I upload a text file (*.txt) with a PHP upload script and send it to the DB, there are a lot of unwanted characters. They don't show up on the screen but appear as � in the database (after each normal character).

This is the text I am uploading:

File                test_02
Date                15. Juni 2018
Start of Meas.      11:09
Tester              
Probe/Test Force    Sono50/50N
Probe-SN            777
Dwell Time          0 sec
Material table      Steel   A1
Norm; HV            EN ISO 18265
Adjustment File     Unnamed
Adj. Number         0
Limits              Off
Number              4
Mean                773,0   HV
Std. Deviation      9,5 HV  1,2%
Maximum             785,0   HV
Minimum             763,8   HV
R                   21,2    HV  2,7%
Cp
Cpk

1           763,8   HV
2           785,0   HV
3           775,8   HV
4           767,1   HV

So I have written some code to clean it up, but now I am missing some crucial spaces. Where did it go wrong, and how can I correct this?

$lines = file($_FILES['uploaded']['tmp_name']); // read the file into an array, one element per line

print_r gives

Array ( [0] => ��File test_02 [1] => Date 15. Juni 2018 [2] => Start of Meas. 11:09 [3] => Tester [4] => Probe/Test Force Sono50/50N [5] => Probe-SN 777 [6] => Dwell Time 0 sec [7] => Material table Steel A1 [8] => Norm; HV EN ISO 18265 [9] => Adjustment File Unnamed [10] => Adj. Number 0 [11] => Limits Off [12] => Number 4 [13] => Mean 773,0 HV [14] => Std. Deviation 9,5 HV 1,2% [15] => Maximum 785,0 HV [16] => Minimum 763,8 HV [17] => R 21,2 HV 2,7% [18] => Cp [19] => Cpk [20] => [21] => 1 763,8 HV [22] => 2 785,0 HV [23] => 3 775,8 HV [24] => 4 767,1 HV [25] => ) 1

This is my trick: convert spaces to underscores, strip every character that is not on my whitelist, and then collapse each run of underscores into a single space.

<?php
// convert spaces to underscore
$lines_01 = str_replace(' ', '_', $lines[1]);
$lines_02 = str_replace(' ', '_', $lines[2]);
$lines_04 = str_replace(' ', '_', $lines[4]);
$lines_05 = str_replace(' ', '_', $lines[5]);
$lines_06 = str_replace(' ', '_', $lines[6]);
$lines_07 = str_replace(' ', '_', $lines[7]);
$lines_08 = str_replace(' ', '_', $lines[8]);
$lines_14 = str_replace(' ', '_', $lines[14]);
$lines_17 = str_replace(' ', '_', $lines[17]);
$lines_21 = str_replace(' ', '_', $lines[21]);
$lines_22 = str_replace(' ', '_', $lines[22]);
$lines_23 = str_replace(' ', '_', $lines[23]);
$lines_24 = str_replace(' ', '_', $lines[24]);

// remove unwanted characters and keep normal characters
$lines_01 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_01);
$lines_02 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_02);
$lines_04 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_04);
$lines_05 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_05);
$lines_06 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_06);
$lines_07 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_07);
$lines_08 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_08);
$lines_14 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_14);
$lines_17 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_17);
$lines_21 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_21);
$lines_22 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_22);
$lines_23 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_23);
$lines_24 = preg_replace('/[^A-Za-z0-9\,.:_]/', '', $lines_24);

// convert one or more underscores to a single space
$lines_01 = preg_replace('/_+/', ' ', $lines_01);
$lines_02 = preg_replace('/_+/', ' ', $lines_02);
$lines_04 = preg_replace('/_+/', ' ', $lines_04);
$lines_05 = preg_replace('/_+/', ' ', $lines_05);
$lines_06 = preg_replace('/_+/', ' ', $lines_06);
$lines_07 = preg_replace('/_+/', ' ', $lines_07);
$lines_08 = preg_replace('/_+/', ' ', $lines_08);
$lines_14 = preg_replace('/_+/', ' ', $lines_14);
$lines_17 = preg_replace('/_+/', ' ', $lines_17);
$lines_21 = preg_replace('/_+/', ' ', $lines_21);
$lines_22 = preg_replace('/_+/', ' ', $lines_22);
$lines_23 = preg_replace('/_+/', ' ', $lines_23);
$lines_24 = preg_replace('/_+/', ' ', $lines_24);

// remove unwanted text
$lines_01 = str_replace('Date ', '', $lines_01);
$lines_02 = str_replace('Start of Meas. ', '', $lines_02);
$lines_04 = str_replace('ProbeTest Force ', '', $lines_04);
$lines_05 = str_replace('ProbeSN ', '', $lines_05);
$lines_06 = str_replace('Dwell Time ', '', $lines_06);
$lines_07 = str_replace('Material table ', '', $lines_07);
$lines_08 = str_replace('Norm HV', '', $lines_08);
$lines_14 = str_replace('Std. Deviation ', '', $lines_14);
$lines_17 = str_replace('R ', '', $lines_17);
$lines_21 = str_replace('1 ', '', $lines_21);
$lines_22 = str_replace('2 ', '', $lines_22);
$lines_23 = str_replace('3 ', '', $lines_23);
$lines_24 = str_replace('4 ', '', $lines_24);
?>

On the left is what is sent to the DB; on the right is what I would like. Please advise.

[Image: current database content (left) versus desired result (right)]

Muiter

2 Answers


It looks like your text file has a BOM (https://en.wikipedia.org/wiki/Byte_order_mark) at the very beginning, in the first two bytes.

You can check this with the xxd utility (available on Unix/Linux and under Cygwin; online hex viewers may work as well).

Example:

xxd -l2 your_file_here

would display fffe if your file starts with a UTF-16 (little-endian) byte order mark.

The same utility may also help you to determine what other 'junk' characters your file has. In this case, just use xxd your_file_here, and see what you may have missed.
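
The same check can be made from PHP when xxd is not at hand. This is only a rough sketch: the helper name hexPreview() and the sample bytes are made up for illustration, and it assumes the upload is UTF-16 little-endian, which the � after every character suggests but does not prove.

```php
<?php
// Hypothetical helper: return the first $length bytes of a file as hex,
// similar in spirit to `xxd -l`.
function hexPreview(string $path, int $length = 16): string
{
    $bytes = file_get_contents($path, false, null, 0, $length);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return bin2hex($bytes);
}

// Build a small sample file: BOM (FF FE) followed by "Fi" in UTF-16LE.
$sample = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($sample, "\xFF\xFE" . "F\x00i\x00");

$hex = hexPreview($sample, 6);
echo $hex, PHP_EOL; // fffe46006900
unlink($sample);
```

With a real upload you would pass $_FILES['uploaded']['tmp_name'] instead of the sample path. A leading fffe followed by a 00 after every letter would point to UTF-16LE rather than UTF-8.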

Typically, this is what produces the question marks: they stand for bytes that could not be interpreted as valid UTF-8 characters.

Programmatically, you may want to open your file in binary mode and fseek() 2 bytes forward when reading it. Alternatively, trim these bytes before processing the file, using a professional-grade editor such as UltraEdit in hexadecimal mode (Ctrl+H).
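
A minimal sketch of the binary-mode approach, assuming the BOM is the two bytes FF FE. The function name readWithoutBom() and the sample file are hypothetical, not part of the original upload script.

```php
<?php
// Hypothetical helper: read a file in binary mode and skip a leading
// UTF-16LE BOM (bytes FF FE) if one is present.
function readWithoutBom(string $path): string
{
    $handle = fopen($path, 'rb'); // 'b' = binary mode
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    // Peek at the first two bytes, then position the pointer either
    // just after the BOM or back at the start of the file.
    $hasBom = fread($handle, 2) === "\xFF\xFE";
    fseek($handle, $hasBom ? 2 : 0);
    $rest = stream_get_contents($handle);
    fclose($handle);
    return $rest;
}

// Sample file: BOM followed by "Fi" in UTF-16LE.
$sample = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($sample, "\xFF\xFE" . "F\x00i\x00");

$body = readWithoutBom($sample);
echo bin2hex($body), PHP_EOL; // 46006900
unlink($sample);
```

Note that the remaining bytes are still UTF-16 at this point: skipping the BOM removes the two leading � characters, but not the zero byte that follows each letter.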

Fabien Haddadi
  • Some editors will automatically add a BOM when they save a file. There should be an option to disable this if this is a problem. Sometimes in the Save As dialog itself, sometimes in the Settings dialog. – Fabien Haddadi Jun 19 '18 at 11:45
  • Thank you for your detailed explanation Fabien. – Muiter Jun 19 '18 at 16:22

Are both the txt file and your PHP script encoded as UTF-8 without BOM? How about the database, and the DB connection?

If you only need to do this once or twice, you can use substr() (http://php.net/substr) to cut the strings where you want to insert a character; otherwise, you should sort out the character encodings properly.
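
One way to sort out the encoding, sketched under several assumptions: the file is UTF-16LE, the iconv extension is available, and label and value are separated by a tab or by two or more whitespace characters. The function name parseReport() and the sample text are hypothetical. The whole buffer is converted before it is split into lines, because splitting raw UTF-16 bytes at "\n" would leave a stray zero byte at the start of each following line.

```php
<?php
// Hypothetical helper: convert UTF-16LE bytes (optionally starting with a
// BOM) to UTF-8 and return each non-empty line as a [label, value] pair.
function parseReport(string $raw): array
{
    // Drop the BOM if present, then convert the whole buffer at once.
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    }
    $utf8 = iconv('UTF-16LE', 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException('Input is not valid UTF-16LE');
    }

    $rows = [];
    foreach (preg_split('/\R/u', $utf8) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        // Assumed layout: label, then a tab or 2+ whitespace characters, then value.
        $parts = preg_split('/\s{2,}|\t/', $line, 2);
        // Collapse any remaining whitespace runs in the value to one space.
        $value = preg_replace('/\s+/', ' ', trim($parts[1] ?? ''));
        $rows[] = [$parts[0], $value];
    }
    return $rows;
}

// Sample input: two lines encoded as UTF-16LE with a BOM.
$text = "Probe/Test Force    Sono50/50N\r\nMean\t773,0\tHV\r\n";
$sample = "\xFF\xFE" . iconv('UTF-8', 'UTF-16LE', $text);

$rows = parseReport($sample);
foreach ($rows as [$label, $value]) {
    echo $label, ' => ', $value, PHP_EOL;
}
```

Because the text is valid UTF-8 after conversion, characters such as "/" and "-" no longer need to be stripped, and the values can be stored as they are, provided the database column and connection also use UTF-8.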

Vörös Imi