I have a table whose fields are separated by spaces and tabs, and I need to separate the fields with semicolons instead. I tried awk directly but it didn't work. I am using a Perl script that does this for ASCII-style tables drawn with pipes and underscores, but it can't do the same job when the input lacks those characters.

Name full                  CI       FG   AG DG Date (UTC) Virnia Ray  
34842865 093161455 -    -     2019-07-12T12:09:31.378Z Vitoxia Sureez 
40151215 094063155 36.3   -     2019-07-14T13:18:11.733Z

Already tried

 sed -e 's/^[ t]*//' -e 's/ /\;/g'

to remove all the spaces

Perl script originally written by L. Scott to convert ASCII-styled tables:

 while(<>) {
     @vals = split / /; # split fields into the val array taking space separator
     $size = @vals;
     for( $i = 0 ; $i < $size ; $i++ )
     {
         #clean up the values: remove underscores and extra spaces in the fields and remove possible semicolons there
         $vals[$i] =~ s/_/ /g;
         $vals[$i] =~ s/;/ /g;
         $vals[$i] =~ s/^ *//;
         $vals[$i] =~ s/ *$//;

         # append the value to the data record for this field
         $data[$i] .= $vals[$i];

         # special handling for first field: use spaces when joining
         $data[$i] .= " " if ($i==0); #do not know if this is necessary to the new requirement as we have space in more than the first field.

     }
    if(/\R/)  # Taking carriage return as the end of record
     {
         # clean up the first record; trim spaces
         $data[0] =~ s/^ *//;
         $data[0] =~ s/ *$//;
         $data[3] =~ s/\..*//; # remove the point and decimal for the field four
         # join the records with semicolons
         $line = join (";", @data);

         # collapse multiple spaces
         $line =~ s/ +/ /g;

         # print this line and start over
         print "$line\n" unless ($line eq '');
         @data = ();
     } }

Expecting:

Name full;CI;FG;AG;DG;Date (UTC)
Virnia Ray;34842865;093161455;-;-;2019-07-12T12:09:31.378Z
Vitoxia Sureez;40151215;094063155;36;-;2019-07-14T13:18:11.733Z

Current output:

Name;full;;;;;;;;;;;;;;;;;;CI;;;;;;;FG;;;AG;DG;Date;(UTC)    
Virnia;Ray;;;;;;;;;;;;;;;;;;;34842865;093161455;-;;;;-;;;;;2019-07-1T12:09:31.378Z    
Vitoxia;Sureez;;;;;;;;;;;;;;;;;;40151215;094063155;36;;;-;;;;;2019-07-14T13:18:11.733Z

In some cases the first field looks like this:

Mar▒a Xatia Mecrdiz
M▒ndrz, yrcr▒a
cdcsurtmz at ruy opdx
lxtrb mxs2axs rl tsactfg
re xorts tdz drfod t     33743642 095518568 41   -     2019-06-12T13:48:40.200Z
zude def rtexetggacvc
opyxo ae f▒xuda tcso
dxzdtctfgs ti x9mdfggfhh
sx 7dfgab, asvro oi sz op
dgeto jxgdmszdd.

In the first field I only need the data before the comma; everything after it should be dropped. As you can see, the rest of the row's data is not on the same line as the start of the name.
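The rule above (keep only the text before the first comma) can be sketched as a small Perl helper. `clean_name` is a hypothetical name, and replacing semicolons with spaces is an assumption carried over from the original script, so that a semicolon inside a name cannot be mistaken for a field separator.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a raw name field to the part before the first comma.
sub clean_name {
    my ($name) = @_;
    $name =~ s/,.*//s;          # drop the first comma and everything after it (across line breaks)
    $name =~ s/;/ /g;           # assumption: a semicolon in the name becomes a space
    $name =~ s/\s+/ /g;         # collapse runs of whitespace, including newlines
    $name =~ s/^\s+|\s+$//g;    # trim both ends
    return $name;
}

print clean_name('Mpa Fagun;Mor@asd. Prq*yqesla, LEllal4331'), "\n";
# prints: Mpa Fagun Mor@asd. Prq*yqesla
```

This only works once the whole name field is available as one string; it does not by itself rejoin a name that html2text has wrapped over several lines.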

The original data comes from HTML that was parsed by html2text. The original code is:

  <b>Mon Jul 05 2019</b><hr><table style="border: 1px solid
#dddddd;border-collapse: collapse;text-align: left;"><tr><th style="padding: 8px;background-color: #cce6ff">Name Full</th><th style="padding: 8px;background-color: #cce6ff">FG</th><th style="padding: 8px;background-color: #cce6ff">CG</th><th style="padding: 8px;background-color: #cce6ff">AG</th><th style="padding: 8px;background-color: #cce6ff">MG</th><th style="padding: 8px;background-color: #cce6ff">Date (UTC)</th><tr><th style="padding: 8px;background-color: #dddddd">Mrída Xatia Mecrdiz Míndrz, yrcrría cdcsurtmz at ruy opdxlxtrb mxs2axs rl tsactfgre xorts tdz drfod t   zude def rtexetggacvcopyxo ae féxuda tcsodxzdtctfgs ti x9mdfggfhhsx 7dfgab, asvro oi sz op
    dgeto jxgdmszdd.</th><th style="padding: 8px;background-color: #dddddd">33743642</th><th style="padding: 8px;background-color: #dddddd">095518568</th><th style="padding: 8px;background-color: #dddddd">41</th><th style="padding: 8px;background-color: #dddddd">-</th><th style="padding: 8px;background-color: #dddddd">2019-05-12T13:48:40.200Z</th></tr><tr><th style="padding: 8px;">Cdlga foxa</th><th style="padding: 8px;">45285726</th><th style="padding: 8px;">092641968</th><th style="padding: 8px;">28</th><th style="padding: 8px;">-</th><th style="padding: 8px;">2019-06-11T13:50:52.091Z</th></tr></table>

Maybe there is some utility to use instead of html2text that would do this work better, directly from the rendering tool.
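One possible route is to skip html2text and read the cells straight from the HTML. The following is a minimal sketch using only core Perl; it assumes every cell is a `<th>` or `<td>` without nested tags and that each row starts with `<tr>`, as in the sample. `table_to_rows` is a hypothetical name, entities are not decoded, and a real parser from CPAN (for example HTML::TableExtract) would be safer for less regular markup.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one HTML table into semicolon-separated lines.
sub table_to_rows {
    my ($html) = @_;
    my @rows;
    # Split on row starts, because the sample header row has no closing </tr>.
    for my $row (split m{<tr\b[^>]*>}i, $html) {
        my @cells = $row =~ m{<t[hd]\b[^>]*>(.*?)</t[hd]>}gis;
        next unless @cells;          # skips the text before the first row
        for (@cells) {
            s/;/ /g;                 # a semicolon inside a cell would break the output
            s/\s+/ /g;               # join text that is wrapped over several lines
            s/^\s+|\s+$//g;          # trim both ends
        }
        $cells[0] =~ s/\s*,.*//;     # name: keep only the text before the first comma
        $cells[3] =~ s/\..*// if defined $cells[3];   # fourth column: drop decimals
        push @rows, join ';', @cells;
    }
    return @rows;
}

# Shortened sample in the same shape as the table in the question.
my $html = <<'HTML';
<table><tr><th style="padding: 8px">Name Full</th><th>FG</th><th>CG</th><th>AG</th><th>MG</th><th>Date (UTC)</th>
<tr><th>Cdlga foxa, extra
 text</th><th>45285726</th><th>092641968</th><th>40.2</th><th>-</th><th>2019-06-11T13:50:52.091Z</th></tr>
</table>
HTML

print "$_\n" for table_to_rows($html);
```

To run it on a real file, the heredoc could be replaced by slurping the input, for example `my $html = do { local $/; <> };`.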

Here is the HTML table with more records:

<b>Mon Jul 05 2019</b><hr>
<table style="border: 1px solid #dddddd;border-collapse: collapse;text-align: left;"><tr>
<th style="padding: 8px;background-color: #cce6ff">Name Full</th>
<th style="padding:8px;background-color: #cce6ff">FG</th>
<th style="padding: 8px;background-color: #cce6ff">CG</th>
<th style="padding: 8px;background-color: #cce6ff">AG</th>
<th style="padding: 8px;background-color: #cce6ff">MG</th>
<th style="padding: 8px;background-color: #cce6ff">Date (UTC)</th></tr>
<tr><th style="padding: 8px;">Mrída Xatia Mecrdiz Míndrz, yrcrría cdcsurtmz at ruy opdxlxtrb mxs2axs rl tsactfgre xorts tdz drfod t   zude def rtexetggacvcopyxo ae féxuda tcsodxzdtctfgs ti x9mdfggfhhsx 7dfgab, asvro oi sz op         dgeto jxgdmszdd.</th>
<th style="padding: 8px;">33743642</th>
<th style="padding: 8px;">095518568</th><th style="padding: 8px;">41</th><th style="padding: 8px;">-</th><th style="padding: 8px;">2019-05-12T11:47:01.240Z</th></tr>
<tr><th style="padding: 8px;background-color: #dddddd">Cdlga foxa</th>
<th style="padding: 8px;background-color: #dddddd">45285726</th>
<th style="padding: 8px;background-color: #dddddd">092641968</th>
<th style="padding: 8px;background-color: #dddddd">28</th>
<th style="padding: 8px;background-color: #dddddd">-</th>
<th style="padding: 8px;background-color: #dddddd">2019-06-11T11:48:51.806Z</th></tr>
<tr><th style="padding: 8px;">Qrala Xera</th>
<th style="padding: 8px;">33184756</th>
<th style="padding: 8px;">032178032</th>
<th style="padding: 8px;">-</th>
<th style="padding: 8px;">-</th>
<th style="padding: 8px;">2019-03-01T11:55:04.269Z</th></tr>
<tr><th style="padding: 8px;background-color: #dddddd">Mpa Fagun;Mor@asd. Prq*yqesla, LEllal4331</th>
<th style="padding: 8px;background-color: #dddddd">54324252</th>
<th style="padding: 8px;background-color: #dddddd">034021061</th>
<th style="padding: 8px;background-color: #dddddd">-</th>
<th style="padding: 8px;background-color: #dddddd">-</th>
<th style="padding: 8px;background-color: #dddddd">2019-04-12T11:58:15.349Z</th></tr>
<tr><th style="padding: 8px;">xOpàr '00083</th>
<th style="padding: 8px;">13702194</th>
<th style="padding: 8px;">197071330</th>
<th style="padding: 8px;">40.2</th>
<th style="padding: 8px;">-</th>
<th style="padding: 8px;">2019-07-15T12:00:28.617Z</th></tr>
<tr><th style="padding: 8px;background-color: #dddddd">Drlia >·xa1otta</th>
<th style="padding: 8px;background-color: #dddddd">34253138</th>
<th style="padding: 8px;background-color: #dddddd">394995572</th>
<th style="padding: 8px;background-color: #dddddd">68</th>
<th style="padding: 8px;background-color: #dddddd">-</th>
<th style="padding: 8px;background-color: #dddddd">2019-07-12T12:32:19.793Z</th></tr>
  • What does the output from the Perl script look like? – Mark Stewart Jul 15 '19 at 18:11
  • Already edited with this included, thanks! – User1234141414 Jul 15 '19 at 18:34
  • The names in your sample data are shown at the end of a row -- and the first one (`Virnia Ray`) is even in the header row. Is this a mistake in posting? Are the names meant to be at the beginning? (Correct that if it's wrong.) Also -- is it _always_ two words for the full name? – zdim Jul 15 '19 at 19:57
  • Seeing as your original table apparently comes from HTML, it would make a lot more sense to fix the original converter to use semicolon separators instead, rather than attempt to sort it out heuristically in a separate postprocessor after conversion. The Perl script I ad-hocked up in [your previous question](/questions/57010927/email-html-to-csv-file) might be a start. – tripleee Jul 16 '19 at 03:52
  • Yes tripleee, I think html2text can be reviewed to make this work better, but I have tested w3m and lynx in the same pipeline and found no way to get a good table (from the original HTML code) that would simplify the conversion to text. The best output I got was all the data with no spaces in between and all the "table" markup removed, but I guess rebuilding the table from that would be even more complex. – User1234141414 Jul 16 '19 at 11:17
  • Zdim, the table is pasted from the original source data. I thought that changing the number of spaces could be a problem, because the original source is Unicode. – User1234141414 Jul 16 '19 at 11:20
  • Tripleee, I have updated the post with the original HTML source data; maybe someone knows something better than html2text. – User1234141414 Jul 16 '19 at 13:45
  • Can you please post more of the original source HTML, maybe adding 2 or more entries? I'd like to see the pattern for the next entries. Also, to clarify, in this input -- Mrída Xatia Mecrdiz Míndrz, yrcrría cdcsurtmz at ruy opdxlxtrb mxs2axs rl tsactfgre xorts tdz drfod t zude def rtexetggacvcopyxo ae féxuda tcsodxzdtctfgs ti x9mdfggfhhsx 7dfgab, asvro oi sz op -- you only need "Mrída Xatia Mecrdiz Míndrz", right? – TheYeti Jul 17 '19 at 09:53
  • TheYeti, that's right: I need all the words before the comma. I am going to edit the question to add more HTML code with 2 more entries. – User1234141414 Jul 17 '19 at 11:13
  • In other cases the full name comes with special characters and the length of the data is unknown. Maybe the cut should be at the special character, with the character itself dropped too. – User1234141414 Jul 17 '19 at 11:37
  • Already updated the html code. – User1234141414 Jul 17 '19 at 18:47

1 Answer

I still don't have enough rep to comment, so I assumed that the names are composed of first name and last name (It says full name) and are not blank.

Code:

while (<>) {
    #Removes new line
    chomp;

    #Sets delimiter
    $delimiter = ";";

    #clean up text
    #removes _ and ;
    s![_;]!!g; 

    #removes leading spaces                 
    s!^ *!!;        

    #replace each run of whitespace (you mentioned it's delimited by space and tab) with ;
    #\s+ also covers a space followed by a tab, which (\s)\1* would turn into two delimiters
    s/\s+/$delimiter/g;
    
    #Remove Delimiter from the Date (UTC). /i ignores the case
    s/Date.+\(UTC\)/Date (UTC)/i;   

    #Removes delimiter in Full name
    s/^([^;]+);([^;]+)/$1 $2/i; 

    #Removes decimal on Field 4
    s/^([^;]+;[^;]+;[^;]+;)([^;.]+)\.?[^;]*/"$1$2"/e; 
    
    print "$_\n";
}

Output

Name full;CI;FG;AG;DG;Date (UTC)
Virnia Ray;34842865;093161455;-;-;2019-07-12T12:09:31.378Z
Vitoxia Sureez;40151215;094063155;36;-;2019-07-14T13:18:11.733Z

Note(s)

I only used ! in some of the regex to fix the syntax highlighting when using /
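For names with more than two words, one option is to match each data row from the right instead of joining the first two fields. This is only a sketch under assumptions: each record sits on a single line, the name comes first, and it is followed by two numeric fields, two fields that are a number or "-", and an ISO timestamp. `split_record` is a hypothetical name, and the header row is not handled here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the semicolon-separated record, or undef for
# lines that do not end in the five expected fields (for example the header).
sub split_record {
    my ($line) = @_;
    my ($name, @rest) = $line =~ m{
        ^\s*(.+?)                    # name: everything before the numeric fields
        \s+(\d+)\s+(\d+)             # CI and FG
        \s+(\S+)\s+(\S+)             # AG and DG, a number or "-"
        \s+(\d{4}-\d\d-\d\dT\S+)     # ISO timestamp
        \s*$
    }x or return;
    $name =~ s/\s*,.*//;             # keep only the text before the first comma
    $name =~ s/;/ /g;                # assumption: a semicolon in the name becomes a space
    $name =~ s/\s+/ /g;              # collapse inner whitespace
    $rest[2] =~ s/\..*//;            # AG: drop the decimal part
    return join ';', $name, @rest;
}

for my $line (
    'Vitoxia Sureez 40151215 094063155 36.3   -     2019-07-14T13:18:11.733Z',
    're xorts tdz drfod t     33743642 095518568 41   -     2019-06-12T13:48:40.200Z',
    'Name full                  CI       FG   AG DG Date (UTC)',
) {
    my $out = split_record($line);
    print defined $out ? "$out\n" : "skipped: $line\n";
}
```

Because the match is anchored on the trailing fields, the name may contain any number of words; a name that itself contains two consecutive all-digit words followed by matching fields would still be split in the wrong place.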

TheYeti
  • TheYeti, can you post how to handle the case where a data field (not the header) is blank? – User1234141414 Jul 16 '19 at 11:23
  • Another possible problem is that the first field has no character limit; it can contain a comma, a semicolon and spaces. – User1234141414 Jul 16 '19 at 11:52
  • As far as I can see, this code puts the semicolon at the second space of the full name. It only works when the full name has 2 words. Please read my edited question, which has some examples of this field. – User1234141414 Jul 16 '19 at 12:59