1

I have a long column of text data, which likes this:

apple
162
30.45%
newyork
red
2018-12-10  22:48
3.23
Nop12345
pear
20
14.56%
washington
green
2018-12-09  10:30
4.24
Nok45367

I want it tab delimited as the following and it can be dropped in excel with 8 columns:

apple   162 30.45%  newyork red 2018-12-10 12:48    3.23    Nop12345
pear    20  14.56%  washington  green   2018-12-09  10:30   4.24    Nok45367

I have used the command

awk '{ ORS = (NR%8 ? "\t" : RS) } 1' > output.txt

to handle this stuff, the output is just like the structure that I need above if you see the results on windows accessory notepad editor, however, the truth is that it is not the 8 column style when you see it with notepad++, or some other txt editor on linux, and the worse is that it is only presented with just 2 columns if you drop it in excel as this:

apple
    162
    30.45%
    newyork
    red
    2018-12-10 12:48
    3.23
    Nop12345
pear
    20
    14.56%
    washington
    green
    2018-12-09  10:30
    4.24
    Nok45367
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
james
  • 81
  • 7

4 Answers4

1

With awk since you have tried it, could you please try following. Append | column -t to following code in case you need output in TAB delimited form.

awk '
/[a-zA-z]+[0-9]+/{
  print val OFS $0
  val=""
  next
}
NF{
  val=(val?val OFS:"")$0
}
END{
  if(val){
    print val
  }
}'   Input_file

Also when I tried your code I was able to get correct output(added above solution as an alternate one), could you please check once if your Input_file is having control M characters in them by doing cat -v Input_file if yes then trey removing them by doing td -d '\r' < Input_file > temp_file && mv temp_file Input_file.

Explanation: Adding explanation of above code here.

awk '                       ##Starting awk program here.
/^[a-zA-z]+[0-9]+/{         ##Checking condition if a line is starting alphabets with digits then do following.
  print val OFS $0          ##Printing variable val with OFS and current line here.
  val=""                    ##Nullifying val here.
  next                      ##next will skip all further statements here.
}
NF{                         ##Checking condition if line is NOT BLANK then do following.
  val=(val?val OFS:"")$0    ##Creating variable val whose value keep concatenating its own value.
}
END{                        ##Starting END block of this awk code here.
  if(val){                  ##Checking condition in case variable val is NOT NULL then do following.
    print val               ##Printing variable val here.
  }
}'  Input_file              ##Mentioning Input_file name here.
RavinderSingh13
  • 130,504
  • 14
  • 57
  • 93
  • 1
    Could you reproduce the problem OP is facing? I couldn't. – oguz ismail Dec 13 '18 at 06:52
  • 1
    @oguzismail, yes I tried it and it worked for me. Let me check with OP if OP has control M characters in Input_file. – RavinderSingh13 Dec 13 '18 at 06:53
  • 1
    @james, your code works for me, could you please check if any if control M characters are present in your Input_file by using `cat -v Input_file` if yes then try to remove them by doing `tr -d '\r < Input_file > temp_file && mv temp_file Input_file` and let me know then. – RavinderSingh13 Dec 13 '18 at 06:55
  • 2
    @oguzismail, I added my suggestion for checking carriage returns too in my solution, thanks buddy for letting me know. – RavinderSingh13 Dec 13 '18 at 06:57
  • 1
    Thank you very much! I have tried the command tr -d '\r < Input_file > temp_file , to handle the input_file, and then awk '{ ORS = (NR%8 ? "\t" : RS) } 1' temp_file > output.txt , it really works, thanks buddy for the suggestions! – james Dec 14 '18 at 02:36
  • @james, you could give up-vote for helpful answers and can select this answer as correct answer too, cheers and happy learning. – RavinderSingh13 Dec 14 '18 at 02:46
0

input.txt enter image description here

Code

import os
import numpy as np
import pandas as pd
text_file = open("input.txt", "r")
n = 8
raw_data = text_file.read().split("\n")
data = np.array(raw_data)
data.shape = (len(data)//n, n)
df = pd.DataFrame(data)
df.to_excel("output.xlsx", index=False, header=False)

output.xlsx shown in Excel

enter image description here

Indominus
  • 1,228
  • 15
  • 31
  • thank you very much, I copy your code as python script -text2csv1.py, and run in the target directory, some error occurent: Traceback (most recent call last): File "text2csv1.py", line 8, in data.shape = (len(data)/n, n) ValueError: total size of new array must be unchanged – james Dec 14 '18 at 03:38
  • What is `len(data)` in your code? Btw, are you using the same input file as I did? I copied from your question exactly. – Indominus Dec 14 '18 at 04:14
  • Just quickly noticed you have `len(data)/n`, while my code is `len(data)//n`, `//` is the integer division. Could you try whether that solves the issue? Of course the whole thing assumes `len(data)` is a multiple of `n`, if not then we need to modified the code. – Indominus Dec 14 '18 at 04:27
  • Thanks buddy for your codes and explanation, yes, it is the same problem when I use len(data)//n , many thanks! – james Dec 14 '18 at 10:31
0

What you already have IS the right way to do what you want:

$ awk '{ORS=(NR%8 ? "\t" : RS)}1' file
apple   162     30.45%  newyork red     2018-12-10  22:48       3.23    Nop12345
pear    20      14.56%  washington      green   2018-12-09  10:30       4.24   Nok45367

but see Why does my tool output overwrite itself and how do I fix it? for the source of most "my output looks funny" issues.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 1
    thanks,Ed, you hit the point! I read the links you suggested, there are many options, rules, tricks......, Back to the question, your solution is the simplest! I will use sed 's/\x0D//' inputfile > tempfile, to delete carriage and then awk '{ ORS = (NR%8 ? "\t" : RS) } 1' tempfile > output.txt, it is well done! – james Dec 14 '18 at 02:48
  • Thanks but using `sed 's/\x0D//'` is actually about the worst way to handle this since it's lengthier and/or more complicated and/or less robust than the other solutions posted in my answer to that other question. – Ed Morton Dec 14 '18 at 02:51
  • 1
    Thanks,Ed, I will use it carefully. – james Dec 14 '18 at 10:34
0

Check this awk solution:

/tmp> cat james.txt
apple
162
30.45%
newyork
red
2018-12-10  22:48
3.23
Nop12345
pear
20
14.56%
washington
green
2018-12-09  10:30
4.24
Nok45367
/tmp> awk -F" "  '{ printf("%s\t",$0); if(NR%8==0) { printf("\n") } } ' james.txt
apple   162     30.45%  newyork red     2018-12-10  22:48       3.23    Nop12345
pear    20      14.56%  washington      green   2018-12-09  10:30       4.24    Nok45367
/tmp>

Thanks to Ed, shortening further

/tmp> awk '{ printf "%s\t",$0; if(NR%8==0) { print "" } } ' james.txt
apple   162     30.45%  newyork red     2018-12-10  22:48       3.23    Nop12345
pear    20      14.56%  washington      green   2018-12-09  10:30       4.24    Nok45367
/tmp>
stack0114106
  • 8,534
  • 3
  • 13
  • 38
  • 1
    thanks Ed.. Im addicted to that kind of usage :-).. I'll update the answer – stack0114106 Dec 13 '18 at 15:56
  • thanks stack, the result of your script is 8 column structure when you see it in gedit, but it actually not the case, when you drop it in excel, it resembles as the question asked, please see comments to Ed' suggestions. – james Dec 14 '18 at 03:43
  • Nope, I am running your codes on ubuntu, I am sorry about that I made a mistake of "... 8 column structure when you see it in gedit...", it should be "... 8 column structure when you see it in gvim...", your codes are excellent when james.txt have been proccessed with command sed 's/\x0D//' or tr -d '\r' to delete the carriage.After you doing that, you can use your codes to get the results, and more importantly, drop the result into LibreOffice calc, or excel, you will see the correct presented results! Thanks buddy for your suggestions.;) – james Dec 14 '18 at 10:14
  • Maybe you see the results in bash window, it is just like in gvim and gnu emacs, which are different from gedit,or excel which I wanted. – james Dec 14 '18 at 10:22