
My data was in this form. You can see that the 3rd column (2nd if you start counting at 0) touches the one before it when its values rise to the next order of magnitude, and that there are artifacts in the last column that come from non-data input being recorded.

17:10:39 2.039 26.84 4.6371E-9    -0.7$R200$O100
17:10:41 2.082 27.04 4.6334E-9    -0.4
17:10:43 1.980 26.97 4.6461E-9     0.3
17:10:45 2.031 26.87 4.6502E-9     1.0$R200
17:10:47 2.090 27.09 4.6296E-9     0.1
...
18:49:40 1.930226.34 2.8246E-5     7.1
18:49:42 2.031226.04 2.8264E-5     8.2

Now, I did fix this all by hand by adding a "|" delimiter instead of the " ", and by cutting away the few artifacts, but it was a pain.

So, with the prospect of getting even larger data sets from the same machine in the future, are there any tips on how to write a script in Python, or are there any Linux-based tools out there already, to fix this CSV / make a new, fixed CSV out of it?

  • Why not fix the "thing" that's creating the data set so that the delimiter is " " or "|" or something besides ""? – Jim Jun 20 '17 at 21:29
  • Can't. The "thing" is a delicate machine, and changing anything about its hardware or software would bring enormous calibration costs. There has been a new "thing 2.0" in the calibration process for a couple of years now, because the "thing" is old as beans by now, but the "thing 2.0" signal-noise issues are quite difficult right now, so the original "thing" + fixes will still be in use for some time. – Katpton Liamfuppinshire Jun 20 '17 at 21:48

2 Answers


In a Linux shell:

cut -c 1-14 data.csv > DataA
cut -c 15-49 data.csv > DataB
paste DataA DataB | tr -s " " "\t" > DataC
  1. The two cut commands split the file into two parts at the character position where the columns touch; the second part also stops before the unwanted trailing artifacts, cutting them away.
  2. paste joins the two parts back together (adding a tab between them), and tr -s " " "\t" squeezes the remaining runs of spaces into single tabs, so every field ends up tab-delimited.

Now, in case we'd want to stick to the "|" delimiter, the next step could be

cat DataC | tr -s "\t" "|" > DataFinal
rm DataA DataB DataC

But this is purely optional.
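
For even larger data sets, the same fixed-position split can also be done in a short Python script. A minimal sketch is shown below; it assumes (as in the cut commands above) that the merged columns touch at character 14 and that anything from a "$" onwards is an artifact to discard, and the file names are placeholders:

# fix_data.py - split at the position where the columns touch,
# drop the "$R.../$O..." artifacts, and write a "|"-delimited file.
with open("data.csv") as src, open("DataFinal", "w") as dst:
    for line in src:
        left = line[:14]                    # time stamp + 2nd column
        right = line[14:].split("$", 1)[0]  # rest of the line, artifacts removed
        fields = left.split() + right.split()
        dst.write("|".join(fields) + "\n")

This streams the file line by line, so it should cope with much larger data sets as well.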


The data you are showing is not CSV (or DSV), but plain-text data with fixed field widths. Trying to read it as CSV will be error-prone.

Instead, this data should be processed as fixed-width data with the following field widths:

8 / 6 / 6 / 10 (or 11) / 8 (or 7) / rest of line

See this question on how to parse fixed-width fields in Python.
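
A minimal sketch of that approach, slicing each line by the widths listed above, could look like this (the exact widths, the file names and the "|" output delimiter are assumptions based on the sample data in the question):

# Sketch: parse fixed-width fields by slicing each line.
# Assumed widths (from the sample): 8, 6, 6, 10, 8; the rest of the
# line holds the recording artifacts and is ignored.
widths = [8, 6, 6, 10, 8]

with open("data.csv") as src, open("data_fixed.csv", "w") as dst:
    for line in src:
        fields, pos = [], 0
        for w in widths:
            fields.append(line[pos:pos + w].strip())
            pos += w
        dst.write("|".join(fields) + "\n")

Run on the sample above, this should produce lines like 17:10:39|2.039|26.84|4.6371E-9|-0.7.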

– Danny_ds