Best Way to Clean Up Dataset

Question

I'm looking for the best way to clean up this data so that I can load it into a pandas DataFrame for further analysis. The data runs to several thousand lines; the first few are currently structured as follows:

  4  0 1.  1 37.0 10.000  0  3 18. 40.0  7  4  0.  0.  1.                   
  5  0 1.  0 27.0  4.000  0  4 14. 20.0  6  4  0.  0.  1.                   
 11  0 1.  0 32.0 15.000  1  1 12. 12.5  1  4  0.  0.  1.                   
 16  0 1.  1 57.0 15.000  1  5 18. 12.5  6  5  0.  0.  1.                   
 23  0 1.  1 22.0  0.750  0  2 17.  7.5  6  3  0.  0.  1.                   
 29  0 1.  0 32.0  1.500  0  2 17.  7.5  5  5  0.  0.  1.                   
 44  0 1.  0 22.0  0.750  0  2 12. 12.5  1  3  0.  0.  1.                   
 45  0 1.  1 57.0 15.000  1  2 14. 20.0  4  4  0.  0.  1.                   
 47  0 1.  0 32.0 15.000  1  4 16. 20.0  1  2  0.  0.  1.                   
 49  0 1.  1 22.0  1.500  0  4 14. 12.5  4  5  0.  0.  1.                   
 50  0 1.  1 37.0 15.000  1  2 20. 20.0  7  2  0.  0.  1.                   
 55  0 1.  1 27.0  4.000  1  4 18. 12.5  6  4  0.  0.  1.

The columns are as follows:

identifier, not used, constant, z1, z2, z3, z4, z5, z6, not used, z7, z8, yPT, not used, not used

Not every column in the data is significant, but that is something I can easily take care of once the data has been converted into a pandas DataFrame.

https://stackoverflow.com/questions/9652832/how-to-load-a-tsv-file-into-a-pandas-dataframe — treetopper, Dec 27 '19 at 16:36

score 1 · Answer 1 · answered Dec 27 '19 at 16:44

Try (you can also read it from a file, if it's easier for you):

import pandas as pd
input_='''
 4  0 1.  1 37.0 10.000  0  3 18. 40.0  7  4  0.  0.  1.                   
  5  0 1.  0 27.0  4.000  0  4 14. 20.0  6  4  0.  0.  1.                   
 11  0 1.  0 32.0 15.000  1  1 12. 12.5  1  4  0.  0.  1.                   
 16  0 1.  1 57.0 15.000  1  5 18. 12.5  6  5  0.  0.  1.                   
 23  0 1.  1 22.0  0.750  0  2 17.  7.5  6  3  0.  0.  1.                   
 29  0 1.  0 32.0  1.500  0  2 17.  7.5  5  5  0.  0.  1.                   
 44  0 1.  0 22.0  0.750  0  2 12. 12.5  1  3  0.  0.  1.                   
 45  0 1.  1 57.0 15.000  1  2 14. 20.0  4  4  0.  0.  1.                   
 47  0 1.  0 32.0 15.000  1  4 16. 20.0  1  2  0.  0.  1.                   
 49  0 1.  1 22.0  1.500  0  4 14. 12.5  4  5  0.  0.  1.                   
 50  0 1.  1 37.0 15.000  1  2 20. 20.0  7  2  0.  0.  1.                   
 55  0 1.  1 27.0  4.000  1  4 18. 12.5  6  4  0.  0.  1.
'''

# Split each non-blank row on runs of whitespace; every cell stays a string
input_arr = [row.split() for row in input_.splitlines() if row.strip()]

hdrs = ["identifier", "not used", "constant", "z1", "z2", "z3", "z4", "z5", "z6", "not used", "z7", "z8", "yPT", "not used", "not used"]

df = pd.DataFrame(data=input_arr, columns=hdrs)
# Drop the columns labelled "not used"
df = df[[col for col in df.columns if col != "not used"]]
print(df.head(10))

Output:

  identifier constant z1    z2      z3 z4 z5   z6 z7 z8 yPT
0          4       1.  1  37.0  10.000  0  3  18.  7  4  0.
1          5       1.  0  27.0   4.000  0  4  14.  6  4  0.
2         11       1.  0  32.0  15.000  1  1  12.  1  4  0.
3         16       1.  1  57.0  15.000  1  5  18.  6  5  0.
4         23       1.  1  22.0   0.750  0  2  17.  6  3  0.
5         29       1.  0  32.0   1.500  0  2  17.  5  5  0.
6         44       1.  0  22.0   0.750  0  2  12.  1  3  0.
7         45       1.  1  57.0  15.000  1  2  14.  4  4  0.
8         47       1.  0  32.0  15.000  1  4  16.  1  2  0.
9         49       1.  1  22.0   1.500  0  4  14.  4  5  0.
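
Note that the cells above are still strings (e.g. "1." and "18."), so they would need converting before any numeric analysis. Since the values are separated only by runs of spaces, the parsing can probably be left to pandas itself. The following is a minimal sketch under a few assumptions, not a definitive solution: the placeholder names unused1 to unused4 are made up here so that every column name is unique, and the two sample rows are copied from the question with trailing spaces trimmed. For a real file you would pass its path to read_csv in place of the StringIO object.

```python
import io
import pandas as pd

# Two sample rows from the question (trailing spaces trimmed for this sketch).
sample = """\
  4  0 1.  1 37.0 10.000  0  3 18. 40.0  7  4  0.  0.  1.
  5  0 1.  0 27.0  4.000  0  4 14. 20.0  6  4  0.  0.  1.
"""

# Hypothetical placeholder names for the "not used" columns, so names are unique.
names = ["identifier", "unused1", "constant", "z1", "z2", "z3", "z4", "z5",
         "z6", "unused2", "z7", "z8", "yPT", "unused3", "unused4"]
keep = [name for name in names if not name.startswith("unused")]

# sep=r"\s+" treats any run of whitespace as one delimiter;
# usecols drops the unused columns while reading.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None,
                 names=names, usecols=keep)
print(df.dtypes)
```

With this approach read_csv infers numeric dtypes on its own, so a value like "37.0" arrives as a float rather than a string.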

This has helped me out immensely, thank you! – the_contractor, Dec 27 '19 at 17:02
