The follow are two rows from my unlabeled dataset, a small subset:
random1 147 sub1 95 34 dewdfa3 15000 -1238 SBAASBAQSBARSBATSBAUSBAXBELAAX AAA:COL:UVTWUVWDUWDUWDWW BBB:COL:F CCC:COL:GTATGTCA DDD:COL:K20 EEE:COL:54T GGG:COL:-30.5 HHH:COL:000.1 III:COL:2 JJJ:COL:0
random2 123 sub1 996 12 kwnc239 10027 144 LBPRLBPSLBRDLBSDLBSLLBWB AAA:COL:UWTTUTUVVUWWUUU BBB:COL:F DDD:COL:CACGTCGG EEE:COL:K19 FFF:COL:HCC16 GGG:COL:873 III:COL:-77 JJJ:COL:0 KKK:COL:0 LLL:COL:1 MMM:COL:212
The first nine columns are consistent throughout the dataset, and could be labeled.
My problem is with the following columns. Each value in this row is then labeled with the column value first, e.g. AAA:COL:UVTWUVWDUWDUWDWW
is column AAA
, BBB:COL:F
is column BBB
, etc.
However, (1) each row does not have the same number of columns and (2) some columns are "missing". The first row is missing column FFF
, the second row skips column CCC
and HHH
.
Also, notice that the first row stops at column JJJ
, while the second column stops at column MMM
.
How would one allocate 9 + 13 columns of a dataframe, and parse these values such that if a column:value
pair didn't exist, this column would have a NaN
value.
Would something like pandas.read_table()
have the functionality for this?
This is the "correct" format for the first row:
random int sub int2 int3 string1 int4 int5 string2 AAA BBB CCC DDD EEE FFF GGG .... MMM
random1 147 sub1 95 34 dewdfa3 15000 -1238 SBAASBAQSBARSBATSBAUSBAXBELAAX UVTWUVWDUWDUWDWW F DFADFADFA K20 54T 'NaN' -30.5 ....'NaN'
Related (and unanswered) question here: How to import unlabeled and missing columns into a pandas dataframe?