2

The follow are two rows from my unlabeled dataset, a small subset:

random1 147 sub1    95  34  dewdfa3 15000   -1238   SBAASBAQSBARSBATSBAUSBAXBELAAX  AAA:COL:UVTWUVWDUWDUWDWW    BBB:COL:F   CCC:COL:GTATGTCA    DDD:COL:K20 EEE:COL:54T GGG:COL:-30.5   HHH:COL:000.1   III:COL:2   JJJ:COL:0   

random2 123 sub1    996 12  kwnc239 10027    144        LBPRLBPSLBRDLBSDLBSLLBWB    AAA:COL:UWTTUTUVVUWWUUU BBB:COL:F   DDD:COL:CACGTCGG    EEE:COL:K19 FFF:COL:HCC16   GGG:COL:873 III:COL:-77 JJJ:COL:0   KKK:COL:0   LLL:COL:1   MMM:COL:212

The first nine columns are consistent throughout the dataset, and could be labeled.

My problem is with the following columns. Each value in this row is then labeled with the column value first, e.g. AAA:COL:UVTWUVWDUWDUWDWW is column AAA, BBB:COL:F is column BBB, etc.

However, (1) each row does not have the same number of columns and (2) some columns are "missing". The first row is missing column FFF, the second row skips column CCC and HHH.

Also, notice that the first row stops at column JJJ, while the second column stops at column MMM.

How would one allocate 9 + 13 columns of a dataframe, and parse these values such that if a column:value pair didn't exist, this column would have a NaN value.

Would something like pandas.read_table() have the functionality for this?

This is the "correct" format for the first row:

random    int     sub    int2    int3    string1    int4    int5    string2                         AAA            BBB    CCC    DDD    EEE    FFF    GGG .... MMM
random1   147    sub1    95      34      dewdfa3    15000   -1238   SBAASBAQSBARSBATSBAUSBAXBELAAX  UVTWUVWDUWDUWDWW    F   DFADFADFA   K20 54T 'NaN' -30.5 ....'NaN'

Related (and unanswered) question here: How to import unlabeled and missing columns into a pandas dataframe?

Community
  • 1
  • 1
ShanZhengYang
  • 16,511
  • 49
  • 132
  • 234

1 Answers1

1

This will do it:

text = """random1 147 sub1    95  34  dewdfa3 15000   -1238   SBAASBAQSBARSBATSBAUSBAXBELAAX  AAA:COL:UVTWUVWDUWDUWDWW    BBB:COL:F   CCC:COL:GTATGTCA    DDD:COL:K20 EEE:COL:54T GGG:COL:-30.5    HHH:COL:000.1   III:COL:2  JJJ:COL:0   
random2 123 sub1    996 12  kwnc239 10027    144        LBPRLBPSLBRDLBSDLBSLLBWB    AAA:COL:UWTTUTUVVUWWUUU BBB:COL:F   DDD:COL:CACGTCGG    EEE:COL:K19 FFF:COL:HCC16   GGG:COL:873 III:COL:-77 JJJ:COL:0   KKK:COL:0   LLL:COL:1   MMM:COL:212"""

data = [line.split() for line in text.split('\n')]
data1 = [line[:9] for line in data]
data2 = [line[9:] for line in data]

# list of dictionaries from data2, where I parse the columns
dict2 = [[dict([d.split(':COL:') for d in d1]) for d1 in data2]

result = pd.concat([pd.DataFrame(data1),
                    pd.DataFrame(dict2)],
                   axis=1)

result.iloc[:, 9:]

enter image description here

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Above, I only included two rows. In actuality, there are thousands of rows. Could you edit the above to parse between rows? – ShanZhengYang Jul 25 '16 at 20:32
  • I showed it that way to show all data. I had transposed. Apparently, that wasn't a good idea. `result` is your answer. `result.iloc[:, 9:]` is just to show the challenging columns, and so I could print them. – piRSquared Jul 25 '16 at 20:38
  • Sorry, I misunderstood the second line of code. Thanks, I'm not sure why I was confused previously. – ShanZhengYang Jul 25 '16 at 20:58