
So I have a file which I can open via Python's read function, which returns one large string that essentially looks like a data frame, but is still just a large string. For example, it could look something like this:

1609441 test.test1.test3    1/15.34 -1  100 622 669
160441  test.test1.test3    2/11.101    -1  100 140216  177363
16041   test2.test8.test6   2/15.34 -1  100 2791    2346
160441  test.test7.test5    2/15.34 1   100 Bin Any 5   1794    2346
1609441 test4.test4.test4   2/15.34 1   100 E   Any 5   997 0
1642    test4.test3.test1   28.0.101    -1  100 5409155 10357332

If it were a real data frame, it would look like:

1609441 test.test1.test3    1/15.34   -1    100   622       669
160441  test.test1.test3    2/11.101  -1    100   140216    177363
16041   test2.test8.test6   2/15.34   -1    100   2791      2346
160441  test.test7.test5    2/15.34   1     100   Bin       A          5    1794    2346
1609441 test4.test4.test4   2/15.34   1     100   E         A          5    997     0
1642    test4.test3.test1   28.0.101  -1    1     155       7332

So as can be seen, the data varies a lot. Some rows have 10 fields, some only have 7, and so on. Again, this is a large text string, and I have tried read_csv and read_fwf, but I haven't really succeeded. Optimally it would just create a data frame with a fixed number of columns (I know the maximum number of columns), and if a row doesn't have a value for some column, just put a NaN there instead.

Can this be achieved in any way?

Denver Dang

1 Answer


I tried with read_csv and this looked like it worked:

import pandas as pd

t = '''1609441 test.test1.test3    1/15.34 -1  100 622 669
160441  test.test1.test3    2/11.101    -1  100 140216  177363
16041   test2.test8.test6   2/15.34 -1  100 2791    2346
160441  test.test7.test5    2/15.34 1   100 Bin Any 5   1794    2346
1609441 test4.test4.test4   2/15.34 1   100 E   Any 5   997 0
1642    test4.test3.test1   28.0.101    -1  100 5409155 10357332'''

with open('test.txt', 'w') as f:
    f.write(t)
    
pd.read_csv('test.txt', delim_whitespace=True, names=['1', '2', '3' ,'4', '5', '6' ,'7' ,'8', '9', '10'])

Does that not work with the full dataset?

(screenshot of the resulting data frame)

Robert King
  • I'm just wondering, can't this be done without having to save a `test.txt` file first? This seems like an unnecessary extra step, which makes everything take longer, because it has to save all these rows into a file on the hard drive instead of just keeping them in RAM. – Denver Dang Dec 09 '20 at 07:04
  • You can use the `StringIO` method described in this answer: https://stackoverflow.com/questions/22604564/create-pandas-dataframe-from-a-string – Robert King Dec 11 '20 at 15:59
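For reference, a minimal sketch of that `StringIO` approach applied to the sample data from the question, with no intermediate file on disk. It uses `sep=r'\s+'` (equivalent to `delim_whitespace=True` in older pandas) and placeholder column names; rows with fewer fields than the `names` list are padded with NaN on the right:

```python
import io
import pandas as pd

# The raw text that would otherwise come from f.read() (three sample rows
# from the question: the first has 7 fields, the last has 10)
t = '''1609441 test.test1.test3    1/15.34 -1  100 622 669
160441  test.test1.test3    2/11.101    -1  100 140216  177363
160441  test.test7.test5    2/15.34 1   100 Bin Any 5   1794    2346'''

# Wrap the string in a file-like buffer so read_csv parses it straight from RAM
df = pd.read_csv(
    io.StringIO(t),
    sep=r'\s+',                                # split on any run of whitespace
    names=[str(i) for i in range(1, 11)],      # fixed set of 10 columns
)

print(df.shape)  # 3 rows x 10 columns; short rows end in NaN
```

The key point is that `io.StringIO` makes the in-memory string look like an open file, so any pandas reader accepts it directly.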