Reading text file with varying number of columns in Python

Question

I am trying to use pandas to read a .txt file that contains string entries. Different rows in this file have different numbers of columns. The file can be found here.

This is how I am trying to read the file.

pd.read_csv('file.txt', sep=r'\s+', header=None).values[:,1:].astype('<U100')

I get the following error when I use the above method to read the file:

ParserError: Error tokenizing data. C error: Expected 82 fields in line 4, saw 85

I read this Stack Overflow post and then tried this method:

pd.read_csv('file.txt', error_bad_lines=False, sep=r'\s+', header=None).values[:,1:].astype('<U100')

The above method doesn't give any errors, but now multiple rows are skipped when the file is read. Is there any way to read the whole file (all rows) without errors?

I would use `with open("file.txt") as f:` – Andrew Nov 29 '18 at 22:03
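One way to act on this suggestion before choosing a parser is to open the file directly and count how many fields each line has. The following is only a minimal sketch: it assumes the fields are tab-separated, the helper name `field_counts` is hypothetical, and the in-memory sample stands in for the real `file.txt`.

```python
from collections import Counter
import io


def field_counts(lines, sep="\t"):
    """Return a Counter mapping number-of-fields -> number of lines."""
    return Counter(len(line.rstrip("\n").split(sep)) for line in lines)


# Hypothetical stand-in for file.txt: three tab-separated rows of unequal length.
sample = io.StringIO("TF: a\tg1\tg2\nppi_1\tg3\nTF: b\tg4\tg5\tg6\n")

counts = field_counts(sample)
print(counts)       # how many lines have each width
print(max(counts))  # widest row, i.e. the number of columns a table would need
```

For a real file, the same function can be called as `field_counts(f)` inside a `with open('file.txt') as f:` block. If the widths vary a lot, a fixed-width parser call is unlikely to read every row.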

keithpjolley · Answer 1 · 2018-11-29T22:40:03.990

1

This chucks a lot of the data (from 695 lines down to 474). But that file is garbage anyway. Best to preprocess it before it comes into Python.

In [20]: df = pd.read_csv("/tmp/file.txt", delim_whitespace=True, error_bad_lines=False, warn_bad_lines=False, header=None)

In [21]: df.shape
Out[21]: (474, 82)
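Note for newer pandas versions: `error_bad_lines` and `warn_bad_lines` were deprecated in pandas 1.3 in favor of `on_bad_lines` and removed in pandas 2.0, and `delim_whitespace` has since been deprecated in favor of `sep=r"\s+"`. A rough equivalent of the call above might look like the sketch below. The in-memory sample is a hypothetical stand-in for the real file, and skipping still throws away every row that has more fields than the first one.

```python
import io

import pandas as pd

# Hypothetical stand-in for file.txt: the second row has one field too many.
sample = "a b c\nd e f g\nh i j\n"

# on_bad_lines="skip" (pandas >= 1.3) silently drops rows with too many fields.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, on_bad_lines="skip")

print(df.shape)  # the over-long second row is missing from the result
```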

edited Nov 29 '18 at 22:40

answered Nov 29 '18 at 22:03

keithpjolley


Thanks for the post, keithpjolley. I tested it. Unfortunately, the suggested method doesn't work on the file that I have (the link to it is given in my post). I get a DataFrame of shape `(695, 1)`. – Siddharth Satpathy Nov 29 '18 at 22:07
updated using the original `file.txt`. not sure there's much to salvage. – keithpjolley Nov 29 '18 at 22:40

score 1 · Accepted Answer · answered Nov 29 '18 at 22:45

You can use the readlines() method of the file object (open() returns an _io.TextIOWrapper) to read the file into a list of strings, one per line. Splitting each string on tabs gives a list of lists (one sublist for each line of the file). That's all pandas needs for building a DataFrame:

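import pandas as pd

# Read every line of the file; each string keeps its trailing newline.
with open('file.txt', 'r') as f:
    file_lines = f.readlines()

# Split each line on tabs; pandas pads the shorter rows with None.
keymap = pd.DataFrame([line.split('\t') for line in file_lines])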

This yields:

>>> keymap

             0         1         2         3         4          5         6    \
0    TF: onecut2     ttc14     zadh2      pygm    tiparp     mgat4a    man2a1   
1         ppi_28    cep135    zranb1      strn     stk24      strn3  fgfr1op2   
2         ppi_29     hspb1   rps6ka5       mbp    mapk13   mapkapk3    mapk11   
3    TF: pou2af1  slc25a12    zbtb25       unk      aif1     tmem54     apaf1   
4       TF: rara     kcnk4      gfer    trip10      cog6     srebf1     zgpat   
5         ppi_25      upf1     upf3a     rbm8a      xrn1       upf2      smg1   
6         ppi_26    eif4g3     eif4e    eif4a1   snora81     snord2    eif4a2   
7       TF: rarb     kcnk4      gfer    trip10      cog6     srebf1     zgpat   
8         ppi_20     traf3  nfatc2ip      cd40     traf2      traf1      ltbr   
9         ppi_21      bmp2    acvr2a      bmp7    acvr2b       bmp6     bmpr2   
10      TF: rarg     kcnk4      gfer    trip10      cog6     srebf1     zgpat   
11        ppi_23     tgif2     rbbp8      rnf8    mre11a        nbn    recql5   
12    TF: pou5f1  slc25a12    zbtb25       unk      aif1     tmem54     apaf1   
13       TF: apc     rab34      lsm3     calm2      rbl1      gapdh     prkce   
14      TF: elf2   sdccag8    pbxip1      ctsw   slc35f2       rara    fermt3   
15      TF: elf4    fermt3   tmem204    s100a4      ager      ptpn6     kdm6b   
16        ppi_24    hspa1b    hspa1a      sox9    dnajc3      apaf1     brsk1   
17       ppi_148      drg1    ncapg2      tal1      lyl1      ncapg    ncapd2   
18    TF: topors     cnpy4      rcn3      rtn2      abi2      kcnd1     lmnb1   
19       ppi_146      upf1     upf3a     rbm8a      xrn1       upf2      smg1   
20       ppi_147    ube2v1    ube2v2      tyms    zranb2   atp6v1b2    sssca1   
21       ppi_144    srebf2    tada2b    insig2    srebf1      klf13    zbtb7c   
22       ppi_145     mthfr     naa38     dhx16      lsm1    pyroxd1      lsm2   
23       ppi_142     ntrk1     sgsm3   rasgrf1      bdnf  kidins220     ntrk2   
24       ppi_143     copb2     arcn1      arl1     copg2       copa     tapbp   
25       ppi_140     rap2a     rap2b    ralgds    pik3ca      rap1a   rapgef5   
26       ppi_141    cxcl10      irf1      irf5      irf3       irf7     stat2   
27       ppi_204   mir196b      pbx2    pknox1      pbx1      meis2     meis1   
28        ppi_27     acvr1      bmp2      bmp7     smad1       btg2     smad6   
29     TF: stat6      rhoc      rdh5    pbxip1      ctsw       rxrb     mitd1   
..           ...       ...       ...       ...       ...        ...       ...   
666    TF: smad4    ndufs8     ahdc1      tpp1   cables1       rxrb      acy1   
667    TF: smad5     ahdc1      acy1      rara  tctex1d4     wnt10b   tmem204   
668    TF: gata4    zbtb25       id2      sdhd     ube2b      ahdc1   arl6ip5   
669     TF: hsf2      cbx4     ppm1l    celsr3     hoxa7      kdm6b      fli1   
670    TF: gata2    zbtb25       id2     arl4a     dctn3      ube2b   arl6ip5   
671    TF: smad1     ahdc1      rxrb      acy1      rara   tctex1d4    wnt10b   
672    TF: smad2     ahdc1      rxrb      acy1      rara   tctex1d4    wnt10b   
673    TF: gata1      mefv    dnajb2      pck2    zbtb25       rac2       id2   
674    TF: nr1h4      exd1     epha1   c1qtnf6      gfer       ulk3      rxrb   
675     TF: rxrg     kcnk4      gfer    trip10      cog6     srebf1     zgpat   
676     TF: rxra      nol7      exd1    hspbp1     kcnk4   arhgef37     epha1   
677     TF: rxrb     kcnk4  arhgef37      gfer    baiap3     trip10      cog6   
678    TF: nr1h3    hspbp1     kcnk4      rdh5      kars     trip10      cog6   
679    TF: ascl1     jmjd8   zc3h12a   ptprcap    ube2j2    tmem204   slc34a3   
680     TF: rest       acd      lhx3   gripap1     l1cam      hhatl   ptprcap   
681     TF: nfic    eif4g3    il10rb      gfer       nyx    arl6ip5   mettl10   
682     TF: crem    pitpna       acd      gfer   fam131a       tpp1     fscn1   
683      ppi_208  hist1h4c  hist1h4f  hist1h4d  hist1h4k   hist1h4j  hist1h4i   
684    TF: arntl      acy1    lrrc56   tmem204      zzz3      cirbp      fasn   
685    TF: nhlh1     smad6     brsk2   fam131a      idi1      f2rl1     ap4b1   
686     TF: myf6     jmjd8   zc3h12a   ptprcap    ube2j2    tmem204   slc34a3   
687   TF: stat5b      rdh5       ada   sdccag8    gpr182      casp2      ctsw   
688   TF: stat5a     rdh12     ttc32      rdh5       ada     pbxip1      tbx6   
689      TF: maz     jmjd8     ahdc1      rxrb      rara    slc34a3     cldn6   
690    TF: brca1     ahdc1      gps2  tctex1d4     cirbp       cbx4     ptpn6   
691     TF: hes1      tcf3    polr2l    lrrc56   tmem204       nck1    zfyve9   
692      TF: crx    trip10   fam131a      rxrb     ovol1     nfkbib    mrpl24   
693    TF: hand1   slc34a3     cirbp     ptpn6      fasn      kdm6b    zbtb7b   
694    TF: hand2   slc34a3     cirbp     ptpn6      fasn      kdm6b    zbtb7b   
695      TF: maf    dnmt3a     clcf1      acy1  tctex1d4      gapdh   plekhh3   

          7         8         9     ...          770    771      772     773  \
0      zswim5     tubd1   igf2bp3   ...         None   None     None    None   
1       sike1   cttnbp2     slmap   ...         None   None     None    None   
2     pla2g4a      atf2  mapkapk5   ...         None   None     None    None   
3        dok2    fam60a     rab4b   ...         None   None     None    None   
4        rxrb     clcf1    fyttd1   ...         None   None     None    None   
5        parn      edc4      dcp2   ...         None   None     None    None   
6       mknk1     pdcd4     mknk2   ...         None   None     None    None   
7        rxrb     clcf1    fyttd1   ...         None   None     None    None   
8       traf5  tnfrsf17  tnfrsf18   ...         None   None     None    None   
9      bmpr1a    bmpr1b      gdf9   ...         None   None     None    None   
10       rxrb     clcf1    fyttd1   ...         None   None     None    None   
11      rrm2b    fancd2   dclre1c   ...         None   None     None    None   
12       dok2    fam60a     rab4b   ...         None   None     None    None   
13       rrm1      irf4    actr1b   ...         None   None     None    None   
14     wnt10b   tmem204    s100a4   ...         None   None     None    None   
15     zbtb7b    rnf167    ppp1ca   ...         None   None     None    None   
16        mos      snrk     hsbp1   ...         None   None     None    None   
17     ncapd3      smc2      lmo1   ...         None   None     None    None   
18      agfg1   gtf2a1l     cbwd1   ...         None   None     None    None   
19       parn      slbp      dcp2   ...         None   None     None    None   
20      trip6     uchl3     usp9x   ...         None   None     None    None   
21     sec24b      scap    rnf139   ...         None   None     None    None   
22       lsm3     wdr44    echdc2   ...         None   None     None    None   
23       dok5      ngfr      shc2   ...         None   None     None    None   
24      copz2    sacm1l     copz1   ...         None   None     None    None   
25    rapgef6      mras    rasip1   ...         None   None     None    None   
26     pmaip1      mafb      irf9   ...         None   None     None    None   
27      hoxd9     hoxa9     hoxb1   ...         None   None     None    None   
28      bmpr2      zeb1     smad7   ...         None   None     None    None   
29      zadh2     snx13      cfl1   ...         None   None     None    None   
..        ...       ...       ...   ...          ...    ...      ...     ...   
666      rara  tctex1d4    wnt10b   ...         None   None     None    None   
667   slc34a3      grk6     kdm6b   ...         None   None     None    None   
668      rara    timm8b     daam1   ...         None   None     None    None   
669     taf10     armc5      zhx2   ...         None   None     None    None   
670      rxrb    mrpl49  tctex1d4   ...         None   None     None    None   
671    polr2l   tmem204   slc34a3   ...         None   None     None    None   
672    polr2l   tmem204   slc34a3   ...         None   None     None    None   
673    trip10      mxd3     arl4a   ...         None   None     None    None   
674      tpi1      rara     gapdh   ...         None   None     None    None   
675      rxrb     clcf1    fyttd1   ...         None   None     None    None   
676   c1qtnf6      gfer      rdh5   ...     mapkapk2  ptch1  creb3l4  rpl23a   
677    srebf1     zgpat      rxrb   ...         None   None     None    None   
678    srebf1    col7a1     tekt4   ...         None   None     None    None   
679     cirbp     ptpn6      fasn   ...         None   None     None    None   
680      ppa1      gpr6      syt6   ...         None   None     None    None   
681      rara     gapdh     atg9a   ...         None   None     None    None   
682  pafah1b1      mlf2    wnt10b   ...         None   None     None    None   
683  hist1h4h  hist1h4b  hist1h3c   ...         None   None     None    None   
684     kdm6b    cpsf3l     pprc1   ...         None   None     None    None   
685    zfyve9   slc34a3      syt6   ...         None   None     None    None   
686     cirbp     ptpn6      fasn   ...         None   None     None    None   
687      gmfg     vps53     ptpn6   ...         None   None     None    None   
688     casp2     cxcr2      ctsw   ...         None   None     None    None   
689      cbx4     thoc6    isyna1   ...         None   None     None    None   
690    isyna1     rnf44     hoxa7   ...         None   None     None    None   
691   slc34a3     cirbp      cbx4   ...         None   None     None    None   
692     cnot4    fbxl19    zbtb7b   ...         None   None     None    None   
693      pkn1     nr1d1    map2k3   ...         None   None     None    None   
694      pkn1     nr1d1    map2k3   ...         None   None     None    None   
695      klc1      il7r     kdm6b   ...         None   None     None    None   

      774      775     776    777    778      779  
0    None     None    None   None   None     None  
1    None     None    None   None   None     None  
2    None     None    None   None   None     None  
3    None     None    None   None   None     None  
4    None     None    None   None   None     None  
5    None     None    None   None   None     None  
6    None     None    None   None   None     None  
7    None     None    None   None   None     None  
8    None     None    None   None   None     None  
9    None     None    None   None   None     None  
10   None     None    None   None   None     None  
11   None     None    None   None   None     None  
12   None     None    None   None   None     None  
13   None     None    None   None   None     None  
14   None     None    None   None   None     None  
15   None     None    None   None   None     None  
16   None     None    None   None   None     None  
17   None     None    None   None   None     None  
18   None     None    None   None   None     None  
19   None     None    None   None   None     None  
20   None     None    None   None   None     None  
21   None     None    None   None   None     None  
22   None     None    None   None   None     None  
23   None     None    None   None   None     None  
24   None     None    None   None   None     None  
25   None     None    None   None   None     None  
26   None     None    None   None   None     None  
27   None     None    None   None   None     None  
28   None     None    None   None   None     None  
29   None     None    None   None   None     None  
..    ...      ...     ...    ...    ...      ...  
666  None     None    None   None   None     None  
667  None     None    None   None   None     None  
668  None     None    None   None   None     None  
669  None     None    None   None   None     None  
670  None     None    None   None   None     None  
671  None     None    None   None   None     None  
672  None     None    None   None   None     None  
673  None     None    None   None   None     None  
674  None     None    None   None   None     None  
675  None     None    None   None   None     None  
676  npff  prkcdbp  tmem25  bcl9l  ap2b1  klf15\n  
677  None     None    None   None   None     None  
678  None     None    None   None   None     None  
679  None     None    None   None   None     None  
680  None     None    None   None   None     None  
681  None     None    None   None   None     None  
682  None     None    None   None   None     None  
683  None     None    None   None   None     None  
684  None     None    None   None   None     None  
685  None     None    None   None   None     None  
686  None     None    None   None   None     None  
687  None     None    None   None   None     None  
688  None     None    None   None   None     None  
689  None     None    None   None   None     None  
690  None     None    None   None   None     None  
691  None     None    None   None   None     None  
692  None     None    None   None   None     None  
693  None     None    None   None   None     None  
694  None     None    None   None   None     None  
695  None     None    None   None   None     None  

[696 rows x 780 columns]
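The last populated cell of row 676 above reads `klf15\n` because `readlines()` keeps each line's newline, so the final field of every row carries it. A minimal sketch of one way to avoid that is to strip the line terminator before splitting. The in-memory sample below is a hypothetical stand-in for the real `file.txt`, and tab separation is assumed as in the code above.

```python
import io

import pandas as pd

# Hypothetical stand-in for file.txt: tab-separated rows of unequal length.
sample = io.StringIO("TF: a\tg1\tg2\nppi_1\tg3\nTF: b\tg4\tg5\tg6\n")

# Strip the line terminator before splitting so it never ends up in a cell.
rows = [line.rstrip("\r\n").split("\t") for line in sample]

# pandas fills the shorter rows with missing values, as in the output above.
keymap = pd.DataFrame(rows)

print(keymap.shape)
print(repr(keymap.iloc[2, 3]))  # last field of the longest row, without a newline
```

With a real file, the list comprehension can iterate over the open file object directly, which also avoids holding a separate list of raw lines in memory.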

I hope this helps! Best!

D.
