Reading a complex, large text file

Question

I have a very large text file that I am trying to load into a Jupyter notebook for analysis.

But I can't find a way to separate the columns. So far I have only worked with HDF5 and CSV files, which are relatively easy to get the hang of.

I will attach a link to the data below,

https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-022-04496-5/MediaObjects/41586_2022_4496_MOESM3_ESM.txt

import pandas as pd

df1 = pd.read_csv('41586_2022_4496_MOESM3_ESM.txt', delimiter='\t')
print(df1.head(2))

result

       1    331.581577     -1.512106  17.774   2.143  -0.828   0.132     104.93    1092.57      45.54     7.355     1.359    -1.468     267695571003410291                   20111024-F5902-01-061    26.9  5520.3    40.0    3.951    0.116    1.581    0.430    2.296    0.188    0.339    0.041
0       2    332.300352     -1.566708   6.780   0...                                                                                                                                                                                                                                              
1       3    331.985497     -1.371940  18.426   1...

Thanks in advance :)

I mean that once I load the data in, I expect to see 26 distinct columns, each corresponding to a particular parameter such as age, age uncertainty, etc., but all of it is bunched up in a single column. — OverflownOverflow, Jul 02 '23 at 19:25
You probably didn't specify the delimiter. What's the delimiter in your `csv` file? You can specify one like this: `pd.read_csv('paths.txt', delimiter="|")`. I can't open your link since it's broken. You should edit your question to state that there should be 26 columns. The file size, Jupyter, etc. are irrelevant to your question. — doneforaiur, Jul 02 '23 at 19:26
I get a ParserError when I try to specify a delimiter, and sorry! Here is the link that should work: — OverflownOverflow, Jul 02 '23 at 20:47
https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-022-04496-5/MediaObjects/41586_2022_4496_MOESM3_ESM.txt — OverflownOverflow, Jul 02 '23 at 20:47
There are a bunch of commented lines. Delete them by hand and try `delimiter="\t"`, since there seems to be a tab between the columns. — doneforaiur, Jul 02 '23 at 20:49
Please include text as text, not images and especially not links to images. And please include a sample of the data format within the post, SO posts have to be self contained. — cafce25, Jul 02 '23 at 21:00
Ahh yes, I've tried that :( I edited the question and attached an image of what I get — OverflownOverflow, Jul 02 '23 at 21:01
@OverflownOverflow next time you ask question, please post the code and the result as I edited in your question. — Constantin Hong, Jul 02 '23 at 21:28
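
One way to settle the delimiter question raised in the comments is to print the first few raw lines with `repr()`, which shows a tab as `\t` and leaves spaces visible. The sketch below is only an illustration: the `sample` text is a made-up stand-in for the real file, and the leading `#` comment line is an assumption about how the commented lines look.

```python
import io
from itertools import islice

# Made-up stand-in for the real file; the "#" comment style is an assumption.
sample = io.StringIO(
    "# hypothetical comment line\n"
    "     1    331.58   -1.51\n"
    "     2    332.30   -1.57\n"
)

# Read only the first few lines so this stays cheap on a very large file.
raw_lines = list(islice(sample, 5))

# repr() makes invisible characters explicit: a tab would appear as \t.
for line in raw_lines:
    print(repr(line))

uses_tabs = any("\t" in line for line in raw_lines)
print("tabs found:", uses_tabs)
```

For the actual data, the same loop could be run on `open('41586_2022_4496_MOESM3_ESM.txt')` in place of `sample`.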

Constantin Hong · Accepted Answer · 2023-07-02T21:18:23.373

There is no tab in your file; the columns are separated by runs of spaces. Change the delimiter to match.

import pandas as pd

# https://stackoverflow.com/a/19633103/20307768
# r'\s+' matches one or more whitespace characters; the match is greedy,
# so each run of spaces counts as a single separator.
df1 = pd.read_csv('41586_2022_4496_MOESM3_ESM.txt', delimiter=r'\s+')
df1.head(2)
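
As a small self-contained sketch of the same idea, the snippet below parses an in-memory sample instead of the real file. The three rows reuse the first four values of the rows shown in the question; the `#` comment line and the `comment="#"` setting are assumptions about how the commented lines in the file are marked. `header=None` is also an addition here: the output in the question suggests the first data row was being used as the column header, which this avoids if the file really has no header row.

```python
import io
import pandas as pd

# In-memory sample; the "#" comment line is a hypothetical stand-in for the
# commented lines mentioned in the comments.
sample = io.StringIO(
    "# hypothetical header comment\n"
    "     1    331.581577     -1.512106  17.774\n"
    "     2    332.300352     -1.566708   6.780\n"
    "     3    331.985497     -1.371940  18.426\n"
)

df = pd.read_csv(
    sample,
    sep=r"\s+",     # any run of whitespace is one separator
    comment="#",    # assumption: commented lines start with "#"
    header=None,    # assumption: the file has no header row
)

print(df.shape)
print(df.head(2))
```

If the real file's comment lines use a different marker, `comment` would need to change accordingly, and column names could be supplied with the `names` parameter once the 26 parameters are known.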

edited Jul 02 '23 at 21:18

answered Jul 02 '23 at 21:12


Yay! This works perfectly, thank you so much!! – OverflownOverflow Jul 02 '23 at 21:24
