
I am working on a very large tab-delimited table (about 20000 rows * 30000 columns) stored as a .txt file, which makes the file larger than 20GB. Given the RAM on my desktop, there is no way I can read the file directly with `read.delim` in R or pandas in Python. I am currently thinking about making a smaller subset of the table based on a filter of the columns; the new text file would be 20000 rows * 1200 columns. I have already saved the names of my target columns in another file, so I guess my next step is to check whether the name of each column in the original file is on that list and subset the table accordingly. My question is: how can I do this filtering most concisely?

I know pandas has a chunking method, but it seems to me that it cuts the data frame into chunks of a particular number of rows. The bash script I had been working on was also slicing based on rows. I would really appreciate it if someone could suggest a concise way of filtering the large data frame based on column names instead of row names/numbers. I would prefer a solution in Python, but R or bash is also welcome.

  • Could you just parse it the old-fashioned way, one row at a time, and only keep the columns you want? – chumbaloo Oct 19 '20 at 12:06
  • Thanks for this suggestion. It should work, and I can try to write such a script. But I will keep the question open to check whether there is another solution, for example some built-in function I am not aware of. – Bin Oct 19 '20 at 12:09
  • 2
    Does this answer your question? [How do I read a large csv file with pandas?](https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas) – Andrew Oct 19 '20 at 12:10
  • 1
    You could only read the first line to get the header, than during reading with pandas `read_csv` you can specify `usecols` to read only part of the columns at once. – Stefan Oct 19 '20 at 12:12
  • Thanks for all your suggestions. The chunking question Andrew suggested is slightly different, since my filtering is on the columns. But by combining all the suggestions you have offered, I should be able to continue. – Bin Oct 19 '20 at 12:15
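Stefan's `usecols` suggestion can be sketched as follows. This is a minimal sketch, not tested on the actual data: the file names and the function name are placeholders, and it assumes the wanted column names are stored one per line. With a callable `usecols`, pandas still scans the whole file but only materializes the matching columns, so the 20000 * 1200 result fits comfortably in memory.

```python
import pandas as pd

def subset_columns(in_path, out_path, keep_path):
    """Keep only the columns of a tab-delimited file whose names
    are listed (one per line) in the file at keep_path."""
    with open(keep_path) as f:
        wanted = set(line.strip() for line in f)
    # usecols with a callable: pandas parses every line but stores
    # only the selected columns, so the full 30000-column table is
    # never held in memory at once.
    df = pd.read_csv(in_path, sep='\t', usecols=lambda name: name in wanted)
    df.to_csv(out_path, sep='\t', index=False)
```

If even the 1200-column subset is too large to build in one pass, this combines with `chunksize=` to read and append the subset in row blocks.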

1 Answer


Assuming the column you want is the 5th column, just process the file one line at a time as shown below. Since the file is tab-delimited, split on `'\t'` rather than `','`.

output = list()
with open('yourfile.txt', 'r') as _f:
    for line in _f:
        output.append(line.rstrip('\n').split('\t')[4])
print(output)
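Adapting this to the name-based filter in the question, one possible line-at-a-time sketch (file names and the function name are placeholders) looks up the wanted names in the header row once, then keeps only those positions in every subsequent row. Memory use stays constant regardless of file size:

```python
def filter_columns(in_path, out_path, wanted):
    """Stream a tab-delimited file, writing only the columns whose
    header name is in the set `wanted`. One line in memory at a time."""
    with open(in_path) as fin, open(out_path, 'w') as fout:
        header = fin.readline().rstrip('\n').split('\t')
        # Indices of the columns to keep, found once from the header.
        keep = [i for i, name in enumerate(header) if name in wanted]
        fout.write('\t'.join(header[i] for i in keep) + '\n')
        for line in fin:
            fields = line.rstrip('\n').split('\t')
            fout.write('\t'.join(fields[i] for i in keep) + '\n')
```

The resulting file, at roughly 1200/30000 of the original size, should then load normally with `read.delim` or pandas.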
chumbaloo