I have a huge gzip file (several GB) of tab-delimited text that I would like to parse into a pandas DataFrame.
If the contents of this file were an ordinary text string, one could simply use .split(), e.g.
file_text = """abc 123 cat 456 dog 678 bird 111 fish ...
moon 1969 revolution 1789 war 1927 reformation 1517 maxwell ..."""
data = [line.split() for line in file_text.split('\n')]
and then you could put the data into a pandas dataframe using
import pandas as pd
df = pd.DataFrame(data)
However, this isn't a plain-text document. It is a gzipped, tab-delimited file with several GB of data. What is the most efficient way to parse this data into a DataFrame using .split()?
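(For reference, pandas' own reader can handle both the gzip decompression and the tab-splitting; a minimal sketch, assuming a filename of 'data.tsv.gz' and no header row:)

import pandas as pd

# Sketch only: read_csv decompresses the gzip and splits on tabs itself.
# The filename 'data.tsv.gz' and header=None (no header row) are assumptions.
df = pd.read_csv('data.tsv.gz', sep='\t', compression='gzip', header=None)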
I guess the first step would be to use
import gzip
with gzip.open(filename, 'rt') as f:  # 'rt' gives decoded text rather than raw bytes
file_content = f.read()
and then call .split() on file_content, but reading all of those gigabytes into a single variable before splitting would be inefficient. Is it possible to do this in "chunks"?
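Something along these lines is roughly what I have in mind, reading the file line by line and building the DataFrame in chunks; a rough sketch (the filename, the 100,000-row chunk size, and the rstrip/concat details are assumptions on my part):

import gzip
import pandas as pd

chunk_size = 100_000   # assumed buffer size: rows collected before building a DataFrame
chunks, rows = [], []

# 'rt' opens the gzip in text mode, so each line arrives as a decoded string
with gzip.open('data.tsv.gz', 'rt') as f:
    for line in f:
        rows.append(line.rstrip('\n').split('\t'))
        if len(rows) >= chunk_size:
            chunks.append(pd.DataFrame(rows))
            rows = []
if rows:               # flush the final, partially filled chunk
    chunks.append(pd.DataFrame(rows))

df = pd.concat(chunks, ignore_index=True)

Is something like this the right approach, or is there a more efficient way to combine .split() with chunked reading?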