
I have a huge gzip file (several GB) of tab-delimited text which I would like to parse into a pandas DataFrame.

If the contents of this file were already a plain text string in memory, one would simply use .split(), e.g.

file_text = """abc   123   cat   456   dog   678   bird   111   fish   ...
moon   1969    revolution    1789   war   1927   reformation    1517    maxwell   ..."""

data = [line.split() for line in file_text.split('\n')]

and then you could put the data into a pandas DataFrame using

import pandas as pd
df = pd.DataFrame(data)

However, this isn't a plain text document. It is a gzipped tab-delimited file containing several GB of data. What is the most efficient way to parse this data into a DataFrame, using .split()?

I guess the first step would be to use

import gzip

# 'rt' opens the archive in text mode, so f.read() returns str rather than bytes
with gzip.open(filename, 'rt') as f:
    file_content = f.read()

and then use .split() on file_content, but reading all of those GB into a single variable and then splitting would be inefficient. Is it possible to do this in "chunks"?
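Something along these lines is roughly what I have in mind: read the gzipped file line by line and build a DataFrame for every block of lines (the chunk size of 100,000 lines is just an arbitrary placeholder), though I'm not sure it is the right approach:

import gzip
import pandas as pd

chunk_size = 100000  # arbitrary number of lines per chunk
rows = []
with gzip.open(filename, 'rt') as f:  # text mode, so each line comes back as str
    for i, line in enumerate(f, 1):
        rows.append(line.rstrip('\n').split('\t'))
        if i % chunk_size == 0:
            df = pd.DataFrame(rows)
            # ... process this chunk's DataFrame here ...
            rows = []
    if rows:  # the last, partial chunk
        df = pd.DataFrame(rows)
        # ... process the final chunk here ...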

ShanZhengYang
  • Are you saying you have a continuous string of data, without any line breaks, and you want to split this by tabs? – Burhan Khalid Sep 02 '16 at 05:34
  • what is wrong with `pd.read_csv('/path/to/file.csv.gz', sep='\t')` or `sep='\s*'` (if you have spaces instead of tabs as a delimiters)? Of course you can also use a `chunksize` parameter if you want to read your file in chunks... – MaxU - stand with Ukraine Sep 02 '16 at 05:35
  • @MaxU It is around 100 GB. Should I try this in chunks? I think pd.read_csv() could work, it just may not work all at once. This is my motivation for splitting the data and then importing. – ShanZhengYang Sep 02 '16 at 05:37
  • If the zipped file is 100 GB in size then you are highly unlikely to have enough RAM to fit the entire array into memory. You will probably need to split the data up into smaller manageable chunks, process each chunk separately, and then aggregate the data. – Dunes Sep 02 '16 at 05:54
  • @Dunes Exactly. How does one do this? – ShanZhengYang Sep 05 '16 at 01:16

1 Answer


read_csv() supports gzipped files, so you can simply do the following:

import pandas as pd

for chunk in pd.read_csv('/path/to/file.csv.gz', sep=r'\s*', chunksize=10**5):
    ...  # process chunk DF

If you are sure that you have a TSV (tab-separated file), you can use sep='\t' instead.
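If you need to keep part of every chunk and stitch the pieces back together at the end, a rough sketch (the dropna() step is just a placeholder for whatever per-chunk processing you actually do) could look like this:

import pandas as pd

pieces = []
for chunk in pd.read_csv('/path/to/file.csv.gz', sep='\t', chunksize=10**5):
    # reduce each chunk to what you really need, so the combined result fits in memory
    pieces.append(chunk.dropna())

df = pd.concat(pieces, ignore_index=True)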

MaxU - stand with Ukraine
  • Clarification: If I have a tsv file that is not gzipped, the above code snippet should work with only `sep = '\t'` replaced, correct? – ShanZhengYang Sep 08 '16 at 12:29
  • Also, what is the difference between `sep` in `pandas.read_csv()` and `file_text.split('\n')` in the code above? Is there a difference? Thanks for any help! – ShanZhengYang Sep 08 '16 at 12:36
  • There is also a `ValueError` I get with `for chunk in pd.read_csv('/path/to/file.csv.gz', sep='\s*', chunksize=10**5):` if the chunks are of unequal size. Sorry to ask so many questions, but I've opened a new thread here: http://stackoverflow.com/questions/39391597/valueerror-import-data-via-chunks-into-pandas-csv-reader – ShanZhengYang Sep 08 '16 at 12:58
  • @ShanZhengYang, I took a look at your new question - it's going to be difficult, because it's not a proper TSV/CSV file - one could call it a "broken" TSV (as it has a different number of columns per row). – MaxU - stand with Ukraine Sep 08 '16 at 15:02
  • And it's even more difficult if the columns are in the correct order. Any help would be appreciated :) – ShanZhengYang Sep 08 '16 at 15:08
  • Would pre-processing into a proper TSV file help? The problem is these files are huge, dozens of GB... – ShanZhengYang Sep 08 '16 at 15:36
  • @ShanZhengYang, are you working on Linux/Unix or another OS? Preprocessing would definitely help, IMO. – MaxU - stand with Ukraine Sep 08 '16 at 16:09
  • Yes, Linux/Unix. I imagine there should be a pipeline somehow. I'm not sure how to preprocess without using `for chunk in pd.read_csv()` and then managing the `chunk`...? First, chunk/preprocess, then use chunk/read_csv? – ShanZhengYang Sep 08 '16 at 16:23
  • @ShanZhengYang, I would suggest you open a new question with the following tags: `awk`, `sed`, `unix`, and provide a sample of your input data plus the desired/expected format (you didn't actually post in your questions what your desired data set should look like). AWK and sed are very efficient tools, so you can pipe your ungzipped stream to AWK or sed in order to preprocess your data... – MaxU - stand with Ukraine Sep 08 '16 at 16:28
  • Here is the new question: http://stackoverflow.com/questions/39398986/how-to-preprocess-and-load-a-big-data-tsv-file-into-a-python-dataframe – ShanZhengYang Sep 08 '16 at 19:48