
I am trying to read a very large text file (a .dat file), too large to open in full, but I am only interested in a small part of its content.

The real text file has 32 columns and an unknown number of rows. I am only interested in the rows where the value in column 15 is less than 40 and the value in column 21 is between 10 and 25. Is there a way to specify these constraints when opening and loading the file in Python, so that I avoid wasting memory on the content I don't care about?

Here is an example:

Let's say we have a file of 100 values (25 rows and 4 columns) and we only want to read the rows (the full rows!) where the value in column 2 is less than 40 and the value in column 4 is between 10 and 25. How can we do this without first loading the full file?

import numpy as np
# Create some example fake data for the text file:
textfile_content = np.random.randint(100, size=100).reshape(25, 4)
print(textfile_content)
# Save it as a space-separated text file:
np.savetxt('file.dat', textfile_content, fmt='%10.5e')
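
For context, I know I could stream the file line by line in plain Python and keep only the matching rows, along these lines (a minimal sketch; it avoids loading the whole file, but it parses every line in interpreted Python, which I expect to be slow on a really large file):

import numpy as np

# Stream the file and keep only rows where column 2 < 40
# and column 4 is between 10 and 25 (1-based column numbers):
kept = []
with open('file.dat') as fh:
    for line in fh:
        row = [float(x) for x in line.split()]
        if row[1] < 40 and 10 <= row[3] <= 25:
            kept.append(row)

filtered = np.array(kept)
print(filtered)

I was hoping there is a more idiomatic way to express these constraints directly when loading the file.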

The closest approach I have been able to find is this: Reading specific lines only (Python). But it doesn't quite solve my problem, as that person wants to extract full columns predefined by their column number, not rows selected by a specific data value.

  • I suggest [`dask.dataframe`](http://dask.pydata.org/en/latest/dataframe.html#) to lazily filter, then bring the result into memory in `pandas` and output to a `numpy` array. – jpp Jun 10 '18 at 15:09
  • Okay, thank you @jpp. Could you show me an example? – kirerik Jun 10 '18 at 16:11
  • @jpp I edited the code above to save a text file as well. Thank you for trying to help! – kirerik Jun 10 '18 at 16:59

1 Answer


You can use dask.dataframe to lazily load data with a filter. Internally, the library works by chunking: it hides the chunking from the user and leverages Pandas syntax to make usage seamless.
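
For intuition, here is roughly what that chunking amounts to if you write it by hand with pandas' own chunked reader (a sketch of the idea only, not of dask's actual implementation; the file name and chunk size are placeholders):

import pandas as pd

# Hand-rolled chunked filtering: read a fixed number of rows at a time,
# keep only the matching rows of each chunk, concatenate the survivors.
pieces = []
for chunk in pd.read_csv('file.csv', header=None, sep=' ',
                         names=['col1', 'col2', 'col3', 'col4'],
                         chunksize=100_000):
    pieces.append(chunk[chunk['col4'].between(10, 25)])

result = pd.concat(pieces, ignore_index=True)

dask wraps this pattern up for you: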

import dask.dataframe as dd

# file.csv contains space-separated rows like:
# 1 2 3 4
# 5 6 7 8
# ...
# 33 34 35 36
# 37 38 39 40

# read file - LAZY: nothing is loaded into memory yet
df = dd.read_csv(r'C:\Temp\file.csv', header=None, sep=' ',
                 names=['col1', 'col2', 'col3', 'col4'])

# filter file - LAZY: the condition is only recorded, not evaluated
df = df[df['col4'].between(10, 25)]

# trigger the computation; only the filtered rows are materialised
# in the resulting pandas dataframe
df = df.compute()

# extract a numpy array from the pandas dataframe
arr = df.values

print(arr)

array([[ 9, 10, 11, 12],
       [13, 14, 15, 16],
       [17, 18, 19, 20],
       [21, 22, 23, 24]], dtype=int64)
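
Note that your real task has two constraints (column 15 below 40 as well as column 21 between 10 and 25). With dask you can combine the filters the same way, and everything stays lazy until you call .compute(). A sketch, reusing the column names from the example above:

# both constraints at once, still lazy until .compute()
df = df[(df['col2'] < 40) & df['col4'].between(10, 25)]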