
I have an Excel file with only one sheet. The file is ~900 MB and contains thousands of rows and hundreds of columns.

I want to extract only a few columns (say Name, Numbers & Address) from the sheet and do some data manipulation on them.

Since the Excel file is huge, the traditional approach of creating a DataFrame with pandas and then extracting the columns takes a lot of time.

ExcelFile = pd.read_excel(fileAddress, sheet_name="Sheet1")

Is there a faster way to extract the columns from the excel file?


4 Answers


You may pass usecols to read_excel to import only specific columns from Excel into the DataFrame. If you use pandas 0.24+, read_excel can match usecols directly against column names, so just pass usecols a list of column names:

df = pd.read_excel(fileAddress, header=0, sheet_name='Sheet1', 
                                usecols=['Name', 'Numbers', 'Address'])

On pandas < 0.24, usecols doesn't understand column names. You need to know the Excel column letters corresponding to Name, Numbers, and Address, or their integer locations.

For example: Name is at B; Numbers at G; Address at AA

df = pd.read_excel(fileAddress, header=0, sheet_name='Sheet1', usecols='B,G,AA')

If you know their integer locations, you may use those in place of 'B', 'G', 'AA', for example usecols=[1, 6, 26]:
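A minimal sketch of the integer-position variant; the positions 1, 6, 26 are just the assumed counterparts of B, G, AA above:

df = pd.read_excel(fileAddress, header=0, sheet_name='Sheet1',
                   usecols=[1, 6, 26])  # 0-based positions of Name, Numbers, Address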

Andy L.

Hope this helps.

There are a few ways to do this; try them and take the approach that fits you best.

1. Specify the required columns while loading the data (just like Andy L.'s answer).

df = pd.read_excel(fileAddress, header=0, sheet_name='Sheet1', 
                                usecols=['Name', 'Numbers', 'Address'])

2. Specify dtypes

For every read operation, pandas does the heavy lifting of inferring the data types by itself. This consumes both memory and time, and it requires the whole data to be read before the types are settled.

To avoid this, specify your column data types (dtype).

Example:

pd.read_csv('sample.csv', dtype={"user_id": int, "username": object})

Available data types in pandas

[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.int64,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.uint64]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]

(As you can see, the list is long, so specifying the dtypes up front spares pandas from inferring them and speeds up the load.)
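The example above uses read_csv, but read_excel accepts the same dtype argument, so you can combine it with the column selection from step 1. A minimal sketch; the chosen dtypes here are assumptions, adjust them to your actual data:

import pandas as pd

# dtype switches off type inference for the listed columns
df = pd.read_excel(fileAddress, sheet_name='Sheet1',
                   usecols=['Name', 'Numbers', 'Address'],
                   dtype={'Name': str, 'Numbers': float, 'Address': str})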

3. Use a converter if you need help converting values while the data is read.

(Much like 2; an alternative to 2.)

Cases like null or empty values are easy to handle here. (Disclaimer: I never tried this.)

Example:

import numpy as np
import pandas as pd

def conv(val):
    # treat empty/None cells as 0 and keep the dtype consistent
    if not val:
        return np.float64(0)
    try:
        return np.float64(val)
    except (ValueError, TypeError):
        return np.float64(0)

df = pd.read_csv('sample.csv', converters={'COL_A': conv, 'COL_B': conv})
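read_excel also accepts a converters argument, so the same conv function can be applied while reading the original file; a sketch reusing the column names from the question:

# run conv on every cell of the Numbers column while reading
df = pd.read_excel(fileAddress, sheet_name='Sheet1',
                   usecols=['Name', 'Numbers', 'Address'],
                   converters={'Numbers': conv})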

4. Reading the data in chunks always helps.

chunksize = 10 ** 6
for chunk in pd.read_csv('sample.csv', chunksize=chunksize):
    process(chunk)  # process() is a placeholder for your own per-chunk logic

One thing to note is that each chunk should be treated as a separate DataFrame. This also helps with reading even larger files, say 4 GB or 6 GB.
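Note that chunksize is a read_csv feature; read_excel has no chunksize parameter, so this assumes you can export the sheet to a CSV first. A sketch combining chunking with the column selection from the question (sample.csv is a placeholder):

import pandas as pd

cols = ['Name', 'Numbers', 'Address']
pieces = []
for chunk in pd.read_csv('sample.csv', usecols=cols, chunksize=10 ** 6):
    # each chunk is an ordinary DataFrame; filter or aggregate it here
    pieces.append(chunk)

df = pd.concat(pieces, ignore_index=True)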

5. Use the pandas low_memory option.

Pass low_memory=False to tell pandas to process the whole file at once instead of in internal chunks; this avoids mixed-dtype guessing (and the resulting DtypeWarning), at the cost of using more memory.

df = pd.read_csv('sample.csv', low_memory=False)
sam

You can copy the columns of interest from file.xlsx to another.xlsx and then do the reading with pandas from another.xlsx.
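If you want to do that copy programmatically instead of by hand, one option (not part of this answer, just a sketch) is openpyxl's read-only mode, which streams rows instead of loading the whole sheet; the file names, sheet name, and column names below are taken from the question and this answer:

from openpyxl import load_workbook, Workbook

src = load_workbook('file.xlsx', read_only=True)  # read_only streams rows lazily
ws = src['Sheet1']

out = Workbook()
out_ws = out.active

wanted = {'Name', 'Numbers', 'Address'}
keep_idx = None
for row in ws.iter_rows(values_only=True):
    if keep_idx is None:
        # first row is the header: remember which positions to keep
        keep_idx = [i for i, h in enumerate(row) if h in wanted]
    out_ws.append([row[i] for i in keep_idx])

out.save('another.xlsx')
src.close()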

Jogabell

You can look at the other answers here, since pandas provides specific methods for this. But more natively it will work like this:

import csv
import toolz.curried as tc
import pandas as pd

def stream_csv(file_path):
    # lazily yields one dict per row, so the whole file never sits in memory
    with open(file_path) as f:
        yield from csv.DictReader(f, delimiter='\t')  # use whatever delimiter your file has

file_path = '../../data.csv'
# keyfilter keeps only the columns of interest from each row dict
relevant_data = map(tc.keyfilter(lambda column_name: column_name in ['a', 'b']),
                    stream_csv(file_path))

df = pd.DataFrame(relevant_data)

Note that everything except the final pandas call is lazy (a generator plus map) and thus memory efficient.

Drey