There are a few ways to do this; try them and take the approach that best fits your case.
1. Specify the required columns while loading the data (as in Andy L.'s answer):
import pandas as pd

df = pd.read_excel(fileAddress, header=0, sheet_name='Sheet1',
                   usecols=['Name', 'Numbers', 'Address'])
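If you don't know the column names up front, usecols in read_excel also accepts an Excel-style letter range; a sketch, assuming the first three columns are the ones you need:

df = pd.read_excel(fileAddress, header=0, sheet_name='Sheet1', usecols='A:C')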
2. Specify dtypes
For every read operation, pandas does the heavy lifting of inferring each column's data type by itself. This costs both memory and time, and it requires the whole data to be read first. To avoid this, specify your column data types (dtype).
Example:
pd.read_csv('sample.csv', dtype={"user_id": int, "username": object})
Available data types in pandas (the numpy scalar type hierarchy):

numpy.generic
  numpy.number
    numpy.integer
      numpy.signedinteger: numpy.int8, numpy.int16, numpy.int32, numpy.int64, numpy.timedelta64
      numpy.unsignedinteger: numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64
    numpy.inexact
      numpy.floating: numpy.float16, numpy.float32, numpy.float64, numpy.float128
      numpy.complexfloating: numpy.complex64, numpy.complex128, numpy.complex256
  numpy.flexible
    numpy.character: numpy.bytes_, numpy.str_
    numpy.void: numpy.record
  numpy.bool_
  numpy.datetime64
  numpy.object_
(As you can see, the list is long, so letting pandas guess among all of these costs time; specifying the dtypes up front speeds up your job.)
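If you're curious where this hierarchy comes from, here's a minimal sketch that walks numpy's scalar-type classes and prints them as a tree (the exact classes you see vary by numpy version and platform):

import numpy as np

def print_dtype_tree(cls, indent=0):
    # Recursively print a class and all its subclasses as an indented tree.
    print(' ' * indent + cls.__name__)
    for sub in cls.__subclasses__():
        print_dtype_tree(sub, indent + 2)

print_dtype_tree(np.generic)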
3. Use a converter when you need help with data conversions in your data (similar to 2, and an alternative to it). Cases like null or empty values are easy to handle here. (Disclaimer: I never tried this.)
Example:

import numpy as np
import pandas as pd

def conv(val):
    # Treat missing/empty values as 0, and fall back to 0 when parsing fails.
    if not val:
        return np.float64(0)
    try:
        return np.float64(val)
    except ValueError:
        return np.float64(0)

df = pd.read_csv('sample.csv', converters={'COL_A': conv, 'COL_B': conv})
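A quick sanity check of the converter with hypothetical values:

conv('')       # -> 0.0 (empty value)
conv('abc')    # -> 0.0 (unparseable value)
conv('3.14')   # -> 3.14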
4. Reading the data in chunks helps when the file is too large to load at once.
chunksize = 10 ** 6
for chunk in pd.read_csv('sample.csv', chunksize=chunksize):
    process(chunk)
One thing to note is to treat each chunk like a separate DataFrame. This also helps with reading larger files, e.g. 4 GB or 6 GB.
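process above is just a placeholder; here's a minimal sketch of what a chunked pipeline might look like, assuming a hypothetical numeric column 'Numbers' that you want to filter on:

import pandas as pd

chunksize = 10 ** 6
filtered_chunks = []
for chunk in pd.read_csv('sample.csv', chunksize=chunksize):
    # Each chunk is an ordinary DataFrame; process it independently.
    filtered_chunks.append(chunk[chunk['Numbers'] > 0])

# Combine the per-chunk results into a single DataFrame at the end.
df = pd.concat(filtered_chunks, ignore_index=True)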
5. Use the pandas low_memory option.
By default (low_memory=True), pandas processes the file in internal chunks while inferring dtypes, which can leave columns with mixed types and trigger a DtypeWarning. Pass low_memory=False to explicitly tell pandas to load the whole file into memory at once, so each column's type is inferred from all of its values.
df = pd.read_csv('sample.csv', low_memory=False)
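Putting a few of these together; a sketch, assuming hypothetical column names and types:

import pandas as pd

# usecols + dtype + chunksize combined (hypothetical columns).
chunks = pd.read_csv('sample.csv',
                     usecols=['user_id', 'username'],
                     dtype={'user_id': 'int64', 'username': 'object'},
                     chunksize=10 ** 6)
df = pd.concat(chunks, ignore_index=True)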