An HTML dataset is still a dataset. To read large datasets into Pandas faster, you can choose between several strategies, and they apply to read_html as well:
1. Sampling
2. Chunking
3. Optimising Pandas dtypes
- Sampling. The simplest option is to sample your dataset.
import pandas
import random
filename = "data.csv"
with open(filename) as f:
    n = sum(1 for line in f) - 1  # number of data rows, excluding the header
s = n // 10  # sample size of 10%
# Randomly choose the n-s data rows to skip; data rows are 1..n, row 0 is the header
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)
- Chunking / Iteration
If you do need to process all of the data, you can split it into a number of chunks (each of which fits in memory on its own) and perform your data cleaning and feature engineering on each individual chunk.
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
datafile = "data.csv"
chunksize = 100000
models = []
for chunk in pd.read_csv(datafile, chunksize=chunksize):
    # A function to clean the data and create the features
    chunk = pre_process_and_feature_engineer(chunk)
    # 'features' is the list of feature column names, defined elsewhere
    model = LogisticRegression()
    model.fit(chunk[features], chunk['label'])
    models.append(model)
df = pd.read_csv("data_to_score.csv")
df = pre_process_and_feature_engineer(df)
# Average the predictions of the models trained on each chunk
predictions = np.mean([model.predict(df[features]) for model in models], axis=0)
- Optimise data types
When loading data from a file, Pandas automatically infers the datatypes. That is very convenient, but these inferred datatypes are often not optimal and take up more memory than needed. We will go over the three most common datatypes used by Pandas (int, float and object) and show how to decrease their memory footprint, starting with the numeric types in the sketch below.
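A rough sketch of downcasting the numeric columns, reusing the hypothetical data.csv from the snippets above (which columns qualify depends on your data):
import pandas as pd
df = pd.read_csv("data.csv")
# Downcast integer columns to the smallest integer subtype that holds the values
for col in df.select_dtypes(include="integer").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
# Downcast float64 columns to float32 where the precision loss is acceptable
for col in df.select_dtypes(include="float").columns:
    df[col] = pd.to_numeric(df[col], downcast="float")
print(df.memory_usage(deep=True))  # compare with the original to see the savings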
Another way to drastically reduce the size of your Pandas DataFrame is to transform columns of dtype object to category, as in the sketch below.
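A minimal sketch, again using the hypothetical data.csv; the 50% uniqueness threshold is an arbitrary rule of thumb, not a pandas default:
import pandas as pd
df = pd.read_csv("data.csv")
for col in df.select_dtypes(include="object").columns:
    # category only saves memory when a column has few unique values
    if df[col].nunique() / len(df) < 0.5:
        df[col] = df[col].astype("category")
print(df.memory_usage(deep=True))
Internally, a category column stores each distinct value once plus small integer codes per row, which is why low-cardinality columns shrink so dramatically.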