
Reading small HTML tables with pandas works fine, but with big files (around 10 MB, or about 10000 rows in the HTML table) I wait 10 minutes and still see no progress, whereas the same data in CSV is parsed quickly.

Please help me speed up reading the HTML table in pandas, or convert it to CSV.

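import pandas as pd

file = 'testfile.html'
# read_html returns a list of DataFrames, one per table whose text matches `match`
dfdefault = pd.read_html(file, header=0, match='Client Inventory Details')
# print(dfdefault)
df = dfdefault[0]  # first matching table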
Abhinav Kumar
  • Could you please add some more details? Some more context maybe? – J...S May 02 '19 at 07:23
  • What have you tried in terms of actual code? If the code is slow, what different approaches have you tried, and do you understand why it is slow? Speak to the duck: https://meta.stackoverflow.com/questions/281270/what-does-rubber-duck-mean-in-debug-help – Andrew Allen May 02 '19 at 07:48
  • @J...S For fewer items, like 50, it is fine. Once the file is around 10MB or 1000- rows, I don't see any response when I try to read the HTML table in pandas. – Abhinav Kumar May 02 '19 at 07:56
  • @AndrewAllen file='testfile.html' dfdefault = pd.read_html(file, header = 0, match='Client Inventory Details') – Abhinav Kumar May 02 '19 at 07:56
  • @AndrewAllen Until now I have been working with CSV files, which work as expected, but I am unable to read the data from HTML files when the data is big. I have the exact same data in CSV and it works fine. The reason I am moving to HTML is that I do not have to do anything to get it: the HTML arrives in my email. – Abhinav Kumar May 02 '19 at 07:58
  • The code needs to be in the question, along with some idea of what the HTML looks like and what else you have tried. Converting HTML to CSV has an answer here (it depends on the HTML): https://stackoverflow.com/questions/38917958/convert-html-into-csv – Andrew Allen May 02 '19 at 08:02
  • @AndrewAllen I added the code to the question. I have tried that link to convert HTML to CSV, but it failed: the output was badly messed up. – Abhinav Kumar May 02 '19 at 08:10
  • If reading csv files is much faster see https://kite.com/python/examples/4420/beautifulsoup-parse-an-html-table-and-write-to-a-csv how to convert html table into csv. (assuming the HTML table is a simple table) – balderman May 02 '19 at 09:07

1 Answer


An HTML dataset is still a dataset. To read large datasets faster in pandas, you can choose among several strategies, which apply to read_html as well:

1. Sampling

2. Chunking

3. Optimising pandas dtypes

  1. Sampling. The simplest option is to sample your dataset.
import random

import pandas

filename = "data.csv"
with open(filename) as f:
    n = sum(1 for _ in f) - 1  # number of data rows, excluding the header
s = n // 10  # sample size: 10% of the rows
# Row numbers to skip; the range starts at 1 so the header (row 0) is always kept
skip = sorted(random.sample(range(1, n + 1), n - s))
df = pandas.read_csv(filename, skiprows=skip)
  2. Chunks / Iteration. If you do need to process all the data, you can split it into a number of chunks (each of which fits in memory) and perform your data cleaning and feature engineering on each individual chunk.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

datafile = "data.csv"
chunksize = 100000
models = []
# pre_process_and_feature_engineer and features are placeholders you define yourself:
# a function that cleans the data and creates the features, and the list of feature columns
for chunk in pd.read_csv(datafile, chunksize=chunksize):
    chunk = pre_process_and_feature_engineer(chunk)
    model = LogisticRegression()
    model.fit(chunk[features], chunk['label'])
    models.append(model)

df = pd.read_csv("data_to_score.csv")
df = pre_process_and_feature_engineer(df)
# Average the predictions of the per-chunk models
predictions = np.mean([model.predict(df[features]) for model in models], axis=0)
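read_html takes no chunksize argument, so the chunked loop above cannot be pointed at an HTML file directly. One possible workaround, sketched here with only the standard library, is to stream the table rows into a CSV file first and then read that CSV in chunks. TableRowsToCsv and html_table_to_csv are hypothetical names, not pandas APIs. The sketch assumes a simple table without nested tables or colspan/rowspan cells, writes the rows of every table found in the file, and has not been benchmarked against read_html.

```python
import csv
from html.parser import HTMLParser


class TableRowsToCsv(HTMLParser):
    """Write each table row to a csv writer as soon as the row ends."""

    def __init__(self, writer):
        super().__init__(convert_charrefs=True)
        self.writer = writer
        self.row = None   # cells of the row being read, None outside a row
        self.cell = None  # text fragments of the open cell, None outside a cell

    def _end_cell(self):
        if self.cell is not None:
            # Join the fragments and collapse runs of whitespace
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None

    def _end_row(self):
        self._end_cell()
        if self.row is not None:
            self.writer.writerow(self.row)
            self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()  # tolerate a missing </tr>
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self._end_cell()  # tolerate a missing </td> or </th>
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr":
            self._end_row()


def html_table_to_csv(src, dst, block_size=1 << 20):
    """Copy table rows from the open text file src to the open text file dst."""
    parser = TableRowsToCsv(csv.writer(dst))
    # Feed the parser block by block so the whole file is never held in memory
    for block in iter(lambda: src.read(block_size), ""):
        parser.feed(block)
    parser.close()
    parser._end_row()  # flush a row left open at the end of the file
```

To use it, open the HTML file for reading and the CSV file for writing with newline="", pass both to html_table_to_csv, and then read the resulting CSV with pd.read_csv and a chunksize as in the loop above.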
  3. Optimise data types. When loading data from a file, pandas automatically infers the data types. This is convenient, but the inferred types are often not optimal and take up more memory than needed. The three most common types used by pandas are int, float and object, and the memory footprint of each can be reduced.

Another way to drastically reduce the size of a pandas DataFrame is to convert columns of dtype object to category.
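A minimal sketch of the dtype optimisation described above, using made-up data in place of a table returned by read_html (the column names and values are assumptions for illustration only). Note that this reduces the memory used by the DataFrame after it has been loaded; it is not expected to shorten the HTML parsing itself.

```python
import pandas as pd

# Made-up data standing in for a table that has already been loaded
df = pd.DataFrame({
    "client": ["acme", "globex", "acme", "initech"] * 2500,
    "units": [1, 2, 3, 4] * 2500,
    "price": [9.5, 12.25, 9.5, 3.0] * 2500,
})

before = df.memory_usage(deep=True).sum()

optimised = df.copy()
# Repeated strings are stored once each when the column is categorical
optimised["client"] = optimised["client"].astype("category")
# Downcast numbers to the smallest numeric dtype that can hold the values
optimised["units"] = pd.to_numeric(optimised["units"], downcast="integer")
optimised["price"] = pd.to_numeric(optimised["price"], downcast="float")

after = optimised.memory_usage(deep=True).sum()
print(f"{before} bytes before, {after} bytes after")
print(optimised.dtypes)
```

Converting to category pays off mainly when a column has few distinct values relative to its length; for columns of mostly unique strings it may not save anything.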

PirrenCode
  • You have not shown how to apply these strategies to reading an HTML file. While the first one may apply (if you don't need all the data), the second relies on a parameter that read_html does not even accept. Finally, the third one may apply, but it would be useful to show how to do it, considering that the only option read_html offers for this is the converters parameter – Ivan Calderon May 26 '20 at 23:34