0

While coding a python-based script, i met with a strange html_table which like:

 <tr>

                                        <td x:str="2020-09-27 18:36:05"></td>

                                        <td x:str="SMS"></td>

                                        <td x:str="AAA"></td>

                                        <td x:str="10658139"></td>

</tr>

I know I can use MS Excel to convert it to a normal .xls or .xlsx file, but I have too many this kind of files to convert. So I need coding a script to finish the hard job. I have tried to use pandas to handle it, but pandas can not recoginze the data from the file correctly.

I guess maybe VBA can handle this problem well, but what I am familiar with is just Python. So can anybody tell me which python library can be used to handle this kind of html-based data table?

Any advice would be much appreciated.

In fact I have found out an evil way to solve the problem using re. some code like:

f=re.sub(r'\sx\:str=\"(.+)\">', r">\1",f)

But it looks like too violent.

Puzzlist
  • 1
  • 1
  • 3
  • 1
    Have you tried Beautiful Soup? This allows you to parse HTML and turn it into python objects to browse. I'm pretty sure it would handle your situation as well. https://stackoverflow.com/questions/11709079/parsing-html-using-python – JustLudo Nov 04 '20 at 07:33
  • BeautifulSoup is best solution here, as @Ludo21South indicated – IoaTzimas Nov 04 '20 at 07:53
  • I am guessing that kind of file format is some kind of Evil MS file.Because I need re-organize the table data if using BeautifulSoup, so it should be my finally choice if I can not find any easier method later. – Puzzlist Nov 04 '20 at 07:58

2 Answers2

0

I have converted it successfully into a XLS file using python. I have used BeautifulSoup to parse this HTML. You need to first pip install xlwt Python Code:-

from bs4 import BeautifulSoup
import pandas as pd;
html=open("index.html");
soup = BeautifulSoup(html, 'html.parser')
lis=[]
for extract in soup.find_all('td'):
    lis.append(extract.get('x:str'))
df=pd.DataFrame(lis)

df.to_excel('out2.xls')
LFC
  • 50
  • 5
  • I am not sure if it can solve the problem because there are many table rows in the file,not just one row as the example I showed. – Puzzlist Nov 04 '20 at 08:47
  • The code I wrote, searches for all the 'td' tags containing the attribute 'x:str' If you want something else as a filter send rest of the HTML code. Without that it is difficult to help. Alternatively you can read the the beautifulSoup documentation filter out exactly what you need. If you have trouble you can also use Xpath for extraction. – LFC Nov 04 '20 at 12:45
0

Try with BeatifulSoup. I cannot provide the entire solution, as i don't know what the desired output should be exactly however these commands will do the 90% of the work:

from bs4 import BeautifulSoup
soup=BeautifulSoup('enter_your_html_code_as_text_here', 'lxml')
l=soup.find_all('td')
[i.get('x:str') for i in l]

This will give you a list of your cells for a specific row () After that, you can easily create a dataframe with all your rows and sav it to Excel

IoaTzimas
  • 10,538
  • 2
  • 13
  • 30