1

I have Excel files that I'm converting into HTML tables. As part of this conversion, there are columns that are grouped together like item C in the table below.

+---+---+
| A | B |
+---+---+
|   | D |
+ C +---+
|   | E |
+---+---+
| G | H |
+---+---+

In my final HTML, I'd like to apply a rowspan to those columns so that the table displays the same as the Excel file. As far as I can gather, there isn't a way to do this with Python. There is a way to do it with JavaScript and I've stumbled upon code that works perfectly as confirmed by JSFiddle.

function tableCheck() {
    const table = document.querySelector("table");

    let headerCell = null;

    for (let row of table.rows) {
        const firstCell = row.cells[0];

        if (headerCell === null || firstCell.innerText !== headerCell.innerText) {
            headerCell = firstCell;
        } else {
            headerCell.rowSpan++;
            firstCell.remove();
        }
    }
}
tableCheck()

The problem is that I cannot figure out how to apply this using js2py. I imagine I'm not passing the HTML file contents correctly into the JS function, but so far my guesses and searches have been unsuccessful.

My latest attempt is below. Any tips on how to get the JavaScript to work? Is there a way to do this with Python that I missed? Anything else I can provide to help describe expected output? Thank you!

import pandas as pd
import js2py
from js2py import eval_js

xlsx = pd.ExcelFile(filename)
sheetNumbers = len(xlsx.sheet_names)

for i in range (sheetNumbers):
    first_info = "blah"
    last_info = "blahblah"

    df = pd.read_excel(filename, sheet_name=i)
    df = df.fillna(method = 'ffill') # fills merged cells (single column) with merged value
    df.to_html("excelAsHTML.html", index=False)

    with open("excelAsHTML.html") as file:
        file = file.read()

js = """
function tableCheck() {
    const table = document.querySelector("table");

    let headerCell = null;

    for (let row of table.rows) {
        const firstCell = row.cells[0];

        if (headerCell === null || firstCell.innerText !== headerCell.innerText) {
            headerCell = firstCell;
        } else {
            headerCell.rowSpan++;
            firstCell.remove();
        }
    }
}
tableCheck(file)
"""
file = js2py.eval_js(js)

    with open("excelAsHTML.txt", "a") as file_to_write:
        file_to_write.write(first_info + file + last_info)
mvCode
  • 25
  • 5
  • https://stackoverflow.com/questions/10136319/executing-javascript-from-python – codester_09 Feb 21 '22 at 17:46
  • 1
    If you're creating the HTML, then of course it can be done in Python. You just add the appropriate "rowspan" attributes to the `` tags as you write them. Now, it's quite possible that the pandas automated `read_excel` and `to_html` methods don't give you enough flexibility, and in that case you might just need to create the layout yourself. – Tim Roberts Feb 21 '22 at 17:48
  • @TimRoberts So far, I can get the Excel file to output to HTML in the following format (hope it renders correctly in the comment). What I'd like to do is then merge those two C rows together, which is where the JavaScript (theoretically) comes in. My issue is either in calling that JavaScript code or passing in the file contents for the JS code to run correctly. Thanks! `+---+---+ | A | B | +---+---+ | C | D | +---+---+ | C | E | +---+---+ | G | H | +---+---+` – mvCode Feb 21 '22 at 18:01
  • @SharimIqbal Thank you - that answer is what got me to this point so far. However, when I test that code, I get the following error. `js2py.internals.simplex.JsException: SyntaxError: Line 7: Unexpected identifier` Adding this code still results in the same error: `.replace("document.write", "return ")` – mvCode Feb 21 '22 at 18:11
  • If you really insist on sticking with `to.html`, then a much better solution would be to use BeautifulSoup to parse the HTML, manipulate the DOM in Python, and then write it back out. I think BeautifulSoup is actually the module you're seeking here. – Tim Roberts Feb 21 '22 at 18:13
  • @TimRoberts I'm sure I'm wrong, but I thought BeautifulSoup was only able to take tables that are already formatted (i.e. the HTML table has a rowspan already applied). If that's incorrect, can you point me towards any documentation or examples of BeautifulSoup formatting a table with rowspans? – mvCode Feb 21 '22 at 18:26
  • BeautifulSoup reads HTML and converts to an object model, just like a browser does. You can then manipulate that object model, just like your Javascript is doing. BS4 can then convert that object model back into HTML. – Tim Roberts Feb 21 '22 at 18:54
  • @TimRoberts Can you provide any tips or resources on how I would go about manipulating the data to apply a rowspan to repeating rows? All resources I've found are how to take a table that already has rowspan attributes and convert that to a table. – mvCode Feb 21 '22 at 19:37
  • Your Javascript code is the exact template you need. Find a table, scan through the rows, look for cases where the first `` content is duplicated, and combine them. It's spelled quite differently, but the philosophy is EXACTLY the same. – Tim Roberts Feb 21 '22 at 22:01
  • @TimRoberts I don't mean to be dense, but I'm not a programmer by trade. This script is more of a fun side project than anything. I do appreciate you pointing me in the right direction, but I'm heavily dependent on search terms, and so far I'm running into a brick wall. The closest example I've found details multi indexing, but relies on columns with static headings. [link](https://stackoverflow.com/questions/49533330/pandas-data-frame-how-to-merge-columns#answer-49534143) These column headings will vary, so I'm not sure how applicable that example will be. – mvCode Feb 23 '22 at 20:55

1 Answers1

1

OK, here is Python code using BeautifulSoup that does what your Javascript does. Basically, if the text in the first <td> tag in each row is the same as the previous row, then we delete this tag an bump the "rowspan" for the previous tag.

from bs4 import BeautifulSoup

dom = BeautifulSoup(open('x.html').read(), 'lxml')
tbl = dom.find('table')
last = None
for row in tbl.find_all('tr'):
    this = row.find('td')
    if last and this.text == last.text:
        this.decompose()
        if 'rowspan' not in last:
            last['rowspan'] = 1
        last['rowspan'] += 1
    else:
        last = this
print(dom)

Before text:

<table>
    <tr>
        <td>A</td>
        <td>B</td>
    </tr>
    <tr>
        <td>C</td>
        <td>D</td>
    </tr>
    <tr>
        <td>C</td>
        <td>E</td>
    </tr>
    <tr>
        <td>G</td>
        <td>H</td>
    </tr>
</table>

After:

<html><body><table>
<tr>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td rowspan="2">C</td>
<td>D</td>
</tr>
<tr>

<td>E</td>
</tr>
<tr>
<td>G</td>
<td>H</td>
</tr>
</table>
</body></html>
Tim Roberts
  • 48,973
  • 4
  • 21
  • 30
  • Thank you!! I've only done one test run, but so far this looks perfect. I'm sure I'll run into issues as I test/tinker more, but this is miles better than the headstart I was hoping for (and a much better solution than trying to get js2py to play nicely with Python). – mvCode Feb 23 '22 at 21:43