Extracting specific columns from an HTML table in a shell script

Question

I am not an expert in shell scripting, and I am struggling to find a way to extract only specific columns from an HTML table. I tried several tools (awk, grep, hxselect) but could not come up with a solution.

hxselect requires the HTML to be well-formed, which is not always the case for my input. Here is a sample table:

<table class="jiraIssueTable aui">
 <colgroup>
    <col width="18">
    <col width="90">
    <col>
    <col width="9%">
    <col width="9%">
    <col width="9%">
 </colgroup>
 <thead>
    <tr>
       <th id="Related issues-type">Type</th>
       <th id="Related issues-key">Key</th>
       <th id="jiraDetailsText">Summary</th>
       <th id="Related issues-status">Status</th>
       <th id="Related issues-assignee">Assignee</th>
       <th id="Related issues-fix-versions">Fix versions</th>
    </tr>
 </thead>
 <tbody>
    <tr class="" >
       <td class="jiraIssueIcon" headers="Related issues-type"> <img class="issueTypeImg" src="/images/icons/jira_type_unknown.gif" alt="Unknown Issue Type"> </td>
       <td class="jiraIssueKey" headers="Related issues-key"> <a title="View this issue" class="jiraIssueLink" data-issue-key="OL-541" id="viewIssueInJira:OL-541" href="">OL-541</a> </td>
       <td headers="jiraDetailsText" class="jiraIssueDetailsError"> Increase the performance </td>
       <td class="jiraIssueStatus" headers="Related issues-status"> </td>
       <td headers="Related issues-assignee" class="jiraIssueDetailsError"> </td>
       <td headers="Related issues-fix-versions" class="jiraIssueDetailsError"> </td>
    </tr>
    <tr class="" >
       <td class="jiraIssueIcon" headers="Related issues-type"> <a href="devStatusDetailDialog=build" title="View this issue"> <img class="issueTypeImg" src="rType=issuetype" alt="Task"/> </a> </td>
       <td class="jiraIssueKey" headers="Related issues-key"> <a title="View this issue" class="jiraIssueLink" data-issue-key="IT-2431" id="viewIssueInJira:IT-2431" href="">IT-2431</a> </td>
       <td headers="jiraDetailsText" class="jiraIssueDetails"> Get some sample data </td>
       <td class="jiraIssueStatus" headers="Related issues-status"> Verified/Closed </td>
       <td headers="Related issues-assignee" class="jiraIssueDetails"> User A </td>
       <td headers="Related issues-fix-versions" class="jiraIssueDetailsError"> </td>
    </tr>
 </tbody>
</table>

From this table I only need the contents of columns 2 and 3, so my final result should look like this:

OL-541 Increase the performance

IT-2431 Get some sample data

Any help is appreciated.

[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, May 20 '20 at 11:49

score 1 · Answer 1 · answered May 22 '20 at 11:15

This is how I solved my issue.

I was using hxselect to select certain table data from the HTML. It requires the HTML to be well-formed, which was not always the case for me: my file contained unclosed tags. For example, in the sample above the <img class="issueTypeImg" src="/images/icons/jira_type_unknown.gif" alt="Unknown Issue Type"> tag is never closed, and hxselect was complaining about it.

Then I found that hxclean filename.html fixes the markup when there are broken HTML tags.

So this worked for me:

Use hxclean filename.html
Apply hxselect to the result

Thank you
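
For anyone who cannot install html-xml-utils, here is a rough alternative sketch rather than the hxclean/hxselect route described above. It assumes Python 3 is available and uses the standard library's html.parser, which does not complain about unclosed tags such as the <img> in the sample. The names ColumnExtractor, extract_columns and SAMPLE are made up for this illustration, and the sample is a shortened version of the table from the question.

```python
from html.parser import HTMLParser


class ColumnExtractor(HTMLParser):
    """Hypothetical helper: collect the text of chosen <td> columns per table row."""

    def __init__(self, columns):
        super().__init__()
        self.columns = columns   # 1-based column numbers to keep
        self.rows = []           # one list of cell texts per data row
        self._cells = None       # cells of the row being read, None outside a row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th>, so they have no cells and are skipped.
            if self._cells:
                picked = [" ".join(self._cells[c - 1].split())
                          for c in self.columns if c <= len(self._cells)]
                self.rows.append(picked)
            self._cells = None

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data


def extract_columns(html, columns=(2, 3)):
    """Return one string per data row with the chosen columns joined by a space."""
    parser = ColumnExtractor(columns)
    parser.feed(html)
    parser.close()
    return [" ".join(row) for row in parser.rows]


# Shortened sample based on the table in the question (note the unclosed <img>).
SAMPLE = """
<table>
<thead><tr><th>Type</th><th>Key</th><th>Summary</th></tr></thead>
<tbody>
<tr class="" >
<td> <img src="unknown.gif" alt="Unknown Issue Type"> </td>
<td> <a href="">OL-541</a> </td>
<td> Increase the performance </td>
</tr>
<tr class="" >
<td> <a href=""> <img src="task.gif" alt="Task"/> </a> </td>
<td> <a href="">IT-2431</a> </td>
<td> Get some sample data </td>
</tr>
</tbody>
</table>
"""

for line in extract_columns(SAMPLE):
    print(line)
# prints:
# OL-541 Increase the performance
# IT-2431 Get some sample data
```

To use it on a real file, read the file contents into a string and pass that to extract_columns. The sketch only handles a single, non-nested table and ignores colspan, so treat it as a starting point.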
