
I have a lot of HTML files, and I want to extract the table plus some information outside of the table from each page and merge everything into one CSV or tab-delimited file. There is an earlier post, "Best method of extracting text from multiple html files into one CSV file"; I tried it on my HTML data and it is fast, but the result is a single column of data and, of course, it ignores the information outside of the table. I have pre-processed the HTML files into output.txt, which contains the information I need from inside and outside of the table, with this bash script:

#!/bin/bash
# collect the <tbody><tr ...> ... </tbody> block of every html file into output.txt
for f in *.html; do                                       # unquoted so the glob expands
    sed -n '/tbody><tr/,/\/tbody>/p' "$f" >> output.txt   # append; '>' would overwrite each pass
done

This works well and leaves a very clean listing of the table and the other information I need.

The part of the output.txt is just like this:

<tbody><tr><td><a href="fjzt-x.html?uid=NNNN">data11</a></td>
<td class="bzt">data12</td>
<td>data13</td>
    <td>data14</td>
<td>data15</td>
<td>data16</td>
<td>data17</td>
<td class="tdb"><span id="sNNNNN"></span></td>
<td class="tdb"><span id="zfNNNNN"></span></td>
<td class="bzt">--</td><td></td>
</tr>
<script src="https://hq.sohujs.cn/list=data18" type="text/javascript" charset="gbk"></script>
<script type="text/javascript">getprice1('NNNNN',NNNN,NNN);</script>
<td><a href="fjzt-x.html?uid=NNNN">data21</a></td>
<td class="bzt">data22</td>
<td>data23</td>
    <td>data24</td>
<td>data25</td>
<td>data26</td>
<td>data27</td>
<td class="tdb"><span id="sNNNNN"></span></td>
<td class="tdb"><span id="zfNNNNN"></span></td>
<td class="bzt">--</td><td></td>
</tr>
<script src="https://hq.sohujs.cn/list=data28" type="text/javascript"  charset="gbk"></script>
<script type="text/javascript">getprice1('NNNNN',NNNN,NNN);</script>

...

I want the tab-delimited output to look like this:

data11  data12  data13  data14  data15  data16  data17  data18

data21  data22  data23  data24  data25  data26  data27  data28

Could anyone help me? A bash or python solution would be preferred.

james

2 Answers


HTML can be extremely messy. I would therefore suggest using something more high-level than a bash script. Since you had already tagged the question with the python tag (rightly replaced with the bash tag in a later edit), let's go with Python and BeautifulSoup.

EDIT: In comments to this answer the author of the OP clarified what they really wanted:

  1. Collect the contents of td tags in an html table.

As in:

<td class="bzt">data12</td>

  2. Additionally, collect data from a link in the src attribute of one or more script tags in the same html file.

As in:

<script src="hq.sohujs.cn/list=data18" type="text/javascript" charset="gbk"></script>

  3. Perform 1. and 2. for all html files in the current working directory.

  4. Save this as a csv table with fields separated by TAB ("\t").

Working solution for python3 and BeautifulSoup

I extended the script from the earlier version of this answer to do this and added some explanations in comments:

"""module import"""
from bs4 import BeautifulSoup
import glob

"""obtain list of all html files in cwd"""
filenames = glob.glob("*.html")

for filename in filenames:
    """parse each file with bs4"""
    soup = BeautifulSoup(open(filename), 'html.parser')

    """obtain data from td tags"""
    tdTextList = [td.text.strip().replace("\n","") for td in soup.find_all("td")]

    """clean data: remove empty strings"""
    tdTextList = [td for td in tdTextList if not td=='']

    """obtain data from script tag attributes"""
    scriptTags = soup.findAll("script")
    for elementTag in scriptTags:
        src_attribute = elementTag.attrs.get("src")
        if src_attribute is not None:
            src_elements = src_attribute.split("=")
            if len(src_elements) > 1:
                tdTextList.append(src_elements[1])

    """write data to output002.csv"""
    with open("output002.csv", "a") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
        outputfile.write("\n")

How to run

From a terminal in the directory where the html files are, do:

python3 <script_name.py>

Alternatively, you can move the working directory to the correct location (where the html files are) at the beginning of the script with:

import os
os.chdir("</path/to/directory>")
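
A side note, not part of the original answer: the manual "\t" writing above can produce a malformed row if a td text ever contains a literal tab or a leftover newline. A minimal sketch of an alternative write step using python's csv module, which quotes such fields automatically (reusing the tdTextList built in the loop above):

import csv

# hedged variant of the write step: csv.writer quotes any field that
# contains the delimiter or a newline, so rows cannot be torn apart
with open("output002.csv", "a", newline="") as outputfile:
    writer = csv.writer(outputfile, delimiter="\t")
    writer.writerow(tdTextList)  # one tab-separated row per input file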

Working solution for python2 and BeautifulSoup

Since the author of the OP requested a python2 version, I provide one here. The only difference from the python3 version above is the file handling: this version uses python2's file() constructor (note that open() works in python2 as well).

"""module import"""
from bs4 import BeautifulSoup
import glob

"""obtain list of all html files in cwd"""
filenames = glob.glob("*.html")

for filename in filenames:
    """parse each file with bs4"""
    soup = BeautifulSoup(file(filename), 'html.parser')

    """obtain data from td tags"""
    tdTextList = [td.text.strip().replace("\n","") for td in soup.find_all("td")]

    """clean data: remove empty strings"""
    tdTextList = [td for td in tdTextList if not td=='']

    """obtain data from script tag attributes"""
    scriptTags = soup.findAll("script")
    for elementTag in scriptTags:
        src_attribute = elementTag.attrs.get("src")
        if src_attribute is not None:
            src_elements = src_attribute.split("=")
            if len(src_elements) > 1:
                tdTextList.append(src_elements[1])

    """write data to output002.csv"""
    with file("output002.csv", "a") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
        outputfile.write("\n")

Running the python2 version is analogous to the python3 version above.
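
One python2 pitfall the author of the OP ran into (see the comments below): writing non-ASCII td text to a plain file handle raises UnicodeEncodeError. A minimal sketch of the workaround they mention, with codecs.open encoding to UTF-8 on write:

import codecs

# codecs.open returns a wrapped file object that encodes unicode
# strings as UTF-8 as they are written (python2)
with codecs.open("output002.csv", "a", encoding="utf-8") as outputfile:
    for tdText in tdTextList:
        outputfile.write(tdText)
        outputfile.write("\t")
    outputfile.write("\n")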


Old version of this answer

The following script does what you describe:

  1. collect the contents of all td tags from all html files in the current directory

  2. write them to a csv with tab separators.

Here is an example script:

from bs4 import BeautifulSoup
import glob

filenames = glob.glob("*.html")
tdTextList = []
for filename in filenames:
    soup = BeautifulSoup(open(filename), 'html.parser')
    tdTextList += [td.text for td in soup.find_all("td")]

with open("output001.csv", "w") as outputfile:
    for tdText in tdTextList:
        outputfile.write(tdText)
        outputfile.write("\t")

This is what you describe. It is probably not what you want.

Note that this will produce a file with a single very long row (you do not specify when you want a new row). And it may accidentally produce a malformed file if the contents of any of the td tags contains a newline character.

To give an output file that looks a bit nicer, let's write a new line for each html file that is read and let's remove leading and trailing spaces as well as newline characters from the data before writing it to the output.

from bs4 import BeautifulSoup
import glob

filenames = glob.glob("*.html")

for filename in filenames:
    soup = BeautifulSoup(open(filename), 'html.parser')
    tdTextList = [td.text.strip().replace("\n","") for td in soup.find_all("td")]

    with open("output002.csv", "a") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
        outputfile.write("\n")

Note: you can run either script from the bash shell with:

python3 <script_name.py>
0range
  • Very good solutions! I made some minor modifications so it can run in python2.7; there is still some information missing, such as data18, data28, ..., which are not in the table; of course, the title is not appropriate. `import codecs` `with codecs.open("output002.csv", "a", encoding="utf-8")` `outputfile.write("\n")` – james Dec 05 '18 at 08:26
  • @james python2 and python3 have some differences, mainly related to the print statement (without brackets in python2) and the file IO (the function should be `file()`, not `open()` in python2). If you like, I can add a python2 version to the answer. – 0range Dec 06 '18 at 16:03
  • @james Could you elaborate on this: _there is still some information missed as data18,data28,... which are not in the table_ If I understand correctly, you want the strings "data18" and "data28" from the html to be scraped too. These strings appear in src urls of scripts included in the html. This is a special case compared to extracting from tables using td tags. It is possible to do this, but the solution would not be very generic. Is this the only special case? Also, could you include the html sample as code (instead of as a picture) in your question? It would make it easier to verify solutions. – 0range Dec 06 '18 at 16:15
  • @james You should copy and post the html sample as code block, in just the same way you posted your code: Indented 4 characters, as also explained in the [markdown help](https://stackoverflow.com/editing-help#code). Note that html tags are not rendered but shown as html when inside code blocks. – 0range Dec 07 '18 at 13:02
  • Thank you very much. The code: (the HTML sample was stripped to bare data values by the comment formatting; see the sample included in the question) – james Dec 07 '18 at 14:19
  • I am sorry for the messy code. Briefly, each html page has the information I need between start and end, which includes a table with 20 rows; for example, the output "data11 data12 data13 data14 data15 data16 data17" is one row of the table. However, data18 is not in the table; it is actually in the source code of the table. Maybe a regex would work for it, but I do not know that much, so I used piped sed commands to delete everything except data11~data18, which left a long single column of dataN1~dataN8. – james Dec 07 '18 at 14:59
  • Finally, I used the command `awk '{ ORS = (NR%8 ? "," : RS) } 1' file` to wrap the one-column data into 8 columns, but it failed for an unknown reason. – james Dec 07 '18 at 15:03
  • @james You should edit the question and put additional information like this directly in the question. Please also put the html sample as code directly in the question, not in a comment. – 0range Dec 07 '18 at 15:31
  • @james: I extended the answer to collect your data18 and data28 items from the src attributes of the script tags. Given how messy the source html is (with data to be collected from two different structures), I doubt very much that you will be able to do this in a bash/awk/sed one-liner with pipes and regular expressions. – 0range Dec 07 '18 at 16:34
  • Thank you very much! I have updated the html source code in the question. Also, I have tried your updated python code and it looks excellent. When I added `import sys reload(sys) sys.setdefaultencoding('utf8')` the script works very well and the dataN8 is collected, though it is appended at the tail, i.e. "data11...data17 data22...data27 data18 data28", which looks different from data11...data17 data18 data22...data27 data28 – james Dec 10 '18 at 05:29

Your sample data looks pretty clean. If this is indicative of how all the files are structured, using xmlstarlet with an XSLT stylesheet may be the easiest and cleanest way to go.
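
For instance, an untested sketch of my own (not a verified solution): xmlstarlet's fo command can first coerce the messy HTML into well-formed XML, and sel can then match every tr with XPath. The XPath expressions below assume the exact layout shown in the question and will likely need adjusting:

#!/bin/bash
# fo -H -R: parse the input as HTML and recover what is parsable;
# sel -t -m "//tr": apply the output template to every <tr> node
for f in *.html; do
    xmlstarlet fo -H -R "$f" 2>/dev/null |
        xmlstarlet sel -t -m "//tr" \
            -v "td[1]" -o $'\t' -v "td[2]" -o $'\t' -v "td[3]" -o $'\t' \
            -v "td[4]" -o $'\t' -v "td[5]" -o $'\t' -v "td[6]" -o $'\t' \
            -v "td[7]" -o $'\t' \
            -v "substring-after(following-sibling::script[@src][1]/@src, 'list=')" \
            -n
done > output.tsv

The substring-after() call pulls the data18-style token out of the script src url, mirroring what the BeautifulSoup answer does with split("=").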

Kyle Banerjee
  • It is a very powerful tool to learn at the UNIX shell command prompt and it can be combined with shell scripts. Thank you, Kyle. – james Dec 05 '18 at 06:05
  • I have installed xmlstarlet; please give me more specific scripts or steps to do that job, thanks a lot. – james Dec 05 '18 at 12:38