HTML can be extremely messy, so I would suggest using something higher-level than a bash script. Since you had originally tagged the question with the python tag (rightly replaced with the bash tag in a later edit), let's go with Python and BeautifulSoup.
EDIT: In the comments on this answer, the OP clarified what they really want:
1. Collect the contents of td tags in an HTML table, as in:

        <td class="bzt">data12</td>

2. Additionally collect data from a link in the src attribute of one or more script tags in the same HTML file, as in:

        <script src="hq.sohujs.cn/list=data18" type="text/javascript" charset="gbk"></script>

3. Perform 1. and 2. for all HTML files in the current working directory.

4. Save this as a CSV table with fields separated by TAB ("\t").
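For example, if the two snippets above were the only matches in a file, the corresponding row in the output would look like this (a sketch; actual rows depend on each file's contents):

    data12	data18

i.e. the two fields separated by a literal TAB.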
Working solution for Python 3 and BeautifulSoup

I extended the script from the earlier version of this answer to do this and added some explanation in comments:
"""module import"""
from bs4 import BeautifulSoup
import glob
"""obtain list of all html files in cwd"""
filenames = glob.glob("*.html")
for filename in filenames:
"""parse each file with bs4"""
soup = BeautifulSoup(open(filename), 'html.parser')
"""obtain data from td tags"""
tdTextList = [td.text.strip().replace("\n","") for td in soup.find_all("td")]
"""clean data: remove empty strings"""
tdTextList = [td for td in tdTextList if not td=='']
"""obtain data from script tag attributes"""
scriptTags = soup.findAll("script")
for elementTag in scriptTags:
src_attribute = elementTag.attrs.get("src")
if src_attribute is not None:
src_elements = src_attribute.split("=")
if len(src_elements) > 1:
tdTextList.append(src_elements[1])
"""write data to output002.csv"""
with open("output002.csv", "a") as outputfile:
for tdText in tdTextList:
outputfile.write(tdText)
outputfile.write("\t")
outputfile.write("\n")
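A note on how the script tags are handled: split("=") cuts the src link at the equals sign, and the part after the first "=" is what gets appended to the row. For the example script tag above, a quick interpreter check shows:

    >>> "hq.sohujs.cn/list=data18".split("=")
    ['hq.sohujs.cn/list', 'data18']

so data18 ends up in the CSV.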
How to run
From a terminal in the directory where the HTML files are, run:

    python3 <script_name.py>
Alternatively, you can move the working directory to the correct location (where the html files are) at the beginning of the script with:
    import os
    os.chdir("</path/to/directory>")
Working solution for Python 2 and BeautifulSoup

Since the OP requested a Python 2 version, I provide one here. The only difference from the Python 3 version above is the file handling: this version uses the file() built-in (which was removed in Python 3) in place of open().
"""module import"""
from bs4 import BeautifulSoup
import glob
"""obtain list of all html files in cwd"""
filenames = glob.glob("*.html")
for filename in filenames:
"""parse each file with bs4"""
soup = BeautifulSoup(file(filename), 'html.parser')
"""obtain data from td tags"""
tdTextList = [td.text.strip().replace("\n","") for td in soup.find_all("td")]
"""clean data: remove empty strings"""
tdTextList = [td for td in tdTextList if not td=='']
"""obtain data from script tag attributes"""
scriptTags = soup.findAll("script")
for elementTag in scriptTags:
src_attribute = elementTag.attrs.get("src")
if src_attribute is not None:
src_elements = src_attribute.split("=")
if len(src_elements) > 1:
tdTextList.append(src_elements[1])
"""write data to output002.csv"""
with file("output002.csv", "a") as outputfile:
for tdText in tdTextList:
outputfile.write(tdText)
outputfile.write("\t")
outputfile.write("\n")
Running the Python 2 version is analogous to the Python 3 version above.
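That is, from a terminal in the directory where the HTML files are (assuming the Python 2 interpreter is installed as python2 on your system):

    python2 <script_name.py>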
Old version of this answer
The following script does what you describe:

- collect all the contents of all HTML files in the current directory
- write them to a CSV with a tab separator

Here is an example script:
    from bs4 import BeautifulSoup
    import glob

    filenames = glob.glob("*.html")
    tdTextList = []
    for filename in filenames:
        soup = BeautifulSoup(open(filename), 'html.parser')
        tdTextList += [td.text for td in soup.find_all("td")]

    with open("output001.csv", "w") as outputfile:
        for tdText in tdTextList:
            outputfile.write(tdText)
            outputfile.write("\t")
This is what you describe. It is probably not what you want.
Note that this will produce a file with a single very long row (you did not specify when you want a new row), and it may accidentally produce a malformed file if the content of any of the td tags contains a newline character.
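To illustrate with a hypothetical cell: a td whose content spans two lines, such as

    <td>data
    12</td>

gives "data\n12" for td.text, so the raw newline lands in the middle of the output row and breaks it in two.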
To give an output file that looks a bit nicer, let's start a new row for each HTML file that is read, and let's remove leading and trailing whitespace as well as newline characters from the data before writing it to the output.
    from bs4 import BeautifulSoup
    import glob

    filenames = glob.glob("*.html")
    for filename in filenames:
        soup = BeautifulSoup(open(filename), 'html.parser')
        tdTextList = [td.text.strip().replace("\n", "") for td in soup.find_all("td")]

        with open("output002.csv", "a") as outputfile:
            for tdText in tdTextList:
                outputfile.write(tdText)
                outputfile.write("\t")
            outputfile.write("\n")
Note: you can run either script from the bash shell with:

    python3 <script_name.py>