Repeat text extraction with Python

Question

I have the following code, which I would like to use to extract the text between <font color='#FF0000'> and </font>. It works, but it extracts only the first unit, whereas I would like to extract every text unit between these tags. I tried to do this with a bash loop, but it didn't work.

import os

directory_path = 'C:\\My_folder\\tmp'

for files in os.listdir(directory_path):
    print(files)

    path_for_files = os.path.join(directory_path, files)

    # Read the whole file into one string
    text = open(path_for_files, mode='r', encoding='utf-8').read()

    starting_tag = '<font color='
    ending_tag = '</font>'

    # Slice from the first opening tag up to the first closing tag
    result = text[text.find(starting_tag):text.find(ending_tag)]

    results_dir = 'C:\\My_folder\\tmp'
    results_file = files[:-4] + 'txt'

    path_for_files = os.path.join(results_dir, results_file)

    open(path_for_files, mode='w', encoding='UTF-8').write(result)
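
The slicing approach in the script can be extended to collect every match by calling find repeatedly and resuming the search after each closing tag. The following is only a minimal sketch: the function name extract_all and the sample string are made up for illustration, and it assumes the opening tag is always written exactly as <font color='#FF0000'>.

```python
def extract_all(text, start_tag="<font color='#FF0000'>", end_tag="</font>"):
    """Return every substring found between start_tag and end_tag."""
    results = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break                      # no further opening tag
        start += len(start_tag)        # skip past the opening tag itself
        end = text.find(end_tag, start)
        if end == -1:
            break                      # unclosed tag: stop rather than guess
        results.append(text[start:end])
        pos = end + len(end_tag)       # resume searching after this match
    return results


# Hypothetical sample input, not taken from the original files
sample = "a <font color='#FF0000'>one</font> b <font color='#FF0000'>two</font> c"
print(extract_all(sample))  # ['one', 'two']
```

Plain string search breaks as soon as the tag is written differently (double quotes, extra attributes, lower-case colour value), which is why the answers below use an HTML parser instead.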

I imagine you should use something like find_all if you want more than one. – Padraic Cunningham, Dec 23 '14 at 11:58

Avinash Raj · Accepted Answer · 2014-12-23T14:48:04.383

2

You could use Beautiful Soup's CSS selectors.

>>> from bs4 import BeautifulSoup
>>> s = "foo <font color='#FF0000'> foobar </font> bar"
>>> soup = BeautifulSoup(s, 'lxml')
>>> for i in soup.select('font[color="#FF0000"]'):
...     print(i.text)
...
 foobar 
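
To run the same selector over every file in a folder, as asked in the comments below, something along these lines might work. It is a sketch under assumptions: the names red_font_texts and extract_directory are invented here, it uses Python's built-in "html.parser" backend instead of 'lxml' so that only Beautiful Soup 4 needs to be installed, and it assumes the output folder is different from the input folder so that results are not read back in as input.

```python
import os

from bs4 import BeautifulSoup


def red_font_texts(html):
    """Return the text of every <font color="#FF0000"> element in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text() for tag in soup.select('font[color="#FF0000"]')]


def extract_directory(input_dir, output_dir):
    """Write one .txt file per input file, one extracted unit per line."""
    for name in os.listdir(input_dir):
        in_path = os.path.join(input_dir, name)
        if not os.path.isfile(in_path):
            continue                   # skip sub-directories
        with open(in_path, encoding="utf-8") as handle:
            texts = red_font_texts(handle.read())
        # splitext handles both .htm and .html, unlike slicing off 4 characters
        out_name = os.path.splitext(name)[0] + ".txt"
        out_path = os.path.join(output_dir, out_name)
        with open(out_path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(text.strip() for text in texts))
```

A call would then look like extract_directory('C:\\My_folder\\tmp', 'C:\\My_folder\\out'), where the second path is a hypothetical, already existing output folder.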

edited Dec 23 '14 at 14:48

answered Dec 23 '14 at 12:08

Avinash Raj


Thanks for your suggestion, but I'm having trouble with BeautifulSoup, the same old problem: "ImportError: No module named BeautifulSoup", and none of the proposed solutions work for me. – user3635159 Dec 23 '14 at 14:33
You need to import BeautifulSoup. Install it if it isn't already installed. – Avinash Raj Dec 23 '14 at 14:47
Yes, I know. I did install it but somehow cannot import it. I read different suggestions, but none of them worked for me. I'm now thinking the problem may be that I have three Python versions installed on my computer. I have never had such a problem with other packages. – user3635159 Dec 23 '14 at 14:53
Hm, I managed to run BeautifulSoup on Cygwin, but I get an error: `AttributeError: 'str' object has no attribute 'text'` – user3635159 Dec 23 '14 at 15:07
I am using BeautifulSoup-3.2.1, the only one that runs on my machine. – user3635159 Dec 23 '14 at 15:12
Update to BeautifulSoup 4, or use `print(i.string)` instead of `print(i.text)`. – Avinash Raj Dec 23 '14 at 15:14
Well, I was having problems installing BS4, but it works now. Your suggestion does what I wanted. Just one stupid question: how do I replace the line 's=' with a directory path so that the script runs on multiple files? Thanks. – user3635159 Dec 23 '14 at 15:37
Do you mean this: `string.replace('s=', 'path')`? – Avinash Raj Dec 23 '14 at 15:43
Yes. I mean to run the script on files from a folder and not only on one line. – user3635159 Dec 23 '14 at 16:20
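
If Beautiful Soup cannot be imported at all, as in the comments above, the standard library's html.parser module can do a similar job without installing anything. This swaps in a different technique from the answer, and the class and function names (RedFontCollector, red_font_texts_stdlib) are invented for this sketch. It assumes the colour is written as #FF0000 in any letter case.

```python
from html.parser import HTMLParser


class RedFontCollector(HTMLParser):
    """Collect the text inside every <font color="#FF0000"> element."""

    def __init__(self):
        super().__init__()
        self.red_depth = 0   # nesting level inside a matching <font>
        self.pieces = []     # text pieces of the element being read
        self.found = []      # one entry per completed element

    def handle_starttag(self, tag, attrs):
        if tag != "font":
            return
        color = dict(attrs).get("color") or ""
        if self.red_depth:
            self.red_depth += 1          # nested <font> inside a red one
        elif color.upper() == "#FF0000":
            self.red_depth = 1
            self.pieces = []

    def handle_endtag(self, tag):
        if tag == "font" and self.red_depth:
            self.red_depth -= 1
            if self.red_depth == 0:
                self.found.append("".join(self.pieces))

    def handle_data(self, data):
        if self.red_depth:
            self.pieces.append(data)


def red_font_texts_stdlib(html):
    """Parse html and return the text of each red <font> element."""
    parser = RedFontCollector()
    parser.feed(html)
    parser.close()
    return parser.found


print(red_font_texts_stdlib("foo <font color='#FF0000'> foobar </font> bar"))
```

To process a folder, the content of each file returned by os.listdir can be read and passed to red_font_texts_stdlib in the same way as the string here.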

score 0 · Answer 2 · answered Dec 23 '14 at 12:41

You can also use lxml.html.

>>> import lxml.html as PARSER
>>> s = "<html><body>foo <font color='#FF0000'> foobar </font> bar</body></html>"
>>> root = PARSER.fromstring(s)
>>> for i in root.getiterator("font"):
...     try:
...         i.attrib["color"]
...     except KeyError:
...         pass
...
'#FF0000'

answered Dec 23 '14 at 12:41

Vivek Sable


Is 's' here an HTML file? Will that work if I replace it with a directory containing a bunch of HTML or XML files? Also, your script extracts '#FF0000', whereas I would like to extract the highlighted text between the colour tags: ** text text text ** – user3635159 Dec 23 '14 at 13:14
"s" is the content of the HTML file. You have to apply the "for" loop to the HTML/XML files in the directory: use os.listdir("/tmp/target_html/") and the file read method. Yes, I missed the text of the "font" tag:
>>> root = PARSER.fromstring(s)
>>> for i in root.getiterator("font"):
...     try:
...         if i.attrib["color"] == "#FF0000":
...             print(i.text)
...     except KeyError:
...         pass
– Vivek Sable Dec 23 '14 at 16:21
Thanks for your reply. I'm still pretty new to Python. Do you mind telling me exactly how I should combine your suggestion, or what @Avinash Raj suggested, with my script? – user3635159 Dec 23 '14 at 20:08
You can use either piece of code, but test it on a couple of test cases (valid and invalid) first. Or can you share sample code with a test case, so I can take a look and provide a solution? vivekbsable@gmail.com/vivek.igp (Skype ID)
import lxml.html as PARSER

def getFontTagText(content):
    """Input: HTML content. Output: list of font tag texts."""
    font_text = []
    root = PARSER.fromstring(content)
    for i in root.getiterator("font"):
        try:
            if i.attrib["color"] == "#FF0000":
                font_text.append(i.text)
        except KeyError:
            pass
    return font_text
– Vivek Sable Dec 24 '14 at 06:49
I think my question is not clear and I'm confusing you. I'll post it as another question with my attempt. Thanks. – user3635159 Dec 24 '14 at 08:34
