Python regex to findall lines containing specific types of filenames

Question

I have a text file. I want to get the lines that contain a file-name only if the file-name is a .doc or a .pdf type file.

For example,

<TR><TD ALIGN="RIGHT">4.</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD>
</TR>
<TR><TD ALIGN="RIGHT">5.</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>G.K. Ram</TD>
</TR>

Using Python's re.findall(), I want to get the following lines:

<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>

Can anybody please tell me a scalable way to define the pattern for re.findall()?

It's returning ['pdf', 'doc'] only... but I need the whole line. — mxant, May 15 '13 at 06:53
Statutory warning: [You can't parse HTML with regex](http://stackoverflow.com/a/1732454/1321855). (That shouldn't be a problem with this simple example, though.) — Anubhav C, May 15 '13 at 06:56
Are you suggesting that I loop through the lines and search each one of them? findall actually does that efficiently, provided we give it the correct pattern. — mxant, May 15 '13 at 06:59

jvallver · Answer 1 · 2013-05-15T07:01:13.680

You can use this regex:

(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*)

Output:

>>> html = """<TR><TD ALIGN="RIGHT">4.</TD>
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD>
... </TR>
... <TR><TD ALIGN="RIGHT">5.</TD>
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>
... <TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>G.K. Ram</TD>
... </TR>"""
>>> import re
>>> re.findall(r"(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*)", html)
['<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>', '<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>']
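
The ['pdf', 'doc'] result mentioned in the question comments is what re.findall() gives when the extension alternation is the pattern's only capturing group: findall then returns the group, not the full match. The sketch below contrasts the two forms on a shortened copy of the sample. The exact pattern the asker tried is not shown in the thread, so the capturing pattern here is an assumption.

```python
import re

html = """<TR><TD ALIGN="RIGHT">4.</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD>
</TR>"""

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r"\w+\.(pdf|doc)", html)

# With a non-capturing group (?:...), findall returns the whole match.
# The surrounding .* stretches the match over the full line, because
# "." does not match a newline unless re.DOTALL is set.
whole_lines = re.findall(r".*\w+\.(?:pdf|doc).*", html)

print(captured)
print(whole_lines)
```

Here captured holds just the extension, while whole_lines holds the complete TD line that contains the link.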

But I need the whole line...from to — mxant, May 15 '13 at 07:02

score 1 · Answer 2 · answered May 15 '13 at 06:53

Something like this:

>>> strs="""<TR><TD ALIGN="RIGHT">4.</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>L. Sam</TD>
</TR>
<TR><TD ALIGN="RIGHT">5.</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=72>G.K. Ram</TD>
</TR>"""

>>> import re
>>> [x for x in strs.splitlines() if re.search(r"[a-zA-Z0-9]+\.(pdf|doc)",x)]
['<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="ABC.pdf"> On Complex Analytic Manifolds</a></TD>',
 '<TD ALIGN="LEFT" VALIGN="TOP" WIDTH=50%><a href="DEF.doc"> On the Geometric theory of Fields</a>*</TD>'
]
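
Since the question asks for a scalable, regex-only way to define the pattern, one possible sketch is to build the alternation from a list of extensions and let re.MULTILINE anchor the match to whole lines. The function name lines_with_extensions and its parameters are hypothetical, chosen only for illustration.

```python
import re

def lines_with_extensions(text, extensions):
    """Return every line of text that mentions a file name with one of the extensions."""
    # re.escape keeps characters such as "+" in an extension from being read as regex syntax.
    alternatives = "|".join(re.escape(ext) for ext in extensions)
    # Under re.MULTILINE, ^ and $ anchor at line boundaries.
    # The trailing \b stops "doc" from matching inside "docx".
    pattern = re.compile(
        r"^.*[\w-]+\.(?:" + alternatives + r")\b.*$",
        re.MULTILINE,
    )
    # No capturing groups, so findall returns each full matching line.
    return pattern.findall(text)
```

Calling lines_with_extensions(strs, ["pdf", "doc"]) on the sample above should return the same two TD lines, and supporting another file type only means adding an entry to the list.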

Actually, I don't want to use string functions. I need to do it using regex only. — mxant, May 15 '13 at 06:55

score 1 · Answer 3 · answered May 15 '13 at 07:36

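You can use both BeautifulSoup and re.

from bs4 import BeautifulSoup  # BeautifulSoup 4 package
import re

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, 'html.parser')
# href is an attribute of the <a> tag, so search for 'a' and filter on href
links = soup.find_all('a', href=re.compile('your regex here'))

If the <a> tags in your HTML carry a class, you can narrow the search further by also passing attrs = {'class' : 'text'}, with 'text' replaced by that class.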
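
For comparison, the same parse-then-filter idea can be sketched with only the standard library, using html.parser.HTMLParser as a stand-in for BeautifulSoup. This is a swapped-in technique, not the code from this answer. The class LinkLineCollector and the function lines_linking_to are hypothetical names, and the sketch assumes lines are separated by "\n".

```python
from html.parser import HTMLParser

class LinkLineCollector(HTMLParser):
    """Record the line number of every <a> tag whose href ends in a wanted extension."""

    def __init__(self, extensions):
        super().__init__()
        self.suffixes = tuple("." + ext.lower() for ext in extensions)
        self.line_numbers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this method.
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(self.suffixes):
            # getpos() returns (line, column); line numbers start at 1.
            self.line_numbers.append(self.getpos()[0])

def lines_linking_to(html, extensions):
    """Return the source lines that contain a link to a file with one of the extensions."""
    collector = LinkLineCollector(extensions)
    collector.feed(html)
    collector.close()
    lines = html.split("\n")
    # sorted(set(...)) keeps a line only once even if it holds several matching links.
    return [lines[n - 1] for n in sorted(set(collector.line_numbers))]
```

Because the parser reads the href attribute itself, a mention of "ABC.pdf" in the visible text of a cell would not count as a link, which is one difference from the plain regex approaches.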

kiriloff
