How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern?

Question

Essentially, I'm looking for a 4-digit code enclosed in angle brackets within a text file. I know that I need to open the text file and then parse it line by line, but I am not sure of the best way to structure my code after "for line in file".

I think I can split it, strip it, or partition it, but I also wrote a regex and compiled it, and if that returns a match object I don't think I can use it with those string-based operations. I'm also not sure whether my regex is greedy enough...

I'd like to store all instances of those found hits as strings within either a tuple or a list.

Here is my regex:

regex = re.compile("(<(\d{4,5})>)?")

I don't think I need to include all that much code considering it's fairly basic so far.

Is your file too large to hold the whole thing in memory at one time? — Josiah, May 07 '12 at 05:57
well the end use of this is a module which returns a list or tuple that can be checked against? So, I'm not sure but that's the end use I'd like to have. — Carl Carlson, May 07 '12 at 06:01
Well, there's the function re.findall() which returns a list of all matches in the file, so if you read the file into a string (.read()) you can just run that on it and it gives you a list of the matched strings. However, if the file is too large for memory, you would need to read it one line at a time (or however else you want to split it up) — Josiah, May 07 '12 at 06:03
well I found out the file is 651 kb, but I'd like to limit using too much memory if possible and I've heard that doing it line by line is much safer? — Carl Carlson, May 07 '12 at 06:06
A file would have to be gigabytes in size for it to be an issue. The problem with doing it line by line is that your matches will only be indexes within each line you read, rather than an index to the entire file. You could work around that, but it's probably not necessary. — Josiah, May 07 '12 at 06:07

score 75 · Accepted Answer · edited Jun 10 '19 at 16:01

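import re
pattern = re.compile(r"<(\d{4,5})>")  # raw string, so \d reaches the regex engine as-is

with open('test.txt') as f:
    for i, line in enumerate(f, start=1):
        for match in pattern.finditer(line):
            # group(1) is the captured number; group() would include the angle brackets
            print('Found on line %s: %s' % (i, match.group(1)))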

A couple of notes about the regex:

You don't need the ? at the end or the outer (...) if you only want the number itself rather than the number together with its angle brackets. The trailing ? also makes the whole group optional, so the pattern can match the empty string.
It matches either 4 or 5 digits between the angle brackets.
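
To make the first note concrete, here is a small sketch comparing the pattern from the question with the one above. The sample string is made up for illustration and is not from the asker's file.

```python
import re

# Hypothetical sample text, invented for this comparison
sample = "id <1234> and <56789>, but not <12> or 1234"

# The question's pattern: the trailing ? makes the whole group optional,
# so it also produces empty matches wherever no code is present.
optional = re.compile(r"(<(\d{4,5})>)?")
# The pattern from this answer: brackets are required, only the digits are captured.
required = re.compile(r"<(\d{4,5})>")

optional_hits = optional.findall(sample)  # list of tuples, many of them ('', '')
required_hits = required.findall(sample)  # list of digit strings

print(required_hits)
print(len(optional_hits), "results from the optional pattern, most of them empty")
```

With two groups in the pattern, findall returns a tuple per match, so the question's version would need filtering before it could be used as a simple list of codes.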

Update: It's important to understand that the match and capture in a regex can be quite different. The regex in my snippet above matches the pattern with angle brackets, but I ask to capture only the internal number, without the angle brackets.
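
A short sketch of that distinction, using a made-up input string: group(0) is the whole match, while group(1) is only what the parentheses captured.

```python
import re

pattern = re.compile(r"<(\d{4,5})>")
m = pattern.search("code: <4821> ok")  # hypothetical input line

whole = m.group(0)   # everything the pattern matched, brackets included
digits = m.group(1)  # only the first capturing group
span = m.span(1)     # start and end index of the captured digits in the string

print(whole, digits, span)
```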

More about regex in Python can be found here: Regular Expression HOWTO

edited Jun 10 '19 at 16:01

LuftWaffle

answered May 07 '12 at 06:14

Eli Bendersky

what do you mean outer (...)? Are you saying that I can match all 4-5 digit #'s between the angle brackets? Cause that is what I wanted to do, except I was planning on matching including the angle brackets but then using rsplit and lsplit iteratively. – Carl Carlson May 07 '12 at 06:27
@CarlCarlson: Compare your regex with mine. I placed capturing parens `(...)` only around the number. You did around the number *and* the angle brackets. So your match will return both - and you only need the first IIUC. **See also my answer update** – Eli Bendersky May 07 '12 at 06:28
I think I understand match and capture a little bit better, but just to be clear, you are not implying that I mean to use anchoring right? Because I only want instances of numbers between angle brackets. – Carl Carlson May 07 '12 at 06:35
Not sure what anchoring has to do with it – Eli Bendersky May 07 '12 at 06:36
@CarlCarlson: in general, do yourself a favor and spend 20 minutes reading http://docs.python.org/library/re.html - these 20 minutes will pay themselves off many times over – Eli Bendersky May 07 '12 at 06:46

score 40 · Answer 2 · answered May 07 '12 at 06:13

Doing it in one bulk read:

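import re

filename = 'test.txt'  # placeholder path; use your own file
with open(filename, 'r') as textfile:
    filetext = textfile.read()
# Brackets are required and only the digits are captured,
# so findall returns a plain list of strings
matches = re.findall(r"<(\d{4,5})>", filetext)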

Line by line:

import re

filename = 'test.txt'  # placeholder path; use your own file
matches = []
reg = re.compile(r"<(\d{4,5})>")
with open(filename, 'r') as textfile:
    for line in textfile:
        matches.extend(reg.findall(line))  # add this line's hits to the list

But again, the matches this returns only tell you what was found, not where in the file it was found, unless you add an offset counter:

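import re

filename = 'test.txt'  # placeholder path; use your own file
matches = []
offset = 0  # characters read before the current line
reg = re.compile(r"<(\d{4,5})>")
with open(filename, 'r') as textfile:
    for line in textfile:
        for m in reg.finditer(line):
            # store the number with its position relative to the start of the file
            matches.append((m.group(1), offset + m.start()))
        offset += len(line)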

But it still just makes more sense to read the whole file in at once.
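
For the end use mentioned in the question's comments (a module that returns a list to check against), a minimal sketch could look like the following. The names find_codes and CODE_PATTERN are made up for illustration, and the temporary file only exists so the example runs on its own.

```python
import os
import re
import tempfile

# Hypothetical names; the pattern requires brackets and captures only the digits
CODE_PATTERN = re.compile(r"<(\d{4,5})>")

def find_codes(path):
    """Return every 4- or 5-digit code found between angle brackets, as strings."""
    with open(path) as f:
        return CODE_PATTERN.findall(f.read())

# Demo with a throwaway file standing in for the real text file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first <1234>\nnothing here\nsecond <98765> and <4321>\n")
    demo_path = tmp.name

codes = find_codes(demo_path)
os.remove(demo_path)

print(codes)
print("1234" in codes)  # membership check against the returned list
```

If many lookups are expected, converting the result with set(codes) would make each check faster, at the cost of losing order and duplicates.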

what exactly is an offset counter and what is the purpose? Why would I not be able to call this module that returns a list and check if strings in the list match another string? — Carl Carlson, May 07 '12 at 06:24
Oh, I didn't understand that in the original question, if that's what you want to do the offset counter is unnecessary. I assumed you wanted to know where in the file the strings occurred, I apologize. — Josiah, May 07 '12 at 06:26
