I have a text file like this:

john123:
1
2
coconut_rum.zip

bob234513253:
0
jackdaniels.zip
nowater.zip 
3

judy88009:
dontdrink.zip
9

tommi54321:
dontdrinkalso.zip
92

...

I have millions of entries like this.

I want to pick out the names whose number is exactly 5 digits long. I tried this:

matches = re.findall(r'\w*\d{5}:',filetext2)

but it gives me every name that ends in at least 5 digits:

['bob234513253:', 'judy88009:', 'tommi54321:']

Q1: How do I find the names with exactly 5 digits?

Q2: I want to append the zip files that are associated with these 5-digit names. How do I do that using regular expressions?

Sounak

3 Answers

3

That's because \w includes digit characters:

>>> import re
>>> re.match(r'\w*', '12345')
<_sre.SRE_Match object at 0x021241E0>
>>> re.match(r'\w*', '12345').group()
'12345'
>>>

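You need to be more specific and tell Python that you only want letters before the digits. Because re.findall scans from every position in the text, also require at least one letter and start at a word boundary; otherwise the last five digits of a longer number (for example '13253:') would still match:

matches = re.findall(r'\b[A-Za-z]+\d{5}:',filetext2)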
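A quick check of this idea on the header lines from the question, as a minimal sketch. It assumes every name starts with at least one letter; the `\b` anchor and the `+` quantifier are choices made here for use with `re.findall`, and the sample string is cut down to the header lines only.

```python
import re

# Header lines taken from the question's sample data.
filetext2 = "john123:\nbob234513253:\njudy88009:\ntommi54321:\n"

# \b        : start at a word boundary, so a match cannot begin mid-name
# [A-Za-z]+ : one or more letters only (unlike \w, this excludes digits)
# \d{5}:    : exactly five digits, immediately followed by the colon
pattern = re.compile(r'\b[A-Za-z]+\d{5}:')
matches = pattern.findall(filetext2)
print(matches)  # ['judy88009:', 'tommi54321:']

# For comparison: allowing zero letters lets findall match the tail of a long number.
loose = re.findall(r'[A-Za-z]*\d{5}:', filetext2)
print(loose)  # ['13253:', 'judy88009:', 'tommi54321:']
```

With `re.match`, which only tries the start of the string, the looser pattern does not show this problem, since the match has to begin at the first character of the name.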

Regarding your second question, you can use something like the following:

import re
# Dictionary to hold the results
results = {}
# Break-up the file text to get the names and their associated data.
# filetext2.split('\n\n') breaks it up into individual data blocks (one per person).
# Mapping to str.splitlines breaks each data block into single lines.
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    # See if the name matches our pattern (raw string, so \d reaches the regex engine untouched).
    if re.match(r'[A-Za-z]*\d{5}:', name):
        # Add the name and the relevant data to the dictionary.
        # [:-1] gets rid of the colon on the end of the name.
        # The list comprehension gets only the file names from the data.
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]

Or, without all the comments:

import re
results = {}
for name, *data in map(str.splitlines, filetext2.split('\n\n')):
    if re.match(r'[A-Za-z]*\d{5}:', name):
        results[name[:-1]] = [x for x in data if x.endswith('.zip')]

Below is a demonstration:

>>> import re
>>> filetext2 = '''\
... john123:
... 1
... 2
... coconut_rum.zip
...
... bob234513253:
... 0
... jackdaniels.zip
... nowater.zip
... 3
...
... judy88009:
... dontdrink.zip
... 9
...
... tommi54321:
... dontdrinkalso.zip
... 92
... '''
>>> results = {}
>>> for name, *data in map(str.splitlines, filetext2.split('\n\n')):
...     if re.match(r'[A-Za-z]*\d{5}:', name):
...         results[name[:-1]] = [x for x in data if x.endswith('.zip')]
...
>>> results
{'tommi54321': ['dontdrinkalso.zip'], 'judy88009': ['dontdrink.zip']}
>>>

Keep in mind, though, that reading all of the file's contents into memory at once is not very efficient, especially with millions of entries. Instead, you should consider writing a generator function that yields the data blocks one at a time. You can also improve performance by pre-compiling your regex patterns.
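One possible shape for that suggestion is sketched below. The names `iter_blocks`, `collect_zips` and `NAME_RE` are hypothetical, and the sketch assumes blocks are separated by one or more blank lines; `io.StringIO` stands in for an open file so the example runs on its own.

```python
import io
import re

# Compiled once and reused for every block (assumed pattern: letters, then exactly 5 digits).
NAME_RE = re.compile(r'[A-Za-z]+\d{5}:$')

def iter_blocks(lines):
    """Yield one block (a list of stripped, non-empty lines) at a time."""
    block = []
    for line in lines:
        line = line.strip()
        if line:
            block.append(line)
        elif block:
            # A blank line ends the current block.
            yield block
            block = []
    if block:
        # The last block may not be followed by a blank line.
        yield block

def collect_zips(lines):
    """Map each matching name (without the colon) to its .zip entries."""
    results = {}
    for name, *data in iter_blocks(lines):
        if NAME_RE.match(name):
            results[name[:-1]] = [x for x in data if x.endswith('.zip')]
    return results

# A real file object can be passed in the same way as this in-memory stand-in.
sample = io.StringIO(
    "john123:\n1\ncoconut_rum.zip\n\n"
    "judy88009:\ndontdrink.zip\n9\n\n\n"
    "tommi54321:\ndontdrinkalso.zip\n92\n"
)
results = collect_zips(sample)
print(results)
```

Because the generator only holds one block at a time, memory use stays roughly constant regardless of how many entries the file has.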

  • You should probably wrap parentheses from the start of the string to before the colon so that the colon is not included in the username string. – anon582847382 Nov 09 '14 at 18:46
  • Thank you. How do I make a list with this regex and the zip files which are under this username? – Sounak Nov 09 '14 at 18:47
  • @new_coder - Sorry about the delay; something important popped up. My edited post answers your second question. –  Nov 09 '14 at 20:58
  • Hi. One more thing. What if I don't want to hardcode the number 5? Something like `if re.match('[A-Za-z]*\d{num}:', name):` where `num = 5`. Is it possible to do that? – Sounak Nov 10 '14 at 04:43
  • @new_coder - You can use [string formatting](https://docs.python.org/3/library/string.html#formatstrings) to insert whatever number you want: `'[A-Za-z]*\d{{{num}}}:'.format(num=5)` yields `'[A-Za-z]*\d{5}:'`. Note that you need the extra curly braces since `{...}` denotes a format field. –  Nov 10 '14 at 16:15
  • @iCodez Hi, I tried it like this inside `def func(self, num):` with `num = 5` and then `match = re.search('[A-Za-z]*\d{{{num}}}:'.format(num), text)`, but it always gives me a key error: `KeyError: 'num'` – Sounak Nov 19 '14 at 12:52
  • I found the solution here. http://stackoverflow.com/questions/6930982/variable-inside-python-regex – Sounak Nov 19 '14 at 13:12
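The `KeyError` reported in the comments above is consistent with passing the value positionally (`.format(num)`) while the template uses the named field `{num}`. A small sketch of the difference, with `%`-formatting shown as one alternative that avoids doubling the braces:

```python
import re

num = 5  # number of digits wanted

# Doubled braces {{ and }} produce literal braces; {num} is the format field.
pattern = r'[A-Za-z]*\d{{{num}}}:'.format(num=num)
print(pattern)  # [A-Za-z]*\d{5}:

# Passing the value positionally leaves the named field {num} unfilled.
try:
    r'[A-Za-z]*\d{{{num}}}:'.format(num)
    error = None
except KeyError as exc:
    error = exc
    print('KeyError:', exc)  # KeyError: 'num'

# %-formatting needs no brace doubling and builds the same pattern.
same_pattern = r'[A-Za-z]*\d{%d}:' % num

print(bool(re.match(pattern, 'judy88009:')))      # True
print(bool(re.match(pattern, 'bob234513253:')))   # False
```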
1
import re

results = {}

with open('datazip') as f:
    records = f.read().split('\n\n')

for record in records:
    lines = record.split()
    if not lines:
        # skip empty records, e.g. after trailing blank lines
        continue
    header = lines[0]

    # note that you need a raw string
    if re.match(r"[^\d]\d{5}:", header[-7:]):

        # in general multiple hits are possible, so put them into a list
        results[header] = [l for l in lines[1:] if l[-3:]=="zip"]

print(results)

Output

{'tommi54321:': ['dontdrinkalso.zip'], 'judy88009:': ['dontdrink.zip']}

Comment

I tried to keep it very simple. If your input is very long you should, as suggested by iCodez, implement a generator that yields one record at a time. For the regexp match I tried a small optimization: searching only the last 7 characters of the header.

Addendum: a simplistic implementation of a record generator

import re

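def records(f):
    record = []
    for l in f:
        l = l.strip()
        if l:
            record.append(l)
        elif record:
            # a blank line closes the current record; repeated blanks are skipped
            yield record
            record = []
    if record:
        # last record, if the file does not end with a blank line
        yield record

results = {}
with open('datazip') as f:
    for record in records(f):
        head = record[0]
        if re.match(r"[^\d]\d{5}:", head[-7:]):
            results[head] = [r for r in record[1:] if r[-3:]=="zip"]
print(results)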
gboffi
0

You need to anchor the regex at the end of the word with \b so that it won't match when more digits follow:

[a-zA-Z]+\d{5}\b

see for example http://regex101.com/r/oC1yO6/1

The regex would match the names on these lines (the trailing colon is not part of the match):

judy88009:

tommi54321:

The Python code would look like this:

>>> re.findall(r'[a-zA-Z]+\d{5}\b', x)
['judy88009', 'tommi54321']
nu11p01n73R