5

I am trying to pull the numeric values (100.00 and 200.00) out of a file using Python's regular expressions, but when I run the code it doesn't print anything. I am using Python 2.7.

1) My file is named "file100", and it is where I need to extract the values from:

# cat file100
Hi this doller 100.00
Hi this is doller 200.00

2) This is my Python code:

# cat count100.py
#!/usr/bin/python
import re
file = open('file100', 'r')
for digit in file.readlines():
        myre=re.match('\s\d*\.\d{2}', digit)
        if myre:
           print myre.group(1)

3) When I run this code, it prints nothing at all, not even an error:

# python   count100.py
dsh
Karn Kumar
  • pygo, did you find my answer helpful? – Russia Must Remove Putin Dec 21 '15 at 21:44
  • Side-note: `for digit in file.readlines():` is wasteful and delays processing (it slurps the whole file into memory before beginning iteration). `for digit in file:` iterates without slurping (so peak memory is based on the largest input line, not the size of the file). There is literally no use case for `file.readlines()`; in the rare case where you need a `list` of lines instead of iterating lines as you go, `list(file)` accomplishes the same result more generally/succinctly (it works on any non-infinite iterator, not just file-like objects with `.readlines()`). – ShadowRanger Dec 21 '15 at 21:46
  • @ShadowRanger I make that point in my answer below. – Russia Must Remove Putin Dec 21 '15 at 21:50
  • I don't know why you're using `group(1)`, you don't have a capture group in your regex. I believe I have concisely explained why you should only be using `group(0)` below. – Russia Must Remove Putin Dec 21 '15 at 21:53

4 Answers

2

Use re.search instead:

import re
file = open('file.txt', 'r')
for digit in file.readlines():
    myre = re.search(r'\s\b(\d*\.\d{2})\b', digit)
    if myre:
        print myre.group(1)

Results

100.00
200.00

From the documentation:

Scan through string looking for the first location where the regular expression pattern produces a match

If you decided to use a group, parentheses are also needed:

(...) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

re.match only succeeds:

If zero or more characters at the beginning of string match the regular expression pattern
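
To see the difference on the question's sample line, here is a minimal sketch. The variable names are only illustrative, and `print()` with a single argument behaves the same on Python 2.7 and 3:

```python
import re

line = "Hi this doller 100.00"
pattern = r'\s\d*\.\d{2}'

# re.match only tries the pattern at the start of the string. The line
# starts with "Hi", not whitespace, so the result is None.
match_result = re.match(pattern, line)

# re.search scans forward and returns the first location that matches.
search_result = re.search(pattern, line)

print(match_result)                   # None
print(repr(search_result.group(0)))   # ' 100.00' (leading space comes from \s)
```

This is why the original script printed nothing: `myre` was always None, so the `if myre:` branch never ran.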

Prefix the pattern with r to make it a raw string:

String literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and use different rules for interpreting backslash escape sequences.

...

Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C
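
A small sketch of why the prefix matters, using `\b` as the example. Sequences such as `\d` and `\s` happen to survive in a normal string because Python does not treat them as escapes, but `\b` is turned into a backspace character before the re module ever sees it. The variable names here are illustrative:

```python
import re

plain = '\b100'    # Python parses "\b" as a single backspace character
raw = r'\b100'     # backslash and "b" are both kept, so re sees a word boundary

print(len(plain))  # 4
print(len(raw))    # 5

line = "Hi this doller 100.00"
plain_result = re.search(plain, line)   # looks for a literal backspace
raw_result = re.search(raw, line)       # looks for "100" at a word boundary

print(plain_result)          # None
print(raw_result.group(0))   # 100
```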

Juan Diego Godoy Robles
  • It doesn't use the context manager, it materializes the entire file in memory, and it recompiles the regex every time in the loop. `digit` is also semantically wrong. – Russia Must Remove Putin Dec 21 '15 at 21:23
  • We're only solving the OP's main problem here; this is not a code review tool. Of course it can be improved in many ways, but that's not the point here. Cheers – Juan Diego Godoy Robles Dec 21 '15 at 21:27
  • The question was why it didn't work, and the answer given by @klashxx explains why. If you want to improve the answer, you can give it as a separate answer; why downvote it? – helloV Dec 21 '15 at 21:27
  • @klashxx - This refined code works nicely, though I am still looking for the literal meaning of the 'r' just before the regex. Meanwhile, I'll read about it on python.org. – Karn Kumar Dec 21 '15 at 21:32
  • `r` makes it a raw string, so your backslashes won't escape the characters they precede when Python parses the string. – Russia Must Remove Putin Dec 21 '15 at 21:37
1

If they are always at the end of your lines just rsplit once and pull the last element:

with open('file100', 'r') as f:
    for line in f:
        print(line.rsplit(None, 1)[1])

Output:

100.00
200.00

rsplit(None,1) just means we split once from the end of the string on whitespace, then we pull the second element:

In [1]: s = "Hi this doller 100.00"

In [2]: s.rsplit(None,1)
Out[2]: ['Hi this doller', '100.00']

In [3]: s.rsplit(None,1)[1]
Out[3]: '100.00'

In [4]: s.rsplit(None,1)[0]
Out[4]: 'Hi this doller'

If you really need a regex use search:

import re

with open('file100', 'r') as f:
    for line in f:
        m = re.search(r"\b\d+\.\d{2}\b",line)
        if m:
            print(m.group())
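
One caveat with the rsplit approach: `rsplit(None, 1)[1]` assumes every line has at least two whitespace-separated fields, so a blank line would raise IndexError. A minimal sketch of a guarded variant follows; the helper name `last_field` is hypothetical, and an in-memory list stands in for the file:

```python
def last_field(line):
    """Return the last whitespace-separated field, or None for a blank line."""
    parts = line.rsplit(None, 1)   # [] when the line is empty or whitespace only
    return parts[-1] if parts else None

lines = ["Hi this doller 100.00\n", "\n", "Hi this is doller 200.00\n"]

values = []
for line in lines:
    field = last_field(line)
    if field is not None:
        values.append(field)

print(values)   # ['100.00', '200.00']
```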
Padraic Cunningham
1

Your primary problem is that you're using re.match which requires a match starting from the beginning of the string, not re.search, which allows a match that can start at any point in the string. I'll break down my recommendations, though:

import re

There's no need to recompile on every loop iteration (Python actually caches some regexes for you, but keep a reference to a compiled one to be safe). I'm using the VERBOSE flag to break the regex apart for you. Prefix your string with r so that backslashes don't escape the characters they precede when Python reads the string:

regex = re.compile(r'''
  \s      # one whitespace character, though I think this is perhaps unnecessary
  \d*     # 0 or more digits
  \.      # a dot
  \d{2}   # 2 digits
  ''', re.VERBOSE) 

Use a context manager and open the file with universal newlines, 'rU' mode, so that no matter what platform the file was created on, you will be able to read it line by line.

with open('file100', 'rU') as file:

Don't use readlines, which loads the entire file into memory at once. Instead, use the file object as an iterator:

    for line in file:
        myre = regex.search(line) 
        if myre:
            print(myre.group(0)) # group(0) is the whole match; there are
                                 # no capture groups in your regex

My code prints:

100.00
200.00
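
Putting the pieces above together, here is one way the whole thing might look as a single runnable script. The names `AMOUNT` and `extract_amounts` are hypothetical, and dropping the leading `\s` (and requiring at least one digit with `\d+`) is an assumption made here so the match does not include the space:

```python
import re

# Compiled once, outside the loop.
AMOUNT = re.compile(r'''
    \d+     # one or more digits
    \.      # a dot
    \d{2}   # exactly two digits
    ''', re.VERBOSE)

def extract_amounts(lines):
    """Yield the first amount found on each line; lines without one are skipped."""
    for line in lines:
        found = AMOUNT.search(line)
        if found:
            yield found.group(0)

# Any iterable of lines works, including an open file object:
#     with open('file100') as f:
#         for amount in extract_amounts(f):
#             print(amount)
sample = ["Hi this doller 100.00\n", "Hi this is doller 200.00\n"]
amounts = list(extract_amounts(sample))
print(amounts)   # ['100.00', '200.00']
```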
Russia Must Remove Putin
  • Briefly and nicely explained as well, I am newbie so still in learning process for python. – Karn Kumar Dec 21 '15 at 21:47
  • If you like it, you can now upvote it, and if it best answers your question, you can accept it as well. – Russia Must Remove Putin Dec 21 '15 at 21:49
  • @Aaron - Why are we using `re.compile`? Is it necessary, or can we drop it? What is it for? I would also like to know about the "context manager" you mentioned in your earlier comment. Are you talking about memory context? – Karn Kumar Dec 21 '15 at 22:34
  • @pygo regular expression strings have to be compiled before they can be used in a search. If you don't use a pre-compiled regex, then the semantics are that you are recompiling the search on every loop. It is a best practice to separate and lift redundant code out of inner loops. It does not add complexity to your code, in fact it reduces it, in the context of a large program. – Russia Must Remove Putin Dec 21 '15 at 23:01
-1

There are a couple of problems here:

  1. .match only looks for matches at the beginning of a string -- see search() vs. match().

  2. You're not using capture groups, so there is no group 1 for myre.group(1) to return (calling it raises IndexError). Use group(0) for the whole match.
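
The second point can be checked in isolation. This is a small sketch on the question's sample line; the variable names are illustrative only:

```python
import re

line = "Hi this doller 100.00"

# No parentheses in the pattern, so there are no capture groups.
m = re.search(r'\s\d*\.\d{2}', line)
print(repr(m.group(0)))   # ' 100.00', the whole match
print(m.groups())         # ()

try:
    m.group(1)
    raised = False
except IndexError:
    raised = True
print(raised)             # True

# With parentheses around the number, group(1) holds just the captured part.
m2 = re.search(r'\s(\d*\.\d{2})', line)
print(m2.group(1))        # 100.00
```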

Here's an updated sample:

import re

file = """
Hi this doller 100.00
Hi this is doller 200.00
"""

for digit in file.splitlines():
    myre = re.search(r'\s\d*\.\d{2}', digit)  # raw string keeps the backslashes intact
    if myre:
        print(myre.group(0))
Manu Phatak