
I have an input string:

"[u'$799,900', u'$1,698,000', u'$998,000', u'$1,299,000', u'$1,000,000', u'$499,950', u'$995,000', u'$998,000', u'$2,000,000', u'$988,000', u'$979,000', u'$1,285,000', u'$988,000', u'$579,000', u'$700,000', u'$1,100,000', u'$1,557,000', u'$999,888', u'$798,000', u'$998,000', u'$1,050,000', u'$888,000', u'$559,888', u'$774,900', u'$795,000', u'$850,000']","[u'3 bds ', u' 2 ba ', u' 1,361 sqft', u'4 bds ', u' 3 ba ', u' 2,845 sqft', u'3 bds ', u' 3 ba ', u' 1,534 sqft', u'3 bds ', u' 2 ba ', u' 1,762 sqft', u'5 bds ', u' 3 ba ', u' 2,398 sqft', u'2 bds ', u' 2 ba ', u' 956 sqft', u'4 bds ', u' 3 ba ', u' 1,840 sqft', u'3 bds ', u' 2 ba ', u' 1,212 sqft', u'3 bds ', u' 3 ba ', u' 1,878 sqft', u'3 bds ', u' 2 ba ', u' 1,240 sqft', u'3 bds ', u' 2 ba ', u' 1,207 sqft', u'3 bds ', u' 3 ba ', u' 1,905 sqft', u'3 bds ', u' 3.5 ba ', u' 1,591 sqft', u'2 bds ', u' 2 ba ', u' 946 sqft', u'2 bds ', u' 2 ba ', u' 1,067 sqft', u'4 bds ', u' 3 ba ', u' 2,254 sqft', u'5 bds ', u' 4 ba ', u' 2,744 sqft', u'3 bds ', u' 3 ba ', u' 1,291 sqft', u'4 bds ', u' 3 ba ', u' 1,480 sqft', u'3 bds ', u' 2 ba ', u' 1,513 sqft', u'4 bds ', u' 2 ba ', u' 1,846 sqft', u'9 bds ', u' 5 ba ', u' 3,336 sqft', u'2 bds ', u' 2 ba ', u' 983 sqft', u'4 bds ', u' 3 ba ', u' 1,476 sqft', u'3 bds ', u' 3 ba ', u' 1,872 sqft', u'2 bds ', u' 3 ba ', u' 1,459 sqft']"

From it, I need to extract the prices into a list of ints.

This is what I have tried so far:

import re

pattern_price = r'\[u\'\$.*?\]'
patternx = r"(.*?u.*?)(\d+\,\d+\,\d+|\d+\,\d+)"

with open(fpath, "r") as f:
    for line in f.readlines():
        lst = re.findall(pattern_price, line)      

    print len(lst) # I get list with 1 element?

    newlst = [x.split(patternx) for x in lst]
    print len(newlst) # I got 1 element again?

Answers to similar questions didn't help me: Link1 Link2

oneday
  • Please provide the original string, pretty sure there are ways to split it directly. – Jan Jul 05 '16 at 08:28
  • I'm pretty sure that `|` is a typo, it should be `,`. Other than that, I'm not really sure what you are trying to do. Can you post examples of input and expected output? – Božo Stojković Jul 05 '16 at 08:32
  • @Slayther - It's not a typo - idea is to grep values in hundreds of thousands and millions - I checked it on regex101 with sample string and it works- I have posted the example of expected output and called it newlst - not sure what is missing - input string is lst - expected output is newlst – oneday Jul 05 '16 at 08:41
  • The expected output doesn't have 5 entries, it has 13. Unless that is a typo – Božo Stojković Jul 05 '16 at 08:43
  • @Slayther - I have made it in "BOLD" for convenience of reading input lst and expected output lst – oneday Jul 05 '16 at 08:46
  • `newlst = [799,900, 1,698,000, 998,000, 1,299,000,1,000,000]` has 13 entries. Is that what you really want to do? Or is that typo? – Božo Stojković Jul 05 '16 at 08:46
  • I think ur holding onto assumption of splitting with ',' - that is not the intention - In this case resultant list should have 5 elements – oneday Jul 05 '16 at 08:49
  • Alright, so that is a typo. – Božo Stojković Jul 05 '16 at 08:50
  • Can someone suggest me what is so wrong with question to downvote it ? I have given the entire string - which gives just 1 element and I want to split that list - I have shown what I have done and references I have gone through ? What is so bad about the question ? – oneday Jul 05 '16 at 08:52
  • It is very hard to understand what is the problem and expected solution. – Božo Stojković Jul 05 '16 at 08:54
  • @Slayther hmm ..If on the given entire string if you perform re.findall with step as mentioned - you would get a list with only 1 element - which has all values .. – oneday Jul 05 '16 at 08:56
  • How would it be a list with only 1 element? Aren't you asking for 5 elements? – Božo Stojković Jul 05 '16 at 08:57
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/116444/discussion-between-slayther-and-oneday). – Božo Stojković Jul 05 '16 at 08:59

1 Answer


You have several problems in your code.


Create a variable which will hold the values

Unrelated to your current question, but important if you want to build on your solution:

You are iterating over the lines of the file, but you don't keep a variable that accumulates the values across iterations.

You do create a list, but it is re-created inside the for loop on every line.

Hence, you end up with the matches from the last line of the file only; the results from all earlier lines are discarded.

To fix this, create the list before the loop and add to it.

with open(fpath, "r") as f:
    lst = []
    for line in f.readlines():
        lst.append( ... )
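
To make the difference concrete, here is a minimal sketch. It assumes a two-line in-memory "file" built with io.StringIO and a toy digit pattern, neither of which comes from the question. It uses extend rather than append so the result stays a flat list.

```python
import io
import re

# Hypothetical stand-in for the real file: two lines, three digits in total.
fake_file = io.StringIO(u"a1 b2\nc3\n")

# Re-assigning inside the loop keeps only the matches from the last line.
for line in fake_file:
    overwritten = re.findall(r"\d", line)

# Creating the list before the loop keeps the matches from every line.
fake_file.seek(0)
accumulated = []
for line in fake_file:
    accumulated.extend(re.findall(r"\d", line))

print(overwritten)   # ['3']
print(accumulated)   # ['1', '2', '3']
```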

The price pattern

Your pattern matches the entire bracketed list of prices in one go, from the opening [u'$ to the closing ]. That is why you are getting only 1 match, not 1 match for every price.

To capture each price separately, you could use the following regex:

'''
\$             # Make sure the numbers start with dollar sign (Has to be escaped as it is special sign)
(              # Start capturing group, this is what we want as output
    [\d,]      # Match either a digit (0-9) or a comma ','
    {7,11}     # Match the previous expression 7 to 11 times, getting '100,000' up to '100,000,000'
)              # End the capturing group
'''
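
As a rough illustration, the commented pattern can be compiled with re.VERBOSE and tried on a shortened copy of the input. The name PRICE_RE and the three-item sample are assumptions made for this sketch.

```python
import re

# re.VERBOSE lets the pattern carry whitespace and comments.
PRICE_RE = re.compile(r"""
    \$              # literal dollar sign
    ([\d,]{7,11})   # capture 7 to 11 digits or commas, e.g. '799,900'
""", re.VERBOSE)

# Shortened sample of the input line (illustration only).
sample = "[u'$799,900', u'$1,698,000', u'$998,000']"

# With a single capturing group, findall returns just the captured text.
matches = PRICE_RE.findall(sample)
print(matches)  # ['799,900', '1,698,000', '998,000']
```

Note that the {7,11} bound means a price below 100,000 (such as $95,000) would not match; loosen the lower bound if such values can occur.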

Splitting a string by a regex

You are trying to split a string by a regex:

x.split(patternx)

str.split does not interpret its argument as a regex; it treats it as a literal separator string.

So it searches for that literal text, finds no occurrence, and returns a list containing the whole string unchanged.

To split on a regex, use re.split instead.
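
A small sketch of the difference, using a made-up whitespace pattern and a fragment of the listing text (both chosen only for illustration):

```python
import re

text = "3 bds 2 ba 1,361 sqft"
pattern = r"\s+"  # hypothetical pattern: one or more whitespace characters

# str.split looks for the literal characters backslash, 's', '+',
# which never occur in the text, so nothing is split.
as_literal = text.split(pattern)

# re.split interprets the same string as a regex.
as_regex = re.split(pattern, text)

print(as_literal)  # ['3 bds 2 ba 1,361 sqft']
print(as_regex)    # ['3', 'bds', '2', 'ba', '1,361', 'sqft']
```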


Extracting numbers from strings

Finally, you are left with strings that you have to convert to numbers and add to the list.

To do this, iterate over the list returned by re.findall, strip the commas from each string, and convert the result to int.

prices = re.findall(pattern, line)
for price in prices:
    number = int(price.replace(',', ''))  # '1,698,000' -> 1698000
    lst.append(number)
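
The same conversion can also be written as a list comprehension. The helper name to_int and the sample strings below are assumptions for this sketch, not part of the original code.

```python
def to_int(price):
    """Convert a price string such as '1,698,000' to the int 1698000."""
    return int(price.replace(",", ""))

# Strings in the shape that re.findall returns for the price pattern.
prices = ['799,900', '1,698,000', '998,000']

numbers = [to_int(p) for p in prices]
print(numbers)  # [799900, 1698000, 998000]
```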

Final code

import re

pattern = r'\$([\d,]{7,11})'

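with open(fpath, "r") as f:
    lst = []  # accumulates the prices from every line
    for line in f:  # a file object can be iterated line by line directly
        prices = re.findall(pattern, line)  # e.g. ['799,900', '1,698,000', ...]
        for price in prices:
            number = int(price.replace(',', ''))  # drop the thousands separators
            lst.append(number)
    print(lst)  # the parenthesised form works in both Python 2 and 3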
Božo Stojković