I am trying to write a function that applies regular expressions conditionally. I want to extract attribute information about a product, and I have generalized a few different patterns that could help me extract the data.

The working code that I have thus far is:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys
import re
import string  # needed for string.punctuation below


filename = '/PATH/TO/dataFILE'
with open(filename) as f:
    for line in f:
        m0 = re.compile('[a-z-A-Z-0-9--]+\s\([a-z-A-Z]+,\s[-0-9-]+\)')
        m1 = re.compile('[a-z-A-Z-0-9--]+\s\([0-9-]+,\s[a-z-A-Z-]+\)')
        if m0.findall(line):
            matching_words = m0.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Product: ' + cleanwords[1] +'\n' + 'Attribute: '+cleanwords[0]

Up to this point the code works and prints the expected output. The problem appears when I add the elif branch:

        elif m1.findall(line):
            matching_words = m1.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Product: ' + cleanwords[2] +'\n' + 'Attribute: '+cleanwords[0]

Here is an example of the data files I am working with (parallel dummy data):

The cellphone DeluxeModel (Samsung, 2007) is the best on the market. It is possible that the LightModel (Apple, 2010) is also relevant. It has been said that NewModel (1997,Blackberry) could also be useful - but I don't know.

The desired result is

Company: Samsung Product: DeluxeModel
Company: Apple Product: LightModel
Company: Blackberry Product: NewModel

I have already consulted HERE and HERE regarding cascading and grouping methods for what I am trying to implement, but I am unable to see why my implementation is incorrect. Is there a way for me to adapt my code to provide the desired result?

UPDATED CODE

I have been trying different modifications and have been able to output results. However, each time I add a new condition, the results become more restricted. Is there any way this can be optimized?

filename = '/PATH/TO/DATA'
with open(filename) as f:
    for line in f:
        m0 = re.compile('[a-z-A-Z-0-9--]+\s\([a-z-A-Z-0-9--]+,\s[a-z-A-Z-0-9--]+\) | [a-z-A-Z-0-9--]+\s\([A-Z][a-z-]+\)' )
        m1 = re.compile('[a-zA-Z0-9-]+\s\(>[0-9]+.[0-9]\%,\s[a-zA-Z0-9-]+\)')
        m2 = re.compile('[a-zA-Z0-9-]+\s\([a-zA-Z0-9-]+\),\s>[0-9]+.[0-9]\%')
        if m0.findall(line):
            matching_words = m0.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Company: ' + cleanwords[1] +'\n' + 'Product: '+cleanwords[0]
        if m1.findall(line):
            matching_words = m1.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Company: ' + cleanwords[2] +'\n' + 'Product: '+cleanwords[0]
        if m2.findall(line):
            matching_words = m2.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Company: ' + cleanwords[1] +'\n' + 'Product: '+cleanwords[0]
owwoow14
  • How about this? https://regex101.com/r/8PLa8K/2 – Mohammad Yusuf Dec 12 '16 at 17:12
  • @MohammadYusufGhazi thank you for the link, however, the problem that I am having is more related to `python` than the specific `regex` – owwoow14 Dec 12 '16 at 17:17
  • @MohammadYusufGhazi That pattern will capture *any* word followed by a word in parentheses, which is very common in writing. – moogle Dec 12 '16 at 17:20
  • @owwoow14 anything preventing you from looping over `[m0, m1]` and trying each regex in turn until a match, then use the rest of the code once? – Jon Clements Dec 12 '16 at 17:20
  • @JonClements My concern (and hence my choice of `if...elif`) is that the different patterns I am extracting need to be ordered differently when printing the output. For instance, in some cases the information within the parentheses is `('company', 'year')` and in other cases `('year', 'company')`, so the fields sit at different index positions in the output. However, I am not an expert and I could be wrong with my `if...elif` solution; any other approach that is more suitable is welcome – owwoow14 Dec 12 '16 at 17:23
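
One way to combine the loop suggested in these comments with the differing field order is to give every pattern the same named groups, so the printing code never relies on positions. The following is only a sketch under assumptions: the `PATTERNS` list, the `extract` helper, and the group names `product` and `company` are hypothetical and not part of the original code, and the two patterns cover only the shapes seen in the dummy data.

```python
import re

# Hypothetical pattern table: each pattern exposes the same named groups,
# so the output code does not depend on where a field sits in the match.
PATTERNS = [
    # Shape: Product (Company, 2007)
    re.compile(r'(?P<product>[A-Za-z0-9-]+)\s\((?P<company>[A-Za-z-]+),\s?\d+\)'),
    # Shape: Product (1997, Company); the space after the comma is optional
    re.compile(r'(?P<product>[A-Za-z0-9-]+)\s\(\d+,\s?(?P<company>[A-Za-z-]+)\)'),
]


def extract(line):
    """Return (company, product) pairs found by any pattern, in text order."""
    hits = []
    for pattern in PATTERNS:
        for m in pattern.finditer(line):
            hits.append((m.start(), m.group('company'), m.group('product')))
    hits.sort()  # order by position in the line, not by pattern
    return [(company, product) for _, company, product in hits]


line = ("The cellphone DeluxeModel (Samsung, 2007) is the best on the market. "
        "It is possible that the LightModel (Apple, 2010) is also relevant. "
        "It has been said that NewModel (1997,Blackberry) could also be useful "
        "- but I don't know.")

for company, product in extract(line):
    print('Company: {} Product: {}'.format(company, product))
```

Because every pattern is tried on every line, a match from one pattern does not hide matches from another, and supporting a new shape should only require appending one more compiled pattern to the list.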

1 Answer

Use a single regex and the if...elif is unnecessary.

import re

line = 'The cellphone DeluxeModel (Samsung, 2007) is the best on the market. It is possible that the LightModel (Apple, 2010) is also relevant. It has been said that NewModel (1997,Blackberry) could also be useful - but I don\'t know.'
# Product word, "(", an optional "digits," prefix, an optional space, then the company word
t = re.compile(r'(\w+)\s\((\d+,)?\s?(\w+)')
q = t.findall(line)
for match in q:
    # match[0] is the product, match[1] the optional year prefix, match[2] the company
    print('Company: {} Product: {}'.format(match[2], match[0]))

Outputs:

Company: Samsung Product: DeluxeModel
Company: Apple Product: LightModel
Company: Blackberry Product: NewModel
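
As for why the original if...elif never reports NewModel, two things seem to be at play, assuming the sample text sits on a single line of the file: the first pattern already matches on that line, so the `elif` branch is skipped entirely, and the second pattern requires whitespace after the comma, which `(1997,Blackberry)` does not have. The check below is a rough sketch; `m0` and `m1` are simplified but equivalent forms of the question's first two patterns, and `m1_relaxed` is a hypothetical variant added for illustration.

```python
import re

# Simplified, equivalent forms of the question's first m0 and m1
# (raw strings, redundant hyphens dropped from the character classes).
m0 = re.compile(r'[A-Za-z0-9-]+\s\([A-Za-z-]+,\s[0-9-]+\)')
m1 = re.compile(r'[A-Za-z0-9-]+\s\([0-9-]+,\s[A-Za-z-]+\)')
# Hypothetical relaxed variant: whitespace after the comma is optional.
m1_relaxed = re.compile(r'[A-Za-z0-9-]+\s\([0-9-]+,\s?[A-Za-z-]+\)')

line = ("The cellphone DeluxeModel (Samsung, 2007) is the best on the market. "
        "It is possible that the LightModel (Apple, 2010) is also relevant. "
        "It has been said that NewModel (1997,Blackberry) could also be useful "
        "- but I don't know.")

found_by_m0 = m0.findall(line)
found_by_m1 = m1.findall(line)
found_by_m1_relaxed = m1_relaxed.findall(line)

# A non-empty result here means an `elif` on the same line is never evaluated.
print(found_by_m0)
# Empty: the sample has no space after the comma in "(1997,Blackberry)".
print(found_by_m1)
# The relaxed variant picks up the remaining product.
print(found_by_m1_relaxed)
```

Replacing `elif` with independent `if` checks, or using one pattern that tolerates both orders as above, avoids the first problem; making the whitespace optional addresses the second.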
depperm
  • Thank you for the insight. I wanted to incorporate `if...elif` in order to be flexible. I am generalizing over relevant patterns that I find in the text, and would like to be able to incorporate them as an additional condition if necessary. – owwoow14 Dec 12 '16 at 17:19
  • Also, just to add as a follow up to a comment above, the order of the data in the patterns is not always the same, and when outputting, for instance `format(match[2],match[0]))` would be subjective according to the pattern that extracted the information. – owwoow14 Dec 12 '16 at 17:24
  • depperm, your pattern wouldn't match if a space exists after the comma when the parentheses phrase starts with a number. It needs a `\s?` after the `?` – moogle Dec 12 '16 at 17:25
  • if you look at the blackberry example, if the `(\d+,)?` is not found the `findall(line)` returns an empty string so the order is fine – depperm Dec 12 '16 at 17:25
  • Yes, but it still only works for this one particular pattern with parentheses as a border. I really was hoping for something more dynamic where I can change and add more patterns as necessary. – owwoow14 Dec 12 '16 at 17:53
  • @owwoow14 can you expand on what types of patterns could potentially occur? – depperm Dec 12 '16 at 18:45
  • @depperm yes, for instance another type of punctuation instead of parentheses, or no information within the parentheses, which means that I need to do something else. It can vary a lot, which is why I wanted to be able to handle it with conditional statements, if that makes sense. I tried to demonstrate this in the code provided above – owwoow14 Dec 12 '16 at 19:14
  • if you're using regex you should be expecting a certain pattern, otherwise regex is useless – depperm Dec 12 '16 at 19:47