I am trying to write a function that applies regular expressions conditionally. I want to extract attribute information about a product, and I have generalized a few different patterns that could help me extract the data.
The working code that I have thus far is:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys
import re
import string  # needed for string.punctuation below

filename = '/PATH/TO/dataFILE'
with open(filename) as f:
    for line in f:
        # m0: Name (Company, year); m1: Name (year, Company)
        m0 = re.compile('[a-z-A-Z-0-9--]+\s\([a-z-A-Z]+,\s[-0-9-]+\)')
        m1 = re.compile('[a-z-A-Z-0-9--]+\s\([0-9-]+,\s[a-z-A-Z-]+\)')
        if m0.findall(line):
            matching_words = m0.findall(line)
            for word in matching_words:
                # split the match on whitespace and trim brackets and commas
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Product: ' + cleanwords[1] +'\n' + 'Attribute: '+cleanwords[0]
Up to this point the code works and prints the expected output. The problem starts when I add the elif branch:
        elif m1.findall(line):
            matching_words = m1.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Product: ' + cleanwords[2] +'\n' + 'Attribute: '+cleanwords[0]
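One possible cause, sketched under the assumption that the whole sample paragraph sits on a single line of the file: with if/elif, the second pattern is only tried on lines where the first one finds nothing, so a line containing both formats reports only the first. The patterns below are simplified spellings of m0 and m1 with the same character classes. The names name_year, year_name, and both helper functions are hypothetical, and the sample line adds a space after the comma in (1997, Blackberry). As written in the dummy data, (1997,Blackberry) has no space, which `,\s` would not match at all.

```python
import re

# Simplified spellings of the question's m0 and m1 (same character classes).
name_year = re.compile(r'[A-Za-z0-9-]+\s\([A-Za-z-]+,\s[0-9-]+\)')
year_name = re.compile(r'[A-Za-z0-9-]+\s\([0-9-]+,\s[A-Za-z-]+\)')

# Assumption: both formats occur on the same line of the input file.
line = ("The cellphone DeluxeModel (Samsung, 2007) is the best on the market. "
        "It has been said that NewModel (1997, Blackberry) could also be useful.")


def matches_if_elif(text):
    """Mimic the if/elif structure: only the first pattern that hits is used."""
    if name_year.findall(text):
        return name_year.findall(text)
    elif year_name.findall(text):
        return year_name.findall(text)
    return []


def matches_independent(text):
    """Try each pattern on its own, so a hit for one never hides the other."""
    return name_year.findall(text) + year_name.findall(text)
```

With this line, matches_if_elif returns only the Samsung entry, while matches_independent returns both entries.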
An example of the data files that I am working with (I provide parallel dummy data):
The cellphone DeluxeModel (Samsung, 2007) is the best on the market. It is possible that the LightModel (Apple, 2010) is also relevant. It has been said that NewModel (1997,Blackberry) could also be useful - but I don't know.
The desired result is:
Company: Samsung Product: DeluxeModel
Company: Apple Product: LightModel
Company: Blackberry Product: NewModel
I have already consulted HERE and HERE regarding cascading and grouping methods for what I am trying to implement, but I am unable to see why my implementation is incorrect. Is there a way for me to adapt my code to provide the desired result?
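A minimal sketch of one way to get the desired result from the dummy data, not necessarily the best fit for the real files: a single pattern with an alternation and named groups accepts either order inside the parentheses, and `\s*` tolerates the missing space in (1997,Blackberry). The names PRODUCT_RE, extract, company_first, and company_last are hypothetical, and the pattern assumes the year is digits only. The code runs on Python 2 and 3.

```python
import re

# Hypothetical combined pattern: a product name followed by either
# "(company, year)" or "(year, company)".
PRODUCT_RE = re.compile(r'''
    (?P<product>[A-Za-z0-9-]+)\s
    \(
    (?:
        (?P<company_first>[A-Za-z-]+),\s*[0-9]+
      |
        [0-9]+,\s*(?P<company_last>[A-Za-z-]+)
    )
    \)
''', re.VERBOSE)


def extract(text):
    """Return (company, product) pairs in order of appearance in text."""
    pairs = []
    for m in PRODUCT_RE.finditer(text):
        # Exactly one of the two company groups is set for any given match.
        company = m.group('company_first') or m.group('company_last')
        pairs.append((company, m.group('product')))
    return pairs


sample = ("The cellphone DeluxeModel (Samsung, 2007) is the best on the market. "
          "It is possible that the LightModel (Apple, 2010) is also relevant. "
          "It has been said that NewModel (1997,Blackberry) could also be useful "
          "- but I don't know.")

for company, product in extract(sample):
    print('Company: ' + company + ' Product: ' + product)
```

Using finditer with named groups avoids splitting the matched text and indexing into cleanwords, so the company no longer depends on its position inside the parentheses.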
UPDATED CODE
I have been trying different modifications, and I have been able to output results. However, each time I add a new condition, the results become more restricted. Is there any way that this can be optimized?
filename = '/PATH/TO/DATA'
with open(filename) as f:
    for line in f:
        m0 = re.compile('[a-z-A-Z-0-9--]+\s\([a-z-A-Z-0-9--]+,\s[a-z-A-Z-0-9--]+\) | [a-z-A-Z-0-9--]+\s\([A-Z][a-z-]+\)' )
        m1 = re.compile('[a-zA-Z0-9-]+\s\(>[0-9]+.[0-9]\%,\s[a-zA-Z0-9-]+\)')
        m2 = re.compile('[a-zA-Z0-9-]+\s\([a-zA-Z0-9-]+\),\s>[0-9]+.[0-9]\%')
        if m0.findall(line):
            matching_words = m0.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Company: ' + cleanwords[1] +'\n' + 'Product: '+cleanwords[0]
        if m1.findall(line):
            matching_words = m1.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Company: ' + cleanwords[2] +'\n' + 'Product: '+cleanwords[0]
        if m2.findall(line):
            matching_words = m2.findall(line)
            for word in matching_words:
                cleanwords = [x.strip(string.punctuation) for x in word.split()]
                if len(cleanwords[0]) > 2:
                    print 'Company: ' + cleanwords[1] +'\n' + 'Product: '+cleanwords[0]
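One way the repeated if blocks could be reorganized, as a sketch under assumptions rather than a drop-in fix: keep the patterns in a list, give each one the named groups company and product, and run every pattern on every line. Adding a format then means appending one entry, and no pattern can hide another. The names RULES and extract_all, the simplified patterns, and the sample line with a percentage are all hypothetical, since the real data is not shown. It is also worth noting that, outside re.VERBOSE mode, the spaces around | in m0 are matched literally, which may be one reason the results shrink.

```python
import re

# Each rule exposes the named groups "company" and "product".
# Supporting a new format means appending one compiled pattern here.
RULES = [
    # Product (Company, 2007)
    re.compile(r'(?P<product>[A-Za-z0-9-]+)\s'
               r'\((?P<company>[A-Za-z-]+),\s*[0-9]+\)'),
    # Product (1997, Company)
    re.compile(r'(?P<product>[A-Za-z0-9-]+)\s'
               r'\([0-9]+,\s*(?P<company>[A-Za-z-]+)\)'),
    # Product (>99.5%, Company)
    re.compile(r'(?P<product>[A-Za-z0-9-]+)\s'
               r'\(>[0-9]+\.[0-9]%,\s*(?P<company>[A-Za-z0-9-]+)\)'),
]


def extract_all(lines):
    """Apply every rule to every line and collect (company, product) pairs.

    Within a line, results are grouped by rule, not by position in the text.
    """
    results = []
    for line in lines:
        for rule in RULES:
            for m in rule.finditer(line):
                results.append((m.group('company'), m.group('product')))
    return results


# Hypothetical input standing in for the lines of the data file.
demo_lines = [
    "DeluxeModel (Samsung, 2007) and NewModel (1997,Blackberry)",
    "PureModel (>99.5%, Acme)",
]

for company, product in extract_all(demo_lines):
    print('Company: ' + company + ' Product: ' + product)
```

Compiling the patterns once, outside the loop over lines, also avoids rebuilding them for every line of the file.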