regex to split %ages and values in python

Question

Hello, I am new to Python and regex. I have a large CSV file with a field, e.g. %age composition, that contains values such as:

'34% passed 23% failed 46% deferred'

How would you split this string so that you get a dictionary object:

{'passed': 34, 'failed': 23, 'deferred': 46} for each row?

I tried this:

for line in csv_lines:
    for match in re.findall('[\d\s%%]*\s', line)

but this only captured the percentage values.

You can take a look at this site to help with regex construction: http://txt2re.com/. Otherwise, please show us your attempts so we can help you improve them, rather than just asking someone to do it for you. — g.d.d.c, Sep 02 '14 at 16:49

score 5 · Accepted Answer · answered Sep 02 '14 at 16:56

And if you still want to go with regular expressions, you can use this one:

(\w+)%\s(\w+)

This matches one or more word characters (\w+, roughly [0-9a-zA-Z_]+), followed by a % sign, a single whitespace character, and one or more word characters. The parentheses capture the number and the label as separate groups.

Demo:

>>> import re
>>> s = '34% passed 23% failed 46% deferred'
>>> pattern = re.compile(r'(\w+)%\s(\w+)')
>>> {name: percent for percent, name in pattern.findall(s)}
{'failed': '23', 'passed': '34', 'deferred': '46'}
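The demo above leaves the percentages as strings, whereas the question asks for integers. A minimal sketch of one way to convert them, assuming every percentage is a whole number (hence \d+ instead of \w+ for the first group). The helper name parse_composition is made up for illustration and is not part of the original answer:

```python
import re

# Digits, a percent sign, whitespace, then the label word.
PATTERN = re.compile(r'(\d+)%\s+(\w+)')

def parse_composition(text):
    """Return {label: percentage as int} for a string like '34% passed 23% failed'."""
    return {label: int(percent) for percent, label in PATTERN.findall(text)}

print(parse_composition('34% passed 23% failed 46% deferred'))
```

Calling the helper once per CSV row would give one dictionary per row. A string with no matches yields an empty dictionary, not an error.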

alecxe

You got it in before I did :) and cleaner too – ashwinjv Sep 02 '14 at 17:04
@Ashwin same here, falsetru made me provide a regex-based approach :) – alecxe Sep 02 '14 at 17:06

score 3 · Answer 2 · edited May 23 '17 at 12:10

You don't need a regular expression:

>>> s = '34% passed 23% failed 46% deferred'
>>> groups = list(zip(*[iter(s.split())]*2))  # list() so the pairs display on Python 3 too
>>> groups
[('34%', 'passed'), ('23%', 'failed'), ('46%', 'deferred')]
>>> {result: int(percent.rstrip('%')) for percent, result in groups}
{'failed': 23, 'passed': 34, 'deferred': 46}

The zip(*[iter(..)]*2) idiom comes from the grouper recipe in the itertools documentation (see also "How does zip(*[iter(s)]*n) work in Python?"):

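from itertools import zip_longest  # Python 3; on Python 2 the name is izip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n  # n references to the same iterator
    return zip_longest(*args, fillvalue=fillvalue)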
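To see why the idiom pairs up consecutive items, it may help to spell out the intermediate steps. This is only an illustrative sketch, and the variable names tokens and pairs are not from the original answer:

```python
s = '34% passed 23% failed 46% deferred'

tokens = iter(s.split())      # one iterator over the six tokens
pairs = [tokens] * 2          # two references to that same iterator
assert pairs[0] is pairs[1]

# zip pulls one item from each argument in turn; since both arguments are
# the same iterator, each output tuple consumes two consecutive tokens.
groups = list(zip(*pairs))
print(groups)
```

Note that plain zip stops at the shortest input, so with an odd number of tokens the trailing one is silently dropped. The grouper recipe pads with fillvalue instead.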

answered Sep 02 '14 at 16:53

falsetru

Might be worth linking to an [explanation](http://stackoverflow.com/questions/2233204/how-does-zipitersn-work-in-python) of the `zip-iter` magic. – DSM Sep 02 '14 at 16:57
@DSM, Thank you for the comment. I updated the answer accordingly. – falsetru Sep 02 '14 at 17:01

score 0 · Answer 3 · edited May 23 '17 at 10:33

Try this:

[EDIT: Added support for a list of words to check, based on the OP's request. Also cleaned up the dictionary-building code, following what alecxe uses here: https://stackoverflow.com/a/25628562/3646530]

import re

data = """34% passed 23% failed 46% deferred 34% checked"""
checkList = ['passed', 'failed', 'deferred', 'checked']
result = {k: v for (v, k) in re.findall(r'(\d{1,3})% (' + '|'.join(checkList) + ')', data)}
print(result) # Python 3
#print result # Python 2.7

Here the regex uses \d{1,3} to catch the percentage integer and passed|failed|deferred|checked to get the type. re.findall returns a list of (value, key) tuples, which the dict comprehension turns into a dictionary.

To build the string 'passed|failed|...', I use the string's .join method to join the words from checkList with a pipe character as the separator.
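If the words in the list might ever contain regex metacharacters (a dot or a plus sign, say), joining them as-is would change the meaning of the pattern. A hedged variant of the code above that passes each word through re.escape first, adds a trailing \b so that only whole words match, and converts the percentages to int. The names check_list and alternation are illustrative:

```python
import re

check_list = ['passed', 'failed', 'deferred', 'checked']

# re.escape guards against words containing regex metacharacters such as '.' or '+'.
alternation = '|'.join(re.escape(word) for word in check_list)
pattern = re.compile(r'(\d{1,3})% (' + alternation + r')\b')

data = '34% passed 23% failed 46% deferred 34% checked'
result = {word: int(percent) for percent, word in pattern.findall(data)}
print(result)
```

For plain alphabetic words like these, re.escape leaves the strings unchanged, so the pattern is the same as the hand-joined one apart from the added \b.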

answered Sep 02 '14 at 17:00

ashwinjv

`\w+` might be better, in case those are just a sample of possible options. Grabbing the immediate word after would be better than a possible list of them. – Sterling Archer Sep 02 '14 at 17:09
So, something like `result = dict([(k,v) for (v, k) in re.findall('(\d{1,3})%\w+', data)])` to grab the words after the %age? – aqoon Sep 03 '14 at 14:15
also how would you make `passed|failed|deferred` be linked to a list with other values to check? – aqoon Sep 03 '14 at 14:24
@aqoon edited my answer to support that. Also \w+ will match all the words after that. But in order to grab it, you have to group it. so using (\w+) will group it and make it accessible. Put the regex in a tool like this: https://www.debuggex.com/ to visualize your groups and test your regex – ashwinjv Sep 03 '14 at 17:40
