identify letter/number combinations using regex and storing in dictionary

Question

import pandas as pd
df = pd.DataFrame({'Date':['This 1-A16-19 person is BL-17-1111 and other',
                          'dont Z-1-12 do here but NOT 12-24-1981',
                          'numbers: 1A-256-29Q88 ok'], 
                  'IDs': ['A11','B22','C33'],
                  })

Using the dataframe above, I want to do the following: 1) use regex to identify every letter + number combination, e.g. 1-A16-19; 2) store the matches in a dictionary.

Ideally I would like the following output (note that 12-24-1981 is intentionally not picked up by the regex because it contains no letter; something like 1A-24-1981 would be picked up):

{1: '1-A16-19', 2: 'BL-17-1111', 3: 'Z-1-12', 4: '1A-256-29Q88'}

Can anybody help me do this?

Use `df['Date'].str.extractall(regex)`. I can't see a pattern from the strings you want. Go to [Regex 101](https://regex101.com) to play with the pattern – Code Different, Aug 26 '19 at 00:05
I guess the pattern would be any combination of letters and digits separated by `-` – Aug 26 '19 at 00:08
`12-24-1981` should not be picked up because it does not have a letter associated with it. If it were `1A-24-1981` then yes, the regex should identify it – Aug 26 '19 at 00:10
If your matches are always `xxx-xxx-xxx` then you could try `\S+-\S+-\S+` – Umar.H, Aug 26 '19 at 00:15
@Datanovice the matches could be any combination of `xx-x-x` or `x-xxx-xx` or `xxx-x-xxx` etc. – Aug 26 '19 at 00:39

Bee · Accepted Answer · 2019-08-26T00:45:16.183

1

This regex might do the trick.

(?=.*[a-zA-Z])(\S+-\S+-\S+)

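It matches any whitespace-delimited token that contains at least two `-` characters. The lookahead `(?=.*[a-zA-Z])` requires a letter somewhere between the match position and the end of the line, so a digits-only token is rejected only when no letter follows it later in the line, which is the case for 12-24-1981 in the sample data.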

regex101 example

For the input you provided, only 1-A16-19, BL-17-1111, Z-1-12 and 1A-256-29Q88 are returned.
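
A minimal sketch of how the pattern could be applied with Python's `re` module, assuming the three sample strings from the question are held in a plain list called `rows` (a name chosen here for illustration). It also compares the pattern with a variant that is my assumption rather than part of the original answer: swapping `.*` for `\S*` in the lookahead, so the letter has to occur inside the token itself.

```python
import re

# Sample strings copied from the question's 'Date' column.
rows = [
    'This 1-A16-19 person is BL-17-1111 and other',
    'dont Z-1-12 do here but NOT 12-24-1981',
    'numbers: 1A-256-29Q88 ok',
]

# Pattern from the answer: the lookahead scans to the end of the line.
line_wide = re.compile(r'(?=.*[a-zA-Z])(\S+-\S+-\S+)')
# Variant (assumption, not from the answer): the lookahead scans only the token.
token_only = re.compile(r'(?=\S*[a-zA-Z])(\S+-\S+-\S+)')

# Collect every match across all rows, then number them starting at 1.
matches = [m for row in rows for m in token_only.findall(row)]
result = dict(enumerate(matches, start=1))
print(result)

# The two patterns differ when a digits-only token is followed by more text.
sample = 'born 12-24-1981 in town'  # hypothetical extra input
print(line_wide.findall(sample))
print(token_only.findall(sample))
```

On the question's data both patterns return the same four tokens. The variant only matters for inputs where a date-like token is followed by further words.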

edited Aug 26 '19 at 00:45

answered Aug 26 '19 at 00:11


What would the full line of code look like? e.g. `x = re.search(\S*-\S*-\S*, dataframe)` – Aug 26 '19 at 00:14
And the code above would inadvertently pick up `12-24-1981` – Aug 26 '19 at 00:21

Umar.H · Answer 2 · 2019-08-26T00:50:47.297

0

You could try:

# Extract every token that matches the pattern and collect the matches in a list.
vals = df['Date'].str.extractall(r'(\S+-\S+-\S+)')[0].tolist()
# Build a list holding the index of each match.
nums = []
for x, y in enumerate(vals):
    nums.append(x)

Pass both lists into a dictionary.

my_dict = dict(zip(nums,vals))
print(my_dict)
 {0: '1-A16-19',
 1: 'BL-17-1111',
 2: 'Z-1-12',
 3: '12-24-1981',
 4: '1A-256-29Q88'}

If you want the index to start at one, pass a start value to the enumerate function.

nums = []  # reset the list before refilling it
for x, y in enumerate(vals, 1):
    nums.append(x)
print(nums)
[1, 2, 3, 4, 5]
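
The output above still contains 12-24-1981. One possible way to drop it, sketched here under the assumption that `df` is the dataframe from the question, is to filter the extracted tokens for at least one letter before numbering them. The names `tokens` and `with_letter` are illustrative, and `dict(enumerate(...))` stands in for the separate `nums` list.

```python
import pandas as pd

# Dataframe copied from the question.
df = pd.DataFrame({'Date': ['This 1-A16-19 person is BL-17-1111 and other',
                            'dont Z-1-12 do here but NOT 12-24-1981',
                            'numbers: 1A-256-29Q88 ok'],
                   'IDs': ['A11', 'B22', 'C33']})

# extractall returns one row per match; column 0 holds the captured group.
tokens = df['Date'].str.extractall(r'(\S+-\S+-\S+)')[0]

# Keep only the tokens that contain at least one letter.
with_letter = tokens[tokens.str.contains(r'[A-Za-z]')]

# Number the remaining matches starting at 1.
my_dict = dict(enumerate(with_letter.tolist(), start=1))
print(my_dict)
```

Filtering after extraction keeps the extraction pattern simple, at the cost of a second pass over the matches.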

edited Aug 26 '19 at 00:50

answered Aug 26 '19 at 00:26


I get an `AttributeError: 'str' object has no attribute 'tolist'` when I enter `vals = df['Date'].str.extract(r'(\S+-\S+-\S+)')[0].tolist()` – Aug 26 '19 at 00:37
What do you get when you print the line without `.tolist()`? – Umar.H Aug 26 '19 at 00:37
I get `'1-A16-19'` – Aug 26 '19 at 00:40
updated my answer - above should work, it works for me anyway. – Umar.H Aug 26 '19 at 00:49
Using `vals = dataframe['Date'].str.extract(r'(\S+-\S+-\S+)')` works! – Aug 26 '19 at 00:49
change `extract` to `extractall` to find all the matches. – Umar.H Aug 26 '19 at 00:51
actually this code is really close but picks up `12-24-1981` which it shouldn't – Aug 26 '19 at 01:03
