import pandas as pd
df = pd.DataFrame({'Date':['This 1-A16-19 person is BL-17-1111 and other',
                          'dont Z-1-12 do here but NOT 12-24-1981',
                          'numbers: 1A-256-29Q88 ok'], 
                  'IDs': ['A11','B22','C33'],
                  }) 

Using the dataframe above, I want to do the following: 1) use a regex to identify every hyphen-separated letter + digit combination, e.g. 1-A16-19; 2) store the matches in a dictionary.

Ideally I would like the following output (note that 12-24-1981 is intentionally not picked up by the regex because it contains no letter; something like 1A-24-1981 would be picked up):

{1: '1-A16-19', 2: 'BL-17-1111', 3: 'Z-1-12', 4: '1A-256-29Q88'}

Can anybody help me do this?

  • Use `Series.str.extractall(regex)`. I can't see a pattern in the strings you want. Go to [Regex 101](https://regex101.com) to play with the pattern – Code Different Aug 26 '19 at 00:05
  • I guess the pattern would be any combination of letters and digits separated by `-` –  Aug 26 '19 at 00:08
  • How about “but not 12-24-1981”? – Code Different Aug 26 '19 at 00:09
  • `12-24-1981` should not be picked up because it does not have a letter associated with it. If it were `1A-24-1981` then yes regex should id it –  Aug 26 '19 at 00:10
  • if your matches are always `xxx-xxx-xxx` then you could try `\S+-\S+-\S+` – Umar.H Aug 26 '19 at 00:15
  • @Datanovice the matches could be any combination of `xx-x-x` or `x-xxx-xx` or `xxx-x-xxx` etc –  Aug 26 '19 at 00:39

2 Answers

This regex might do the trick.

(?=.*[a-zA-Z])(\S+-\S+-\S+)

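It matches any run of non-whitespace characters that contains at least two hyphens. The lookahead `(?=.*[a-zA-Z])` additionally requires a letter somewhere at or after the start of the match. Because `.*` can run past the end of the token, 12-24-1981 is excluded here only because no letter follows it in that string.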

regex101 example

As you can see, for the input you provided only 1-A16-19, BL-17-1111, Z-1-12 and 1A-256-29Q88 are returned.
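For the full line of code, here is a minimal sketch using Python's `re` module on the three strings from the question. It assumes a slightly tighter lookahead, `(?=\S*[A-Za-z])`, instead of the pattern above, so the letter has to sit inside the token itself. The names `TOKEN`, `lines`, `matches` and `result` are only illustrative.

```python
import re

# Assumed tighter pattern (not the one posted above): \S* instead of .*
# keeps the "must contain a letter" check inside the token itself.
TOKEN = re.compile(r'(?=\S*[A-Za-z])\S+-\S+-\S+')

lines = [
    'This 1-A16-19 person is BL-17-1111 and other',
    'dont Z-1-12 do here but NOT 12-24-1981',
    'numbers: 1A-256-29Q88 ok',
]

# Collect every match from every string, in order of appearance.
matches = [m for line in lines for m in TOKEN.findall(line)]

# Number the matches starting at 1, as in the desired output.
result = dict(enumerate(matches, start=1))
print(result)
# {1: '1-A16-19', 2: 'BL-17-1111', 3: 'Z-1-12', 4: '1A-256-29Q88'}
```

With this variant, a string such as 'NOT 12-24-1981 ok' yields no match, even though a letter follows the date.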

Bee
  • what would the full line of code look like? e.g. `x = re.search(\S*-\S*-\S*, dataframe)` –  Aug 26 '19 at 00:14
  • and the code above would inadvertently pick up `12-24-1981` –  Aug 26 '19 at 00:21

You could try:

vals = df['Date'].str.extractall(r'(\S+-\S+-\S+)')[0].tolist() 
# extract your strings based on your condition above and pass to a list.
# make a list with the index range of your matches.
nums = []
for x,y in enumerate(vals):
    nums.append(x)

Then pass both lists into a dictionary:

my_dict = dict(zip(nums,vals))
print(my_dict)
{0: '1-A16-19',
 1: 'BL-17-1111',
 2: 'Z-1-12',
 3: '12-24-1981',
 4: '1A-256-29Q88'}

If you want the index to start at one, you can pass the start value as the second argument to enumerate:

nums = []  # reset the list so the earlier 0-based values are not kept
for x, y in enumerate(vals, 1):
    nums.append(x)
print(nums)
[1, 2, 3, 4, 5]
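Note that the output above still contains 12-24-1981, which the question wanted excluded. A minimal sketch of one way to drop it, assuming a lookahead `(?=\S*[A-Za-z])` is added inside the capture group so that each token must contain a letter (this pattern is an assumption, not part of the answer above); `dict(enumerate(...))` replaces the separate `nums` list:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['This 1-A16-19 person is BL-17-1111 and other',
             'dont Z-1-12 do here but NOT 12-24-1981',
             'numbers: 1A-256-29Q88 ok'],
    'IDs': ['A11', 'B22', 'C33'],
})

# Assumed pattern: the lookahead only accepts tokens that contain a letter.
# extractall needs at least one capture group, hence the outer parentheses.
pattern = r'((?=\S*[A-Za-z])\S+-\S+-\S+)'

# Column 0 holds the captured group; rows are ordered by original row, then match.
vals = df['Date'].str.extractall(pattern)[0].tolist()

# Number the matches from 1 without building a separate index list.
my_dict = dict(enumerate(vals, start=1))
print(my_dict)
# {1: '1-A16-19', 2: 'BL-17-1111', 3: 'Z-1-12', 4: '1A-256-29Q88'}
```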
Umar.H