Search for string pattern in dataframe column, return each occurence and join to another dataframe

Question

I'm trying to do something similar to How to loop through pandas df column, finding if string contains any string from a separate pandas df column?, specifically the second problem

My initial dataframe logs contains the following:

ID DB  USER  MDX_TEXT
1  DB1  JOE  SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS,[ATTR2].[ATTR2].[THE_ATTR2] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]
2  DB1  JAY  SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS, [ATTR3].[ATTR3].[THE_ATTR3] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]

Using regex, I then extract the unique instances of MDX_TEXT per ID

# Step 1: Define a pattern to extract MDX regex pattern
import re
pattern = re.compile (r'(\[[\w ]+\]\.\[[\w ]+\](?:\.(?:Members|\[Q\d\]))?)')

# Step 2: Create a dataframe to store distinct list of MDX Query attributes, based on pattern
extrpat =  (
            logs['MDX_TEXT'].str.extractall(pattern)
                            .drop_duplicates()
                            .to_numpy()
           )
# Step 3: Create a dataframe to store distinct list of attributes used in a query
attr= pd.DataFrame(data=extrpat)
attr.rename({0: 'attrname'}, inplace=True, axis=1)

# Preview the first 5 lines of the attributes dataframe
attr.head()

Which results in:

attrname
[THE_ATTR1]
[THE_ATTR2]
[THE_ATTR3]
[THE_ATTR1]
[THE_ATTR3]

What I would like, is in addition to extracting the unique attributes in step 2, to also extract the ID and USER, like this:

ID  USER  attrname
1  JOE  [THE_ATTR1]
1  JOE  [THE_ATTR2]
1  JOE  [THE_ATTR3]
2  JAY  [THE_ATTR1]
2  JAY  [THE_ATTR3]

and then finally, join the attr and logs dataframes on the ID. The idea is to bring in a third dataframe users:

USER  LOC
JOE  NY
JIL  NJ
MAC  CA

...which I will join with the aforementioned on USER to end up with this:

ID  USER  LOC  attrname
1  JOE  NY  [THE_ATTR1]
1  JOE  NY  [THE_ATTR2]
1  JOE  NY  [THE_ATTR3]
2  JAY  NJ  [THE_ATTR1]
2  JAY  NJ  [THE_ATTR3]

Roy2012 · Accepted Answer · 2020-06-10T10:08:54.947

0

(edited).

The pattern piece is a good start, but then you have to merge / join it with the original dataframe:

df.index.name = "inx"
pattern = re.compile (r'(\[[\w ]+\]\.\[[\w ]+\])')

# extract the attributes. 
extracts = df.MDX_TEXT.str.extractall(pattern).rename(columns={0:"attrname"})

# join the result with the original dataframe. 
res = df.join(extracts).reset_index()[["ID", "USER", "attrname"]].drop_duplicates()

# take just the last part of each attribute name. 
res["attrname"] = res["attrname"].str.split(".", expand = True).iloc[:, -1]

The result is:

   ID USER attrname
0   1  JOE  [ATTR1]
1   1  JOE  [ATTR2]
2   1  JOE  [ATTR3]
3   2  JAY  [ATTR1]
4   2  JAY  [ATTR3]

edited Jun 10 '20 at 10:08

answered Jun 09 '20 at 19:50

Roy2012

11,755
2
22
35

Does this help? – Roy2012 Jun 10 '20 at 05:18
Thanks, but that's only one line in the attribute 'text'. I need to iterate over the whole dataframe, and I'm cautious of iterating over the dataframe. – teflonjon Jun 10 '20 at 09:53
Is the original data a dataframe with three columns? (I somehow missed that). – Roy2012 Jun 10 '20 at 09:54

Search for string pattern in dataframe column, return each occurence and join to another dataframe

1 Answers1