I'm trying to do something similar to How to loop through pandas df column, finding if string contains any string from a separate pandas df column?, specifically the second problem
My initial dataframe logs
contains the following:
ID DB USER MDX_TEXT
1 DB1 JOE SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS,[ATTR2].[ATTR2].[THE_ATTR2] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]
2 DB1 JAY SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS, [ATTR3].[ATTR3].[THE_ATTR3] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]
Using regex, I then extract the unique instances of MDX_TEXT
per ID
# Step 1: Define a pattern to extract MDX regex pattern
import re
pattern = re.compile (r'(\[[\w ]+\]\.\[[\w ]+\](?:\.(?:Members|\[Q\d\]))?)')
# Step 2: Create a dataframe to store distinct list of MDX Query attributes, based on pattern
extrpat = (
logs['MDX_TEXT'].str.extractall(pattern)
.drop_duplicates()
.to_numpy()
)
# Step 3: Create a dataframe to store distinct list of attributes used in a query
attr= pd.DataFrame(data=extrpat)
attr.rename({0: 'attrname'}, inplace=True, axis=1)
# Preview the first 5 lines of the attributes dataframe
attr.head()
Which results in:
attrname
[THE_ATTR1]
[THE_ATTR2]
[THE_ATTR3]
[THE_ATTR1]
[THE_ATTR3]
What I would like, is in addition to extracting the unique attributes in step 2, to also extract the ID
and USER
, like this:
ID USER attrname
1 JOE [THE_ATTR1]
1 JOE [THE_ATTR2]
1 JOE [THE_ATTR3]
2 JAY [THE_ATTR1]
2 JAY [THE_ATTR3]
and then finally, join the attr
and logs
dataframes on the ID
. The idea is to bring in a third dataframe users
:
USER LOC
JOE NY
JIL NJ
MAC CA
...which I will join with the aforementioned on USER
to end up with this:
ID USER LOC attrname
1 JOE NY [THE_ATTR1]
1 JOE NY [THE_ATTR2]
1 JOE NY [THE_ATTR3]
2 JAY NJ [THE_ATTR1]
2 JAY NJ [THE_ATTR3]