1

I'm trying to do something similar to How to loop through pandas df column, finding if string contains any string from a separate pandas df column?, specifically the second problem

My initial dataframe logs contains the following:

ID DB  USER  MDX_TEXT
1  DB1  JOE  SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS,[ATTR2].[ATTR2].[THE_ATTR2] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]
2  DB1  JAY  SELECT [ATTR1].[ATTR1].[THE_ATTR1] ON COLUMNS, [ATTR3].[ATTR3].[THE_ATTR3] ON ROWS FROM [THE_DB] WHERE [ATTR3].[ATTR3].[THE_ATTR3]

Using regex, I then extract the unique instances of MDX_TEXT per ID

# Step 1: Define a pattern to extract MDX regex pattern
import re
pattern = re.compile (r'(\[[\w ]+\]\.\[[\w ]+\](?:\.(?:Members|\[Q\d\]))?)')

# Step 2: Create a dataframe to store distinct list of MDX Query attributes, based on pattern
extrpat =  (
            logs['MDX_TEXT'].str.extractall(pattern)
                            .drop_duplicates()
                            .to_numpy()
           )
# Step 3: Create a dataframe to store distinct list of attributes used in a query
attr= pd.DataFrame(data=extrpat)
attr.rename({0: 'attrname'}, inplace=True, axis=1)

# Preview the first 5 lines of the attributes dataframe
attr.head()

Which results in:

attrname
[THE_ATTR1]
[THE_ATTR2]
[THE_ATTR3]
[THE_ATTR1]
[THE_ATTR3]

What I would like, is in addition to extracting the unique attributes in step 2, to also extract the ID and USER, like this:

ID  USER  attrname
1  JOE  [THE_ATTR1]
1  JOE  [THE_ATTR2]
1  JOE  [THE_ATTR3]
2  JAY  [THE_ATTR1]
2  JAY  [THE_ATTR3]

and then finally, join the attr and logs dataframes on the ID. The idea is to bring in a third dataframe users:

USER  LOC
JOE  NY
JIL  NJ
MAC  CA

...which I will join with the aforementioned on USER to end up with this:

ID  USER  LOC  attrname
1  JOE  NY  [THE_ATTR1]
1  JOE  NY  [THE_ATTR2]
1  JOE  NY  [THE_ATTR3]
2  JAY  NJ  [THE_ATTR1]
2  JAY  NJ  [THE_ATTR3]
teflonjon
  • 9
  • 4

1 Answers1

0

(edited).

The pattern piece is a good start, but then you have to merge / join it with the original dataframe:

df.index.name = "inx"
pattern = re.compile (r'(\[[\w ]+\]\.\[[\w ]+\])')

# extract the attributes. 
extracts = df.MDX_TEXT.str.extractall(pattern).rename(columns={0:"attrname"})

# join the result with the original dataframe. 
res = df.join(extracts).reset_index()[["ID", "USER", "attrname"]].drop_duplicates()

# take just the last part of each attribute name. 
res["attrname"] = res["attrname"].str.split(".", expand = True).iloc[:, -1]

The result is:

   ID USER attrname
0   1  JOE  [ATTR1]
1   1  JOE  [ATTR2]
2   1  JOE  [ATTR3]
3   2  JAY  [ATTR1]
4   2  JAY  [ATTR3]
Roy2012
  • 11,755
  • 2
  • 22
  • 35