
I have two dataframes. I want to reduce the information in the first dataframe, which looks like this:

event_timestamp      message_number  an_robot
2015-04-15 12:09:39  10125            robot_7
2015-04-15 12:09:41  10053            robot_4
2015-04-15 12:09:44  10156            robot_7
2015-04-15 12:09:47  20205            robot_108
2015-04-15 12:09:51  10010            robot_38
2015-04-15 12:09:54  10012            robot_65
2015-04-15 12:09:59  10011            robot_39

My other dataframe looks like this:

sequence             support
10053,10156,20205    0.94783
10010,10012          0.93322

I want to replace all the sequences in dataframe 1 that appear in dataframe 2. So the new dataframe should be:

event_timestamp      message_number    an_robot
2015-04-15 12:09:39  10125              robot_7
2015-04-15 12:09:41  10053,10156,20205  robot_4,robot_7,robot_108
2015-04-15 12:09:51  10010,10012        robot_38,robot_65
2015-04-15 12:09:59  10011              robot_39

Does anybody know how to achieve this? I know how to check whether values match in exactly one row, but not how to compare multiple rows that have to occur directly after each other.

--- EDIT ---

Maybe to make it a bit simpler: it would also be fine to generate a new message_number for each sequence. So the new dataframe could be:

event_timestamp      message_number    an_robot
2015-04-15 12:09:39  10125              robot_7
2015-04-15 12:09:41  1                  robot_4,robot_7,robot_108
2015-04-15 12:09:51  2                  robot_38,robot_65
2015-04-15 12:09:59  10011              robot_39

Here each sequence found in the sequence dataframe would be written as 0, 1, 2, 3, or 4 (and so on, up to the last sequence). I could always update the database that documents the meaning of the message_number codes with these new numbers. It would be great to keep the information about which robot did it, but if that is too complex, then this is fine as well.

intStdu
  • Do you know the rule that groups sequences together in your second dataframe? And do you realize your request implies losing some timestamp data? – FabienP Apr 14 '19 at 16:38
  • Yes, I have calculated these rules via cspade; I have around 200 rules with high confidence. Dataframe 1 is around 4 million rows at full size. – intStdu Apr 14 '19 at 16:39
  • So is it possible to apply these rules to dataframe 1, or is it a requirement to use the intermediate dataframe 2? – FabienP Apr 14 '19 at 16:42
  • Yes, these rules are based on dataframe 1 and are already calculated! – intStdu Apr 14 '19 at 16:57

2 Answers


I am using unnesting on your df2, then mapping the rules back to df to get a group key, and then using groupby with agg:

# df holds the event log (dataframe 1 in the question), df2 holds the sequences (dataframe 2)
df2.sequence = df2.sequence.str.split(',')
s = unnesting(df2, ['sequence'])

# map every message_number that belongs to a rule to that rule's index, leave the rest unchanged
groupkey = df.message_number.map(dict(zip(s.sequence.astype(int), s.index))).fillna(df.message_number)

df.groupby(groupkey).agg({'event_timestamp': 'first',
                          'message_number': lambda x: ','.join(x.astype(str)),
                          'an_robot': ','.join})
                    event_timestamp            ...                               an_robot
message_number                                 ...
0.0             2015-04-15 12:09:41            ...              robot_4,robot_7,robot_108
1.0             2015-04-15 12:09:51            ...                      robot_38,robot_65
10011.0         2015-04-15 12:09:59            ...                               robot_39
10125.0         2015-04-15 12:09:39            ...                                robot_7
[4 rows x 3 columns]

import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each index label once per element of the list column, then flatten the lists
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, axis=1), how='left')
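
For reference, here is a minimal, self-contained run of this approach on the sample data from the question. The construction of the two sample frames and the result name are added for illustration and are not part of the original answer; it assumes the unnesting helper above is already defined.

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'event_timestamp': ['2015-04-15 12:09:39', '2015-04-15 12:09:41', '2015-04-15 12:09:44',
                        '2015-04-15 12:09:47', '2015-04-15 12:09:51', '2015-04-15 12:09:54',
                        '2015-04-15 12:09:59'],
    'message_number': [10125, 10053, 10156, 20205, 10010, 10012, 10011],
    'an_robot': ['robot_7', 'robot_4', 'robot_7', 'robot_108',
                 'robot_38', 'robot_65', 'robot_39']})
df2 = pd.DataFrame({'sequence': ['10053,10156,20205', '10010,10012'],
                    'support': [0.94783, 0.93322]})

# split, unnest, build the group key and aggregate, exactly as above
df2.sequence = df2.sequence.str.split(',')
s = unnesting(df2, ['sequence'])
groupkey = df.message_number.map(dict(zip(s.sequence.astype(int), s.index))).fillna(df.message_number)
result = df.groupby(groupkey).agg({'event_timestamp': 'first',
                                   'message_number': lambda x: ','.join(x.astype(str)),
                                   'an_robot': ','.join})
print(result)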
BENY
  • How did you achieve the unnesting? None of the code in the link is working; eventually all the rows in the dataset get index 0 after I run this code. – intStdu Apr 15 '19 at 15:32
  • @intStdu did you try the function after the split? – BENY Apr 15 '19 at 15:34
  • Yes, I did it after the split. I get the error that zero-dimensional arrays cannot be concatenated. – intStdu Apr 15 '19 at 15:41
  • @intStdu so you have an empty row? Try your own sample and test whether it works on your side; if so, the solution should be OK. You may need to look at your real data to see how it differs from the sample you provided. – BENY Apr 15 '19 at 15:42
  • Could it be that in some cases the message_number has the code 20205_User_defined_event_1 instead of 20205? Because this can happen in some cases. The first row of my dataframe now looks like this: 100000 2014-10-20 03:01:39 3,9,6,3,4, , , , , , ,1,0,0,0,0,0,\n, 3,9,7,9,7... robot_90,robot_90,robot_90,robot_90,robot_90,r... Also, my dataframe is now only 489 rows long, which is exactly the number of unique message_numbers there can be. But I should still have a long dataframe with lots of actions that were not caught in a sequence, so maybe around 100,000 rows should be left. – intStdu Apr 15 '19 at 15:59
  • @intStdu since it has sep=',' it should be fine – BENY Apr 15 '19 at 16:05

If you want to keep it simple, the following is a bit long, but it is a clean workflow that reads like a data pipeline.

import io
import pandas as pd

df_str_2 = """sequence|support
10053,10156,20205|0.94783
10010,10012|0.93322"""

df_2 = pd.read_csv(io.StringIO(df_str_2), sep='|')

# step 1: transform the df 2
# add an id column
df_2["_id"] = df_2.index + 1 
# split sequence to list
df_2["sequence"] = df_2.sequence.apply(lambda x: x.split(",") if isinstance(x, str) else [])

# put each item from the list into its own row
trns_df_2 = (
    df_2.sequence.apply(pd.Series)
    .merge(df_2, right_index=True, left_index=True)
    .drop(["sequence"], axis=1)
    .melt(id_vars=['support', '_id'], value_name="message_number")
    .drop(["variable", "support"], axis=1)
    .dropna()
    .sort_values("_id", ascending=True)
)
# step 2: merge with df 1
df_str_1 = """event_timestamp|message_number|an_robot
2015-04-15 12:09:39|10125|robot_7
2015-04-15 12:09:41|10053|robot_4
2015-04-15 12:09:44|10156|robot_7
2015-04-15 12:09:47|20205|robot_108
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65
2015-04-15 12:09:59|10011|robot_39"""

df_1 = pd.read_csv(io.StringIO(df_str_1), sep='|')
df_1["message_number"] = df_1.message_number.astype(str)

merged_df = df_1.merge(trns_df_2, on="message_number", how="left")

# take only the rows that matched a sequence, group them by id and collect the other columns into lists
main_df_inner = (
    merged_df[merged_df["_id"].notnull()]
    .groupby("_id")
    .agg({"event_timestamp": lambda x: list(x),
          "message_number": lambda x: list(x),
          "an_robot": lambda x: list(x)})
    .reset_index()
    .drop("_id", axis=1)
)

# keep the first timestamp of each group and join the remaining list columns into comma-separated strings
main_df_inner["event_timestamp"] = main_df_inner.event_timestamp.apply(lambda x: x[0])
main_df_inner["message_number"] = main_df_inner.message_number.apply(lambda x: ",".join(x))
main_df_inner["an_robot"] = main_df_inner.an_robot.apply(lambda x: ",".join(x))

# keep the rows that did not match any sequence (the left-only part)
main_df_left = merged_df[merged_df["_id"].isnull()].drop("_id", axis=1)

# concatenate both parts to make the final df
main_df = pd.concat([main_df_left, main_df_inner])

What is left is to convert the event_timestamp column to datetime using pd.to_datetime and to order the dataframe by event_timestamp. I think you can do that yourself.
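
As a hedged sketch of that remaining step (main_df and event_timestamp are the names used above; restoring chronological order with a plain sort is an assumption about what is wanted):

# convert the timestamps back to datetime and restore chronological order
main_df["event_timestamp"] = pd.to_datetime(main_df["event_timestamp"])
main_df = main_df.sort_values("event_timestamp").reset_index(drop=True)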

DataPsycho
  • This works great on the first rows, but eventually it gets stuck on the line where merged_df is created; my kernel dies in Jupyter. Is there a more efficient way to perform this operation? I already tried dask, but that failed as well. My dataset is only 138 MB when storage is optimized with dtypes, so I don't understand why my laptop would fail. – intStdu Apr 15 '19 at 15:34
  • I notice that most of the rows have far more message_numbers in a row than is possible. The maximum number of items in one sequence of the sequence dataframe is 5. What do you think could be the issue here? – intStdu Apr 15 '19 at 17:23
  • Hi, you can put message_number as a sorted index in both dataframes and join over the index, using `join` or `merge`. I am not sure whether there are optimization issues in pandas when joining over a non-index column. – DataPsycho Apr 16 '19 at 07:46
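
A possible sketch of that last suggestion, reusing the variable names from the answer above; indexing both frames on message_number before joining is my interpretation of the comment and has not been benchmarked on the full 4-million-row data:

# set message_number as a sorted index on both frames, then join on the index
df_1_indexed = df_1.set_index("message_number").sort_index()
trns_df_2_indexed = trns_df_2.set_index("message_number").sort_index()
merged_df = df_1_indexed.join(trns_df_2_indexed, how="left").reset_index()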