I have two csv files imported as dataframes A
and C
. I want to match the strings of column content
with the entry in data.data
that contains the string from A
.
A time_a content C time_c data.data
100 f00 400 otherf00other
101 ba7 402 onlyrandom
102 4242 407 otherba7other
409 other4242other
Should become:
time_a time_c content
100 400 f00
101 407 ba7
102 409 4242
My solution below uses iterators. But it works too slow. This answer explains why and gives methods how to improve. But I struggle to implement any.
How can I do this with the optimized methods from pandas?
# reset_index() on both df
df_alsa_copy = df_alsa.copy() # Never modify your iterator
df_alsa_copy['cap_fno'] = -1
for aIndex, aRow in df_alsa.iterrows():
for cIndex, cRow in df_c.iterrows():
if str(aRow['content']) in str(cRow['data.data']):
df_alsa_copy.loc[aIndex, 'cap_fno'] = df_c.loc[cIndex, 'frame.number']
# https://stackoverflow.com/questions/31528819/using-merge-on-a-column-and-index-in-pandas
# Merge on frame.number column (bc I chose it to be included in alsa_copy as a column)
df_ltnc = pd.merge(df_alsa_copy, df_c, left_on='cap_fno', right_on='frame.number')
Also tried:
- Would work, if there is an exact match: Pandas: Join dataframe with condition.
- I also managed to match my second frame against a known string with
series.str.contains
. - The problem is, I fail to enter a dataframe column to match in
merge on=
. I can only enter a known string. - The same problem arose, when I used
apply
. - I did not succeed with
isin
or similar.
More info:
A
holds timestamped content I fed into the program. C
is a network capture. I want to know the time in between feeding and capture. I assume:
- The string occur in the same order in
A
andC
. - But in
C
there might be lines in between. - Strings represent hex values.
data.data
contains other chars as well as the string I look for.- Maybe I lack the pandas vocabulary, to look for the correct method.