I have two CSV files imported as dataframes A and C. I want to match each string in column content of A with the entry in column data.data of C that contains that string.

A  time_a content    C  time_c data.data
   100    f00           400    otherf00other
   101    ba7           402    onlyrandom
   102    4242          407    otherba7other
                        409    other4242other

Should become:
time_a time_c content
100    400    f00
101    407    ba7
102    409    4242
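
For reference, the two sample frames could be constructed like this (a minimal sketch of the example data above; in practice A and C are read from the CSV files):

import pandas as pd

# example data as shown above
A = pd.DataFrame({'time_a': [100, 101, 102],
                  'content': ['f00', 'ba7', '4242']})
C = pd.DataFrame({'time_c': [400, 402, 407, 409],
                  'data.data': ['otherf00other', 'onlyrandom',
                                'otherba7other', 'other4242other']})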

My solution below uses iterators, but it is too slow. This answer explains why and suggests methods to improve it, but I am struggling to implement any of them.

How can I do this with the optimized methods from pandas?

import pandas as pd

# reset_index() on both df beforehand
# df_alsa corresponds to A above, df_c to C
df_alsa_copy = df_alsa.copy()  # Never modify the frame you are iterating over
df_alsa_copy['cap_fno'] = -1

# Nested iterrows: for every row of A, scan all of C for a substring match
for aIndex, aRow in df_alsa.iterrows():
    for cIndex, cRow in df_c.iterrows():
        if str(aRow['content']) in str(cRow['data.data']):
            df_alsa_copy.loc[aIndex, 'cap_fno'] = df_c.loc[cIndex, 'frame.number']

# https://stackoverflow.com/questions/31528819/using-merge-on-a-column-and-index-in-pandas
# Merge on the frame.number column (because I chose to include it in df_alsa_copy as cap_fno)
df_ltnc = pd.merge(df_alsa_copy, df_c, left_on='cap_fno', right_on='frame.number')

Also tried:

  • Would work if there were an exact match: Pandas: Join dataframe with condition.
  • I also managed to match my second dataframe against a known string with series.str.contains (see the sketch after this list).
  • The problem is that I cannot pass a dataframe column to match against in merge's on=; I can only pass a known string.
  • The same problem arose when I used apply.
  • I did not succeed with isin or similar.
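
To illustrate the str.contains attempt from the list above, a minimal sketch with a hard-coded example value ('f00' is only an illustration; the open problem is doing this for every row of A):

# works when the string to look for is known in advance
mask = C['data.data'].str.contains('f00')
print(C[mask])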

More info:

A holds timestamped content that I fed into the program. C is a network capture. I want to know the time between feeding and capture. I assume:

  • The strings occur in the same order in A and C.
  • But in C there might be lines in between.
  • Strings represent hex values.
  • data.data contains other characters as well as the string I am looking for.
  • Maybe I lack the pandas vocabulary to find the correct method.
Paul Smith
  • Nvm, had no luck on Google. But I found a solution which is a bit faster through SO related questions: https://stackoverflow.com/a/54768170/17834402. Not sure if it is seriously that hard to find or if it was just bad luck. Any advice: delete this question or leave it as a duplicate, so others can find the original faster? – Paul Smith Mar 09 '23 at 11:53
  • Which of the approaches from the link are you referring to? – Jamiu S. Mar 09 '23 at 12:25
  • The new one with apply, bfill, lambda expressions, etc. Unfortunately, I do not fully understand how it works. – Paul Smith Mar 09 '23 at 12:27
  • I can provide an alternative solution without lambda and apply specifically to your use-case – Jamiu S. Mar 09 '23 at 12:32
  • If it is performant, happy to see it. I could compare performance of both solutions then. – Paul Smith Mar 09 '23 at 12:33
  • I'll post an answer in a moment. – Jamiu S. Mar 09 '23 at 12:40

2 Answers


Try this approach using pandas.unique(), pandas.Series.str.contains, and pandas.DataFrame.merge:

import pandas as pd

# unique strings from A to search for in C's data.data
unique_str = A['content'].unique()

# keep only the rows of C whose data.data contains one of the unique strings
matching_rows = C[C['data.data'].str.contains('|'.join(unique_str))]

# extract the matched string and use it as the merge key against A's content
out = pd.merge(matching_rows, A,
               left_on=matching_rows['data.data'].str.extract(f'({"|".join(unique_str)})')[0],
               right_on='content')[['time_a', 'time_c', 'content']]
print(out)

   time_a  time_c content
0     100     400     f00
1     101     407     ba7
2     102     409    4242
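
A possible caveat with real data: str.contains and str.extract here treat the joined strings as a regular expression, so if the content values could contain regex metacharacters, escaping them first keeps the match literal. A sketch of that adjustment:

import re

# escape each unique string so regex metacharacters are matched literally
pattern = '|'.join(re.escape(s) for s in A['content'].astype(str).unique())
matching_rows = C[C['data.data'].str.contains(pattern)]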
Jamiu S.
  • Works on the sample data provided. Still some issues with my real data. Will report on performance once sorted out. – Paul Smith Mar 09 '23 at 13:24
  • This approach is expected to be more efficient, but because I don't know the structure of your real data it becomes almost impossible to help you resolve those issues. – Jamiu S. Mar 09 '23 at 19:38
  • Thanks, but I think I figured it out. TLDR: My data has a problem, not the answer. LONG: You answered what I asked for. Your answer (correctly) assumes that 'unique_str' does not occur in multiple rows in C. However, in my data it does. My other coding approach just took the first appearance of the unique_str, which made it look correct. But your code actually gets the correct result and showed a flaw in my data. – Paul Smith Mar 12 '23 at 16:09
  • Good to know that @PaulSmith – Jamiu S. Mar 12 '23 at 17:26

If you want to improve speed, a different option you might consider is Polars (https://www.pola.rs/). You can install it with pip install polars.

The solution is the same as what @Jamiu proposes; I think he has the right approach. The only difference is Polars instead of Pandas.

I tested the two solutions after multiplying the number of rows by 1000. The Pandas solution takes 400 ms while the Polars one takes 92 ms.

import polars as pl

# convert the data to Polars dataframes
a, c = pl.from_pandas(A), pl.from_pandas(C)

# build a regex of the unique content values, filter C on it, and join on the extracted match
unique_values = f"({a['content'].unique().str.concat('|').item()})"
out = (
    a.join(
        c.filter(pl.col('data.data').str.contains(unique_values)),
        left_on='content',
        right_on=pl.col('data.data').str.extract(unique_values),
    )
)

# convert back to Pandas if needed
out_pandas = out.to_pandas()
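
Since the original goal is the time between feeding and capture, the joined result can then be turned into a latency column; a minimal sketch on the Pandas output, assuming the column names from the question:

# latency between feeding (time_a) and capture (time_c)
out_pandas['latency'] = out_pandas['time_c'] - out_pandas['time_a']
print(out_pandas[['time_a', 'time_c', 'latency']])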
Luca
  • Thanks for the hint, I will keep it in mind when I need more performance. Right now, speed is okay. Jamiu helped get rid of the iterator. That increased performance by some orders of magnitude already. – Paul Smith Mar 12 '23 at 15:24