0

I have table1, which can range from <100 to about 100,000 rows. It contains 22 columns, one of which is a Description containing a string where I want to search some terms.

table1
+----------+--------+--------+--------+-------------+------+-------+-------+
| UniqueId | Cat1Id | Cat2Id | Cat3Id |    Col4     | Col5 | Col21 | Col22 |
+----------+--------+--------+--------+-------------+------+-------+-------+
|        1 |    123 |        |     33 | Description |      |       |       |
|        2 |     11 |     22 |        | Description |      |       |       |
|        3 |     22 |     22 |     33 | Description |      |       |       |
|        4 |     33 |        |    455 | Description |      |       |       |
|        5 |    111 |        |        | Description |      |       |       |
|          |        |        |        |             |      |       |       |
+----------+--------+--------+--------+-------------+------+-------+-------+

table2 below contains substrings in col Term that I want to search in Col4 above based on the Id's. This can contain hundreds of rows.

table2
+--------+--------+--------+------+------+------+
| Cat1Id | Cat2Id | Cat3Id | Term | Val1 | Val2 |
+--------+--------+--------+------+------+------+
|    123 |        |     33 | abc  |  555 |   66 |
|    123 |        |     33 | bca  |  444 |   55 |
|     11 |     22 |        | blah |  888 |   77 |
|     33 |        |    455 | dddd |  999 |   77 |
|     33 |        |    455 | aaaa |  777 |   22 |
|        |        |        |      |      |      |
+--------+--------+--------+------+------+------+

I need to save the Val1 and Val2 for each row in table1. So in the end, I want something like:

table3 - final
+----------+--------+--------+--------+-------------+------+------+------+
| UniqueId | Cat1Id | Cat2Id | Cat3Id |    Col4     | Term | Val1 | Val2 |
+----------+--------+--------+--------+-------------+------+------+------+
|        1 |    123 |        |     33 | Description | abc  |  555 |   66 |
|        1 |    123 |        |     33 | Description | bca  |  444 |   55 |
|        2 |     11 |     22 |        | Description | blah |  888 |   77 |
|        4 |     33 |        |    455 | Description | dddd |  999 |   77 |
|        4 |     33 |        |    455 | Description | aaaa |  777 |   22 |
|          |        |        |        |             |      |      |      |
+----------+--------+--------+--------+-------------+------+------+------+

My plan is to do this:

  1. For each unique Cat1, 2, 3 combo in table2, query table1 for the subset of rows that contain the Cat1Id, Cat2Id, Cat3Id.

  2. Go back to table2 and make a dictionary with the Term as key, and Val1 and Val2 as values.

  3. Iterate over the rows from #1 above, search through each description each term. If I get a hit, then grab that row, assign Val1 and Val2, and append all this data to a new df. Repeat.

Is there a faster way of doing this? If I have 100k rows, in your experience will this go slow or ok? A couple minutes of run time is fine for me.

maxutil
  • 195
  • 3
  • 13
  • Manually iterating over pandas rows is always a last resort. Indeed, you often are better off ditching pandas at that point altogether. However, what you are trying to do is likely better accomplished using built in pandas methods. – juanpa.arrivillaga Mar 26 '19 at 17:42
  • @juanpa.arrivillaga Ok, so I guess I can join table1 and table2. Then see if there's a method that will search each Description for the substrings via the whole column rather than iterating over each one. Not all will hit. So after that, drop any rows that didn't get a match. – maxutil Mar 26 '19 at 17:48

1 Answers1

0

Might this be something you are looking for? You might look into merging 101 on SO for more options (Pandas Merging 101)

mrg = pd.merge(df1, df2[['Cat1Id', 'Cat2Id', 'Cat3Id','Val1', 'Val2']], how='left', left_on=[
               'Cat1Id', 'Cat2Id', 'Cat3Id'], right_on=['Cat1Id', 'Cat2Id', 'Cat3Id'])

mrg.dropna()

    UniqueId   Cat1Id    Cat2Id    Cat3Id       Col4       Val1   Val2 
0  1           123                33         Description   555    66   
1  1           123                33         Description   444    55   
2  2           11       22                   Description   888    77   
4  4           33                 455        Description   999    77   
5  4           33                 455        Description   777    22
Zanshin
  • 1,262
  • 1
  • 14
  • 30
  • Hey thanks for the reply. I ended up doing similar merges as your example above to achieve what I wanted. Merge is much better and faster than iterating. – maxutil Mar 28 '19 at 16:42