I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:

import pandas as pd
df = pd.DataFrame(index=['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
                  columns=['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])

Specifically, I want to apply the custom function edit_distance() or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:

for seq in df.index:
    for seq2 in df.columns:
        df.loc[seq, seq2] = edit_distance(seq, seq2) 

This produces the result I want:

             ae  azde  afgle  arlde  afghijklbcmde
afghijklde    8     7      5      6              3
afghijklmde   9     8      6      7              2
ade           1     1      3      2             10
afghilmde     7     6      4      5              4
amde          2     1      3      2              9

What is a better way to do this, perhaps using applymap()? Everything I've tried with applymap(), apply() or df.iterrows() has returned errors of the kind AttributeError: "'float' object has no attribute 'index'". Thanks.

gnotnek
  • The reason this is slow is because you have many nested Python for loops, not only in your dataframe control flow but in the distance function itself. To speed it up you would want to try to vectorise all of these. Applymap doesn't do that; it just applies the function element-wise anyway. Personally, to really optimise it I'd look at taking advantage of some inherent structure of the words if they were ordered in the index in a clever way. You might even be able to use estimates and considerably reduce the scope of what you are trying to detect. – Attack68 Feb 21 '18 at 21:29
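To illustrate the element-wise point in the comment above: applymap() hands the function each cell's value (NaN, a float, in an empty frame) and never the row or column labels, which is consistent with the AttributeError in the question. A minimal sketch, assuming a three-by-three subset of the question's labels, of how df.apply() can reach the labels through col.name and col.index. It still loops in Python, so it is not vectorised and is not expected to be faster by itself.

```python
import pandas as pd

def edit_distance(s1, s2):
    # Levenshtein distance, same algorithm as the version quoted in this thread.
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        new_distances = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                new_distances.append(distances[i1])
            else:
                new_distances.append(
                    1 + min(distances[i1], distances[i1 + 1], new_distances[-1])
                )
        distances = new_distances
    return distances[-1]

# Subset of the question's labels, chosen only to keep the example small.
df = pd.DataFrame(index=['afghijklde', 'ade', 'amde'],
                  columns=['ae', 'azde', 'afgle'])

# apply() passes each column as a Series: col.name is the column label
# and col.index holds the row labels, so both strings are available.
filled = df.apply(
    lambda col: pd.Series(
        [edit_distance(row, col.name) for row in col.index],
        index=col.index,
    )
)
print(filled)
```

Row and column order follow the original frame, since the result is built from df's own labels.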

2 Answers

Turns out there's an even better way to do this. onepan's dictionary comprehension answer is good, but it returned the df index and columns in a different order when I ran it. A nested .apply() accomplishes the same thing at about the same speed and keeps the row/column order. The key is to not get hung up on naming the df's rows and columns first and filling in the values second. Instead, do it the other way around: initially treat the future index and columns as standalone pandas Series, compute the values, and attach the labels at the end.

series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])

df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
gnotnek

You could use comprehensions, which speed it up ~4.5x on my PC:

first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f:{s:edit_distance(f, s) for s in second} for f in first}, orient='index')

# output
#              ae  azde  afgle  arlde  afghijklbcmde
# ade           1     1      3      2             10
# afghijklde    8     7      5      6              3
# afghijklmde   9     8      6      7              2
# afghilmde     7     6      4      5              4
# amde          2     1      3      2              9

# this matches edit_distance('ae', 'afghijklde') == 8, for example

Note that I used this code for edit_distance (first response in your link):

def edit_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
onepan
  • Thanks, but this seemed to have produced the wrong values for each cell, and three of the cells as NaN even though the pairwise `edit_distance` between them is an integer. – gnotnek Feb 21 '18 at 22:51
  • added output that I get. It matches when I test it - do you have a specific example where it fails? Note I made a typo and corrected the dict comprehension just now – onepan Feb 22 '18 at 00:16
  • actually, I misread your example output. Are you trying to run `edit_distance` on `(index, index)` or `(index, column)`? – onepan Feb 22 '18 at 00:19
  • Thanks, your code works correctly now. I also updated my OQ fixing a typo in the df. Do you mind me asking, why use a dictionary comprehension instead of list comprehension? It took me a while to parse your code and I'm not sure I can conceptually replicate it next time. – gnotnek Feb 24 '18 at 21:53
  • you can do it with list comps too, but using a dict saves you the step of naming indices and columns. The dict in my code is an index associated with the labelled outputs of edit_distance (e.g. `{'ade': {'azde': 2}}`) – onepan Feb 26 '18 at 01:26
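For comparison, here is a rough sketch of the list-comprehension variant mentioned in the last comment. It is not code from either answer: the variable name values is made up for illustration, and edit_distance is repeated only so the snippet runs on its own. The trade-off is the one onepan describes: the nested list carries no labels, so index and columns have to be passed explicitly, which also fixes the row/column order.

```python
import pandas as pd

def edit_distance(s1, s2):
    # Levenshtein distance, same algorithm as the version quoted in the answer.
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        new_distances = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                new_distances.append(distances[i1])
            else:
                new_distances.append(
                    1 + min(distances[i1], distances[i1 + 1], new_distances[-1])
                )
        distances = new_distances
    return distances[-1]

first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']

# One inner list per row label, one entry per column label.
values = [[edit_distance(f, s) for s in second] for f in first]

# The labels are attached here, in exactly the order given.
df = pd.DataFrame(values, index=first, columns=second)
print(df)
```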