0

I have two columns of Russian verbs in Excel, the first being of imperfective verbs and the second of perfective ones. However, they are randomly ordered, i.e. not paired up, whereas I would like for each perfective verb to appear next to its imperfective counterpart (wherever applicable), i.e. I want the perfective verb column sorted according to the first one.

Sadly, I have no idea how to achieve this without looking for each perfective verb's imperfective counterpart individually, and manually pairing them up, which would take dreadfully long, since there are about 1,200 verbs in question. So, I was hoping that Excel has some feature which pairs cells based on the similarity of their content (and since a given perfective verb generally looks very similar to its imperfective counterpart, this could hypothetically work), namely which takes a cell from one column and compares it against all the cells in the other column, to find which one of them is the most similar to it, so that it can put it next to that one.

When I speak of similarity, I am referring to coincidence of characters, namely how many characters are shared between two cells, with attention to character order as well. So, if a given perfective verb is "vybrat'", upon checking, the hypothetical mechanism would locate "vybirat'" as the most similar imperfective verb, having all the characters of "vybrat''" plus one additional one, with no shuffling of order, and the two would be paired. Meanwhile, "vybrat'" wouldn't be erroneously paired with "brat'", since although "brat'" occurs within "vybrat'", there is a difference of two characters, whereas there is a difference of only one character in the case of "vybrat'" with "vybirat'". I hope this makes sense. By the way, if it's relevant, everything is written in the Cyrillic alphabet in the actual Excel sheet.

Mind you, when I speak of pairing based on similarity, I'm not talking just about the cells starting with the same sequence of characters or ending with the same sequence of characters - I'm speaking about overall overlaps, in the centre of cells to. Because, if we go by starting sequences, "propylesosit'" wouldn't be paired with "pylesosit'", and if we go by ending sequences, "utihomirit'" wouldn't be paired with "utihomirivat'", despite the obvious similarities of character sequences in these term pairs.

So, can Excel be smart enough to achieve this? If yes, could someone tell me how I can activate the necessary process? Bear in mind that I don't understand the first thing about Excel (e.g. I can't even sum number cells), so in your responses, please lay out everything in the simplest possible way, telling me what to click where and what to type or copy paste, without expecting me to actually understand anything or figure anything out by myself.

If it's any help, someone I know managed to solve my problem partly, by using something that looks like this: "=VLOOKUP(""&LEFT(A4;LEN(A4)/2)&"";B$1:B$851;1;0)"

Here's how it worked:

бегать - вздохнуть - убегать

брать - вспотеть - выбрать

висеть - встать на колени - свистнуть

воспитывать - выбрать - воспитать

These are the first four rows of the list. The first column is of the imperfective verbs. The second one is of the perfective verbs ordered in a random way in comparison to the imperfective verbs (alphabetized in clusters of 100, and I realise that I could alphabetize them all together easily, alongside the alphabetizedim perfective verbs, but that wouldn't solve the problem, because of the presence of prefixes, which would ruin the matching - furthermore, there are significantly fewer perfective verbs, such that even if there weren't problematic prefixes, the matching would collapse when the first imperfective verb without a perfective counterpart would appear). The third one is of the perfective verb corresponding to the imperfective verb in the first column, after the matching. The first three rows are actually wrong (falsely matched), but in the fourth row, "vospityvat'" is correctly matched with "vospitat'", and it is for "vospitat'" that I provided that alien-looking formula.

Anyway, I'm not satisfied with what the person I know provided me with, because I no longer have access to him, whereas the list of verbs is now much larger than it was (he wrote the formula when I had significantly fewer verbs). Meanwhile, I can't just add the new verbs to the sheet that he has prepared, because that somehow destroys his formula, and the third column becomes filled with #REF!, whatever that may mean.

So, could you assist me? I am not asking that you rely upon the formula given by the person I know - anything effective that you think of would be desirable. Meanwhile, I cannot attach the verb sheet, because there is no attach file button here, as far as I see.

Thank you in advance

1 Answers1

1

As far as I know there isn't any build-in function in Excel with can measure similarity between two words. There are some algoritm with can measure similarity but there are not perfect ( that mean that you still have to check if results are correct). One of them is Levenshtein Distance (you can read here something about it: http://en.wikipedia.org/wiki/Levenshtein_distance).

And here you can find VBA algorithm: Levenshtein Distance in Excel.

Community
  • 1
  • 1
Wiol
  • 11
  • 2
  • Thank you for the reply, but both of the links you've sent me appear as though they were written in Chinese. Can I just copy paste the algorithm that appears in the second link (the Stack Overflow thread) somewhere and press enter, whereupon it will magically work? If so, where should I paste it? If not, could you provide me with instructions? – user3221174 Mar 22 '15 at 15:34
  • Use the last algorithm -> it is the fastest. – Wiol Mar 22 '15 at 18:01
  • To do this - simply copy code to VBA editor (to open press alt+F11). Levenshtein Distance is simply the number of changes that you have to get from one word to other. Example: Levenshtein("monday","mondaya")=1 because you have to remove letter "a" from second word. Levenshtein("monday","mondara")=2 because you have substitute "r" (by "y") and remove "a". – Wiol Mar 22 '15 at 18:14