0

I am not generally a python user. I do most things in R and Stata. However, I cannot find a good semantic similarity package/API in either of those.

I have two data frames in the environment. One is questions, which consists of 2 columns and 3 rows. The other is results, which has 3 columns and 3 rows.

I am trying to compare each question (individually) in the first column of the questions dataframe to all of the questions in the second column. Then I want the output to populate the results dataframe. The function takes two strings as arguments So far my code looks like this:

    for i in range(1, 3):
        results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])

I assume that I am pointing to the data incorrectly, but I really haven't been able to find good documentation about this seemingly straightforward issue.

By comparison, here is my working R code, which uses a slightly different function and only one dataframe.

    for (i in 1:3){
        df[,i+2] <- levenshteinSim(df$yr1questions[i], df$yr2questions)
    }

Any help would be greatly appreciated! I am trying to come up with proof-of-concept code to compare similar survey questions between years based on semantic meaning.

Bob

rtaylor
  • 15
  • 5

2 Answers2

0

Let's try to compare(multiply) every question A to every question B

import pandas as pd
questions = pd.DataFrame(np.arange(6).reshape(3,2), columns=["question_A", "question_B"])

This gives:

   question_A  question_B
0           0           1
1           2           3
2           4           5

Then let's define a compare fonction:

def compare(row):
    return pd.Series(row[0]*questions['question_B'])

results = questions.apply(compare, axis=1)

That gives us:

   0   1   2
0  0   0   0
1  2   6  10
2  4  12  20

As you pointed in the comments, here is a version comparing only two strings at a time:

def compare(row):
    question_a = row[0]
    return pd.Series([liteClient.compare(question_a, question_b) for question_b in questions['question_B']])
x0s
  • 1,648
  • 17
  • 17
  • This worked! Is there a quick way to get the results to fill by column rather than by row? That way the output for col1row1 comparisons would be in col1 of a results dataframe? Then col1row2 results would be in col2 of a results dataframe. This is awesome...I just have a ton to learn about python. – rtaylor May 03 '17 at 17:40
  • I just transposed it with pandas. No biggie! Thanks so much for your help. – rtaylor May 03 '17 at 17:47
  • Great it helped :) Indeed, to get results by column, you can transpose: `results = questions.apply(compare, axis=1).T`. If it is working for your problem, could you mark this answer as accepted ? – x0s May 04 '17 at 08:39
0

Based on what you've put so far here are some issues with what you've written which are understandable from your R programming background:

for i in range(1, 3):

In python 3.x, what this does is create a range object which you can think of as a special type of function (though is really an object that contains iteration properties) that allows you to generate a sequence of numbers with a certain step size (default is 1) exclusively. Additionally you need to know that most programming languages index starting at zero, not 1, and this includes python.

What this range object does here is generate the sequence 1, 2 and that is it.

Your arrays you are using i to index over are not going to index over all indicies. What I believe you want is something like:

 for i in range(3):

Notice how there is only one number here, this defaults to the exclusive maximum of the range, and 0 being the inclusive minimum, so this will generate the sequence of 0,1,2. If you have an array of size 3, this will represent all possible indices for that array.

This next line is a bit confusing to me, since I'm not familiar with R, but I sort of understand what you were trying to do. If I understand correctly you are trying to compare two columns of 3 questions each, and compare each question in column 1 to the questions in column 2, resulting in a 3 x 3 matrix of comparison results which you are trying to store in results?. Assuming the size are already correct (as in results is 3x3) I'd like to explain some peculiarities I see in this code.

    results.iloc[i-1,i] = liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])

with results.iloc[i-1,i] you are going to be indexing by row an column, as in i-1 is the row, and i is the column here. So with out changing range(1,3) this results in the following indexes being accessed, 0,1, 1,2 and that is it. I believe liteClient.compare(...) is supposed to return either a dataframe of 1x3, or a list of size 3 based on what you were trying to do inside of it, this may not be the case however, I'm not sure what object you are using to call that member function, so I don't know where the documentation for that function exists. Assuming it does return a list of size 3 or the dataframe, what you'll need to change the way you are trying to assign the data, via this:

results.iloc[i,:]  = ...

What is happening here is that iloc(...) is taking a row positional argument and a slice positional argument, here you are assigning all the columns in the result matrix at that row to the values returned by compare. With the for statement changes this will iterate over all indices in the dataframe.

 liteClient.compare(questions.iloc[0,i], questions.iloc[:,1])

This line as it currently stands you will be iterating over each column in the first row of questions.iloc, and then comparing them to the second column and all rows of the second questions.iloc.

I believe what you will want to do is change this to:

liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])

What this does is for each i, 0, 1, 2, at column 0, compare this to every row in column 1. If your questions dataframe is actually organized as 2 columns and 3 rows this should work, otherwise you will need to change how you create questions as well.

in all I believe the fixed program should look something like:

for i in range(3):
    results.iloc[i,:] = liteClient.compare(questions.iloc[i,0], questions.iloc[:,1])
Community
  • 1
  • 1
Krupip
  • 4,404
  • 2
  • 32
  • 54
  • Thank you for that very detailed explanation. I think I'm starting to see another problem here. The function is from the cortical.io "retinasdk". I think one issue is that it only takes two strings at a time. So what I really need to do is compare the first observation in the first questions column with the first in the second question column. After that, I'll have to again compare col1row1 this time with col2row2, then col2row3. Then I'll move to compare col1row2 with col2row1, then col2row2, then col2row3. I don't think my code actually does this. There should only be one output per loop. – rtaylor May 03 '17 at 15:40