0

I have a function which split a string into words and then finds the word in a dataframe, if it finds it then it search for that row using for loop, which i dont want to do as it make it too slow with large dataset. i want to use row[value], and dont want to loop through whole df for each matching word.

I am new to python,and i have searched alot for it but could get what i wanted, i found the index.tolist() but i dont want to make a list, i just need the index of the first matching value.

any help or work around would be appreciated.

def cal_nega_mean(my_string):
  mean = 0.00
  mean_tot = 0
  mean_sum = 0.00
  for word in my_string.split():
    if word in df.values: #at this point if it founds then get index, so that i dont have to use  for loop in next line
      for index, row in df.iterrows(): #want to change 
        if word == row.word:   # this part
          if row['value'] < -0.40:
            mean_tot += 1
            mean += row['value']
            break
  if mean_tot == 0:
    return 0
  mean = mean_sum / mean_tot
  return round(mean,2)

example string input, there are more than 300k strings

my_string = "i have a problem with my python code" 
cal_nega_mean(my_string)
# and i am using this to get return for all records
df_tweets['intensity'] = df_tweets['tweets'].apply(lambda row: cal_nega_mean(row))

dataframe to search from

df 

index   word      value  ...

  1     python    -0.56

  2     problem   -0.78

  3     alpha     -0.91

   . . .

 9000   last    -0.41
Ahmed Sunny
  • 2,160
  • 1
  • 21
  • 27
  • 3
    Can you post some data with expected output. I am pretty sure there is a way to accomplish this without looping. – Scott Boston Jan 23 '20 at 18:40
  • 2
    When using `pandas` it's best you try to use a vectorized function to accomplish what you want as it's much faster. If you can show us the input and output it will be more helpful – r.ook Jan 23 '20 at 18:41
  • @ScottBoston added complete code and data sample – Ahmed Sunny Jan 23 '20 at 18:53
  • Have you read the Pandas docs? An explicit loop is rarely the right choice. – AMC Jan 23 '20 at 23:02
  • Does this answer your question? [Select by partial string from a pandas DataFrame](https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe) – AMC Jan 23 '20 at 23:03

3 Answers3

1

You can try using i = df[df.word == word].index[0] to get the index of the first row satisfying the condition df.word == word. Once you have the index, you can slice out the row with df.loc.

def cal_nega_mean(my_string):
    mean = 0.00
    mean_tot = 0
    mean_sum = 0.00
    for word in my_string.split():
        try:
            i = df[df.word == word].index[0]
        except:
            continue
        row = df.loc[i]
        if row['value'] < -0.40:
            mean_tot += 1
            mean += row['value']
            break
    if mean_tot == 0:
        return 0
    mean = mean_sum / mean_tot
    return round(mean,2)
alec_djinn
  • 10,104
  • 8
  • 46
  • 71
  • thats what i was looking for, to get the id of the matching record, i ll try it tomorrow and will see if it works – Ahmed Sunny Jan 23 '20 at 21:12
1

Pandas has some useful text processing functionality that should help you. I suggest you use pd.Series.str.contains().

def cal_nega_mean(my_string):
    words = '|'.join(my_string.split())
    matches = df['word'].str.contains(words, regex=True)
    mask = (df['value'] >= -0.40) & matches # don't need value >= -0.40 if you just drop those rows
    mean_tot = mask.sum()
    mean_sum = df[mask]['value'].sum()
    mean = mean_sum / mean_tot
    return round(mean, 2)

Unrelated, but I also suggest you just drop rows with "value" < -0.40 since you're ignoring them anyways.

I haven't had a chance to test this but it should do the job, and it's vectorized.

  • ok your solution seems to be short and better, i ll try tomorrow, no i am not dropping or ignoring values , its just a sample, the condition is dynamic, it depends on something else – Ahmed Sunny Jan 23 '20 at 21:10
  • your solution was fine and fast, but it gives an error when string contains brackets "(" etc – Ahmed Sunny Jan 24 '20 at 13:39
0

Here's a way using dictionary, you can convert word: value into a key, value store and use it as a lookup:

word_look_up = dict(zip(df['word'], df['value']))


def cal_nega_mean(my_string): 
    mean = 0.0
    mean_tot = 0
    mean_sum = 0.00
    words = [word for word in my_string.split() if word in word_look_up]

    if not any(words): # if no word found
        return 0
    else:
        for word in words:
            value = word_look_up[word]
            if value < -0.40:
                mean_tot += 1
                mean += value
                break

    mean = mean / mean_tot
    return round(mean, 2)


df['intensity'] = df['word'].apply(cal_nega_mean)
YOLO
  • 20,181
  • 5
  • 20
  • 40