Find rows in dataframe column containing questions

Question

I have a TSV file that I loaded into a pandas dataframe for some preprocessing. I want to find out which rows contain a question and output 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:

import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')

Here's a sample of what it looks like:

        QUERY                           FREQ
0       hindi movies for adults         595
1       are panda dogs real             383
2       asuedraw winning numbers        478
3       sentry replacement keys         608
4       rebuilding nicad battery packs  541

After dropping empty rows, duplicates, and the FREQ column (not needed for this), I wrote a simple function that checks whether the QUERY column contains any words that make the string a question:

df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis = 1)

def questions(row):
    questions_list = ["what","when","where","which","who","whom","whose","why","why don't",
                      "how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0

df_test['QUESTIONS'] = df_test.apply(questions, axis=1)

But when I check the new dataframe, the new column is there and every value in it is 0. I'm not sure whether the logic in my function is wrong. I've used something similar on dataframe columns that hold a single word, where a match outputs 1 and anything else outputs 0. The same logic doesn't seem to work when the column contains a phrase or sentence, as in this use case. Any input is really appreciated!
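The all-zero column can be reproduced without pandas. As a minimal sketch, using a shortened list and a sample string made up for illustration: the `in` operator on a list tests whether the whole query equals one of the list items, not whether the query contains one of them.

```python
questions_list = ["what", "how", "?"]
query = "how do you like it"

# List membership compares the whole string against each item.
whole_string_match = query in questions_list

# Substring containment has to be tested item by item.
any_substring_match = any(q in query for q in questions_list)

print(whole_string_match, any_substring_match)
```

This is why the same function appears to work on single-word columns: there the whole cell can equal a list item.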

Question has nothing to do with `machine-learning` - kindly do not spam the tag (removed). — desertnaut, Dec 27 '18 at 19:53
@W-B I get a "error: nothing to repeat at position 107" when I tried this: df_test['QUESTIONS'] = df_test['QUERY'].str.contains("what|when|where|which|who|whom|whose|why|why don't|how|how far|how long|how many|how much|how old|how come|?").astype(int) — mlenthusiast, Dec 27 '18 at 20:17
https://stackoverflow.com/questions/3675144/regex-error-nothing-to-repeat — BENY, Dec 27 '18 at 20:24

Mikhail Stepanov · Answer 1 · 2018-12-27T22:49:41.397

If you want to check whether any string from questions_list occurs as a substring of a string in the dataframe, use the str.contains method:

questions_list = ["what","when","where","which","who","whom","whose","why",
                  "why don't", "how","how far","how long","how many",
                  "how much","how old","how come","?"]

pattern = "|".join(questions_list)  # generate regex from your list 
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)

Simplified example:

df = pd.DataFrame({
             'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'], 
             'ID': [0, 1, 2]})

Create a pattern:

pattern = '|'.join(['what', 'how'])  
pattern                                                                                                                                                                         
Out: 'what|how'

Use it:

df['QUERY'].str.contains(pattern)                                                                                                                                                                  
Out[12]: 
0     True
1     True
2    False
Name: QUERY, dtype: bool

If you're not familiar with regexes, there's a quick python re reference. For the symbol '|', the explanation is:

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way
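One caveat with plain alternation is that it matches substrings anywhere, so `how` also matches inside "show" and `who` inside "whole". As a sketch, assuming only whole words should count, the alternatives can be wrapped in `\b` word boundaries. The sample queries below are made up for illustration.

```python
import pandas as pd

# Hypothetical queries chosen to show the difference.
df_demo = pd.DataFrame(
    {"QUERY": ["tv show times", "how do you like it", "whole wheat bread"]}
)

# Plain alternation: matches the words anywhere, even inside other words.
loose = df_demo["QUERY"].str.contains("what|how|who")

# Word boundaries around a non-capturing group: whole words only.
strict = df_demo["QUERY"].str.contains(r"\b(?:what|how|who)\b")

print(loose.tolist())
print(strict.tolist())
```

The non-capturing group `(?:...)` keeps the `\b` anchors applied to every alternative instead of only the first and last.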

I'm getting a "error: nothing to repeat at position 107" for this approach — mlenthusiast, Dec 31 '18 at 19:31
Is there some special characters in sentences in `question_list`? And what the string at position 107 contains? — Mikhail Stepanov, Dec 31 '18 at 21:48
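The error reported in the comments is consistent with the "?" entry in the list. In a regular expression "?" is a quantifier, and in the joined pattern it sits at index 107 (counting from 0), right after the final "|", with nothing before it to repeat. A minimal sketch of one possible fix, assuming every list entry should be matched literally, is to pass each item through `re.escape` before joining. The sample frame is made up for illustration.

```python
import re
import pandas as pd

questions_list = ["what", "when", "where", "which", "who", "whom", "whose", "why",
                  "why don't", "how", "how far", "how long", "how many",
                  "how much", "how old", "how come", "?"]

# Escape each entry so "?" is matched literally instead of acting as a quantifier.
pattern = "|".join(re.escape(q) for q in questions_list)

# Hypothetical sample data, not taken from the original file.
df_demo = pd.DataFrame({"QUERY": ["are panda dogs real?",
                                  "how far is the moon",
                                  "sentry replacement keys"]})

# astype(int) turns the boolean result into the 1/0 column asked for.
df_demo["QUESTIONS"] = df_demo["QUERY"].str.contains(pattern).astype(int)
print(df_demo)
```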

score 1 · Accepted Answer · answered Dec 27 '18 at 21:36

IIUC, you need to check whether the first word of the string is in the question list and return 1 if it is, else 0. In your function, rather than checking whether the entire string is in the question list, split the string and check whether the first element is in the list.

def questions(row):
    questions_list = ["are","what","when","where","which","who","whom","whose","why","why don't","how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0

df['QUESTIONS'] = df.apply(questions, axis=1)

You get

    QUERY                           FREQ    QUESTIONS
0   hindi movies for adults         595     0
1   are panda dogs real             383     1
2   asuedraw winning numbers        478     0
3   sentry replacement keys         608     0
4   rebuilding nicad battery packs  541     0
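The same first-word check can also be written without `apply`. This is a sketch of an alternative, not the answer's own code: `str.split().str[0]` takes the first token of each query and `isin` tests it against the list. Multi-word entries such as "how far" can never equal a single first token, so the hypothetical `first_words` list below leaves them out.

```python
import pandas as pd

# The five sample rows from the question.
df_demo = pd.DataFrame({
    "QUERY": ["hindi movies for adults", "are panda dogs real",
              "asuedraw winning numbers", "sentry replacement keys",
              "rebuilding nicad battery packs"],
    "FREQ": [595, 383, 478, 608, 541],
})

# Single-word entries only (an assumption made for this sketch).
first_words = ["are", "what", "when", "where", "which",
               "who", "whom", "whose", "why", "how"]

# First token of each query, tested for membership, converted to 1/0.
df_demo["QUESTIONS"] = (
    df_demo["QUERY"].str.split().str[0].isin(first_words).astype(int)
)
print(df_demo)
```

A side effect of this form is that an empty query yields a missing first token rather than an IndexError, and `isin` treats it as no match.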

answered Dec 27 '18 at 21:36

Vaishali


This is an interesting approach, the only question I have here is would all questions start with a word like that, or is there any sentence structure where the first word can be something else and still be a question. Other than that, this would be the simplest way to do it, thanks! – mlenthusiast Dec 31 '18 at 19:23
@codingenthusiast, are there any questions that start with words not in question_list? That's more a language question than a coding question :) If you can create an exhaustive question list, this approach will make it very simple after that. – Vaishali Dec 31 '18 at 19:38
Yup, that's really what the problem is, the code obviously works, but creating an exhaustive list of how a question can be asked is more tricky, and more along the lines of an NLP question. For now, I also added : elif row['QUERY'].split()[-1] in questions_list: return 1 to tackle the last question mark as well – mlenthusiast Dec 31 '18 at 19:40
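One thing to watch with the `split()[-1]` check from the last comment: when the question mark is attached to the final word, the last token is something like "real?", which is not in the list. A sketch of one alternative, assuming a trailing "?" should always count as a question, combines the first-word test with `str.endswith`. The sample queries and the `first_words` list are made up for illustration.

```python
import pandas as pd

# Hypothetical queries, not taken from the original file.
df_demo = pd.DataFrame({"QUERY": ["panda dogs real?",
                                  "how old is the moon",
                                  "sentry replacement keys"]})

first_words = ["are", "what", "when", "where", "which",
               "who", "whom", "whose", "why", "how"]

# Condition 1: the first token is a question word.
starts_with_question_word = df_demo["QUERY"].str.split().str[0].isin(first_words)

# Condition 2: the query ends with "?", ignoring surrounding whitespace.
ends_with_question_mark = df_demo["QUERY"].str.strip().str.endswith("?")

df_demo["QUESTIONS"] = (starts_with_question_word | ends_with_question_mark).astype(int)
print(df_demo)
```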
