I'm trying to create several columns in a pandas DataFrame at once, where each column name is a key in a dictionary and the function returns 1 if any of the values corresponding to that key are present.

My DataFrame has 3 columns: jp_ref, jp_title, and jp_description. Essentially, I'm searching each jp_description for the words assigned to a key and populating that key's column with 1s and 0s, depending on whether any of those words appear in the jp_description.


import pandas as pd

jp_title = ['software developer', 'operations analyst', 'it project manager']

jp_ref = ['j01', 'j02', 'j03']

jp_description = ['software developer with java and sql experience', 'operations analyst with ms in operations research, statistics or related field. sql experience desired.', 'it project manager with javascript working knowledge']

myDict = {'jp_title': jp_title, 'jp_ref': jp_ref, 'jp_description': jp_description}

data = pd.DataFrame(myDict)

technologies = {'java':['java','jdbc','jms','jconsole','jprobe','jax','jax-rs','kotlin','jdk'],
'javascript':['javascript','js','node','node.js','mustache.js','handlebar.js','express','angular',
             'angular.js','react.js','angularjs','jquery','backbone.js','d3'],
'sql':['sql','mysql','sqlite','t-sql','postgre','postgresql','db','etl']}

def term_search(doc,tech):
    for term in technologies[tech]:
        if term in doc:
            return 1
        else:
            return 0

for tech in technologies:
    data[tech] = data.apply(term_search(data['jp_description'],tech))

I received the following error but don't understand it:

TypeError: ("'int' object is not callable", 'occurred at index jp_ref')
  • where is your data ? – BENY Jul 18 '19 at 15:51
  • What does the actual dataframe look like? – G. Anderson Jul 18 '19 at 15:53
  • It’s text, in the form of job postings like “software developer with java experience”. I can add examples, but because of my NDA I can’t disclose real data. –  Jul 18 '19 at 15:53
  • Sample data would be fine, but we can't test without some sort of data. Please see [How to create good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – G. Anderson Jul 18 '19 at 15:54
  • A 3 row sample df has been made, thx for the helpful feedback. –  Jul 18 '19 at 16:05

1 Answer

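Your logic is wrong: term_search returns inside the first iteration of the loop (either 1 or 0), so jp_description is only ever compared with the first term in the list, never with the complete list. The TypeError itself arises because data.apply(...) expects a function, but it is handed the integer that term_search(...) has already returned.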
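To see the early return concretely, here is a minimal sketch on a plain string. The names first_term_only and any_term, the shortened term list, and the sample sentence are illustrative and not taken from the question; both functions use substring matching, as the original loop does.

```python
# Illustrative only: mirrors the loop from the question on a plain string.
terms = ['java', 'jdbc', 'sql']

def first_term_only(doc, terms):
    for term in terms:
        if term in doc:
            return 1
        else:
            return 0  # returns after the first term, so later terms are never checked

def any_term(doc, terms):
    # Returns 0 only after every term has been checked
    for term in terms:
        if term in doc:
            return 1
    return 0

doc = 'analyst with sql experience'
print(first_term_only(doc, terms))  # 0: only 'java' was tested
print(any_term(doc, terms))         # 1: 'sql' is found
```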

To fix this, split jp_description into words and intersect them with the terms listed for that technology in the technologies dict. If the intersection is non-empty, at least one term appears as a whole word, so return 1; otherwise return 0.

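def term_search(doc, tech):
    # Split the description into words and intersect them with this technology's terms
    words = set(doc.split(" "))
    common_elem = words.intersection(technologies[tech])
    # 1 if any term appears as a whole word, otherwise 0
    return 1 if common_elem else 0

for tech in technologies:
    # apply needs a callable, so wrap the call in a lambda
    data[tech] = data['jp_description'].apply(lambda x: term_search(x, tech))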

This produces:

             jp_title jp_ref          jp_description  java  javascript  sql
0  software developer    j01  software developer....     1           0    1
1  operations analyst    j02   operations analyst ..     0           0    1
2  it project manager    j03   it project manager...     0           1    0
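
One caveat with splitting on spaces: a term followed directly by punctuation (for example "sql." at the end of a sentence) stays attached to that punctuation and will not match. A possible variant is sketched below, assuming that lowercasing and tokenizing with a regular expression suits your data. The token pattern, the helper name term_search_tokens, the shortened term lists, and the second sample description are illustrative assumptions, not part of the answer above.

```python
import re
import pandas as pd

# Shortened, illustrative term lists
technologies = {
    'java': ['java', 'jdbc', 'jdk'],
    'sql': ['sql', 'mysql', 'postgresql'],
}

data = pd.DataFrame({'jp_description': [
    'software developer with java and sql experience',
    'analyst; experience with sql.',  # made-up example with trailing punctuation
    'it project manager with javascript working knowledge',
]})

# Assumed token pattern: runs of letters/digits, optionally joined by '.', '-' or '+',
# so 'node.js' and 't-sql' stay whole while a trailing '.' or ',' is dropped.
TOKEN = re.compile(r"[a-z0-9]+(?:[.\-+][a-z0-9]+)*")

def term_search_tokens(doc, terms):
    # 1 if any term matches a whole token in the description, otherwise 0
    tokens = set(TOKEN.findall(doc.lower()))
    return int(not tokens.isdisjoint(terms))

for tech, terms in technologies.items():
    # Passing terms through args avoids relying on the loop variable inside a lambda
    data[tech] = data['jp_description'].apply(term_search_tokens, args=(terms,))

print(data[['java', 'sql']])
```

Because matching is done on whole tokens, "javascript" does not count as a hit for "java", which is the same behavior as the split-based version.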
tawab_shakeel
  • This was really helpful! My reputation is a bit weak at the moment, but I upvoted. This technique will help me solve several similar problems in the future. –  Jul 18 '19 at 17:55