I have a large (>3 million rows, ~4 GB) CSV file with the following columns:
post_category, post_content, post_date
The column 'post_content' is of particular interest to me, and contains medical-domain text of the following form:
"Extracorporeal shock wave lithotripsy (ESWL) is commonly used for fragmentation of obstructing nephrolithiasis and has demonstrated effectiveness in management of pancreatic stones, in patients with symptomatic chronic calcific pancreatitis..."
I have a separate file containing several thousand (~10,000) medical-domain keywords, one per line:
File: lookup.txt
Pancreas
Pancreatic
Pancreatitis
Acute Pancreatitis
Chronic Pancreatitis
Chronic Calcific Pancreatitis
.... ....
Now, I wish to search for and extract all matching keywords from the 'post_content' of each row of the large CSV file, and write all matched keywords into a new column, 'keywords'.
So the example text above should yield: ('pancreatic', 'chronic calcific pancreatitis')
Note: 'pancreatitis' also matches inside 'chronic calcific pancreatitis', but it should not be reported separately, since the exact match is for the longer, more specific keyword.
The desired output is a CSV file with the following columns: post_category, post_content, post_date, keywords. For example:

"Organ: Pancreas", "Extracorporeal shock wave lithotripsy...", "Jul 24, 2014 10:00 AM", "pancreatic;chronic calcific pancreatitis"
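To make the matching rule concrete, here is a minimal sketch of the behaviour I am after, on a toy version of the data (the keyword list is abbreviated, and the substring check at the end is a simplification):

import re

# Abbreviated keyword list; the real lookup.txt has ~10,000 entries
keywords = ["Pancreas", "Pancreatic", "Pancreatitis", "Acute Pancreatitis",
            "Chronic Pancreatitis", "Chronic Calcific Pancreatitis"]

text = ("... effectiveness in management of pancreatic stones, in patients "
        "with symptomatic chronic calcific pancreatitis ...")

# Collect every keyword that occurs as a whole word/phrase, ...
hits = [k.lower() for k in keywords
        if re.search(r"\b%s\b" % re.escape(k), text, flags=re.IGNORECASE)]
# ... then suppress any hit contained in a longer hit, so 'pancreatitis'
# gives way to the more specific 'chronic calcific pancreatitis'
hits = [h for h in hits if not any(h != o and h in o for o in hits)]
print(";".join(hits))  # -> pancreatic;chronic calcific pancreatitis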
I have tried the following code, but it kept running for 2 days on my 8-core machine without finishing, and I killed it. In hindsight that may not be surprising: ~3 million rows times ~10,000 per-keyword re.search calls comes to roughly 3 x 10^10 regex searches.
How can I make this code time-efficient / reduce the processing time considerably, if possible?
# -*- coding: utf-8 -*-
import datetime
import multiprocessing as mp
import numpy as np
import os
import pandas as pd
import re
import sys

KEYWORDS = set(line.strip() for line in open('lookup.txt'))
def clean_raw_text(series):
    # Perform some text pre-processing to remove accented/non-ASCII text etc.
    # (actual cleaning steps omitted here; pass the text through as a placeholder)
    processed_text = str(series['post_content'])
    return processed_text
def match_indications(series):
    # Perform the actual keyword search over the cleaned text
    matches = []
    for indication in KEYWORDS:
        matchObj = re.search(indication, str(series['cleaned_post']), flags=re.U)
        if matchObj:
            matches.append(matchObj.group(0))
    return ";".join(matches)
def worker(df):
    name = mp.current_process().name
    print '%s: Processing started...' % name
    # Collect the new columns in a frame that shares df's index, so the
    # chunks can be concatenated back onto big_df along axis=1 later
    result = pd.DataFrame(index=df.index)
    result["cleaned_post"] = df.apply(clean_raw_text, axis=1)
    print '%s: Text cleaning done...' % name
    result["keywords"] = result.apply(match_indications, axis=1)
    print '%s: Keyword matching done...' % name
    return result
if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "%d arguments given: exiting..." % (len(sys.argv) - 1)
        print "Usage: python %s <Path-to-Input-File> <Path-to-Output-Dir>" % sys.argv[0]
        exit(1)
    ifpath = os.path.abspath(sys.argv[1])
    oppath = os.path.abspath(sys.argv[2])

    big_df = pd.read_csv(ifpath, header=0, engine='python', quotechar='"')
    big_df.info()
    num_processes = mp.cpu_count()
    p = mp.Pool(processes=num_processes)
    split_dfs = np.array_split(big_df, num_processes)
    pool_results = p.map(worker, split_dfs)
    p.close()
    p.join()

    # stack the chunk results along rows
    parts = pd.concat(pool_results, axis=0)
    # attach the new columns to big_df (row indices were preserved)
    big_df = pd.concat([big_df, parts], axis=1)
    big_df.info()

    big_df.drop('cleaned_post', axis=1, inplace=True)
    # Drop all rows where no keyword was found; 'keywords' holds strings, so
    # test for the empty string (np.isfinite only applies to numeric columns)
    processed_df = big_df[big_df['keywords'] != ""]
    processed_df.info()
    ctime = datetime.datetime.now().strftime('%d-%m-%Y_%H-%M-%S')
    ofpath = os.path.join(oppath, "%s.csv" % ctime)
    processed_df.to_csv(ofpath, sep=",", index=False)
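For what it's worth, a quick way to sanity-check the runtime before launching the full pool again is to time the per-row work on a small sample and extrapolate; a rough sketch reusing big_df, clean_raw_text and match_indications from the code above (figures are obviously machine-dependent):

import time

# Time the matching step on a 1,000-row sample and extrapolate to ~3M rows
# (single-core estimate, i.e. before the Pool split across 8 cores)
sample = big_df.head(1000).copy()
sample["cleaned_post"] = sample.apply(clean_raw_text, axis=1)

t0 = time.time()
sample.apply(match_indications, axis=1)
per_row = (time.time() - t0) / len(sample)
print("~%.2f ms/row => ~%.1f core-hours for 3M rows"
      % (per_row * 1000, per_row * 3e6 / 3600.0))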