
I have a large (>3 million rows; ~4 GB) csv file with the following columns:

post_category, post_content, post_date

The column 'post_content' is of particular interest to me, and contains medical-domain text of the following form:

"Extracorporeal shock wave lithotripsy (ESWL) is commonly used for fragmentation of obstructing nephrolithiasis and has demonstrated effectiveness in management of pancreatic stones, in patients with symptomatic chronic calcific pancreatitis..."

I have a separate file containing several thousand (~10,000) medical-domain keywords, one per line:

File: lookup.txt

Pancreas
Pancreatic
Pancreatitis
Acute Pancreatitis
Chronic Pancreatitis
Chronic Calcific Pancreatitis
.... ....

Now, I wish to search for and extract all matching keywords from the 'post_content' of each row of the 'large' csv file, and append all the matched keywords in a new column 'keywords'.

So, the example text above should return: ('pancreatic', 'chronic calcific pancreatitis')

Note: 'pancreatitis' also matches within 'chronic calcific pancreatitis', but it should not be reported, since the exact match is for the more specific keyword.

The desired output is a csv file with the following columns: post_category, post_content, post_date, keywords

For example:

"Organ: Pancreas", "Extracorporeal shock wave lithotripsy...", "Jul 24, 2014 10:00 AM", "pancreatic;chronic calcific pancreatitis"

I have tried the following code, but it ran for 2 days without finishing on my machine, which has 8 cores, so I killed it.

How can I make this code time-efficient / reduce the processing time considerably, if possible?

# -*- coding: utf-8 -*-
import datetime
import multiprocessing as mp
import numpy as np
import os
import pandas as pd
import re
import sys


KEYWORDS = set(line.strip() for line in open('lookup.txt') if line.strip())

def clean_raw_text(series):
    # Perform some text pre-processing to remove accented/non-ascii text etc.
    # (actual cleaning steps omitted here)
    processed_text = str(series['post_content'])
    return processed_text


def match_indications(series):
    # Perform actual keyword search of the text
    matches = []
    for indication in KEYWORDS:
        matchObj = re.search(indication, str(series['cleaned_post']), flags=re.U)
        if matchObj:
            matches.append(matchObj.group(0))
    return ";".join(matches)


def worker(df):
    name = mp.current_process().name
    print '%s: Processing Started...' % name
    # collect the new columns in a separate frame that keeps df's index
    result = pd.DataFrame(index=df.index)
    result["cleaned_post"] = df.apply(clean_raw_text, axis=1)
    print "%s : Text Cleaning done.." % name
    result["keywords"] = result.apply(match_indications, axis=1)
    print "%s : Keywords matching done.." % name
    return result


if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "%d Arguments Given : Exiting..." % (len(sys.argv)-1)
        print "Usage: python %s <Path-to-Input-File> <Path-to-Output-File>" % sys.argv[0]
        exit(1)

    ifpath = os.path.abspath(sys.argv[1])
    oppath = os.path.abspath(sys.argv[2])
    big_df = pd.read_csv(ifpath, header=0, engine='python', quotechar='"')
    print big_df.info()
    num_processes = mp.cpu_count()
    p = mp.Pool(processes=num_processes)
    split_dfs = np.array_split(big_df, num_processes)
    pool_results = p.map(worker, split_dfs)
    p.close()
    p.join()

    # join parts along rows
    parts = pd.concat(pool_results, axis=0)

    # merge parts to big_df
    big_df = pd.concat([big_df, parts], axis=1)
    print big_df.info()

    big_df.drop('cleaned_post', axis=1, inplace=True)
    # Drop all rows where no keyword was found
    processed_df = big_df[big_df['keywords'] != ""]
    print processed_df.info()
    ctime = datetime.datetime.now().strftime('%d-%m-%Y_%H-%M-%S')
    ofpath = os.path.join(oppath, "%s.csv"%ctime)
    processed_df.to_csv(ofpath, sep=",", index=False)
rahs
  • Instead of looping over 10,000 strings and treating them as regular expression source, consider building *one* regular expression (of the form `s0|s1|...|s9999`), compiling it once, and then using that (see the sketch after these comments). – unwind May 25 '15 at 13:37
  • This question might be better suited for http://codereview.stackexchange.com/ as that's where they review code. – Roan May 25 '15 at 13:49
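
A minimal sketch of the single combined pattern suggested in the first comment (the keyword file name, case-folding, and longest-first ordering are assumptions on my part, not part of the original suggestion):

import re

with open('lookup.txt') as f:
    keywords = [line.strip() for line in f if line.strip()]

# Longest keywords first so a longer phrase wins where several could match;
# findall's non-overlapping matching then keeps 'pancreatitis' from being
# reported inside an already matched 'chronic calcific pancreatitis'
keywords.sort(key=len, reverse=True)

# Escape each keyword and join everything into one alternation, compiled once
pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

def match_keywords(text):
    # one pass over the text instead of ~10,000 separate re.search calls
    return ";".join(sorted(set(m.lower() for m in pattern.findall(str(text)))))

# e.g. big_df['keywords'] = big_df['post_content'].apply(match_keywords)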

3 Answers


I would cut down the number of rows to search by running grep for "Pancrea" first:

grep -i pancrea filename.csv | awk -F: '{ print $2 }' > smaller.csv
fiacre

You're trying to perform a multi-keyword search by running 10K single-keyword searches over 3M texts, which results in a lot of per-call overhead and at least 30G matching operations!

The proper way to speed this up is by using a more fitting algorithm like Aho-Corasick (python lib) or Rabin-Karp.
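
For illustration, a minimal sketch using the pyahocorasick library (the package name `ahocorasick`, the keyword file name, and the lower-casing are assumptions; install with pip install pyahocorasick):

import ahocorasick

automaton = ahocorasick.Automaton()
with open('lookup.txt') as f:
    for line in f:
        kw = line.strip().lower()
        if kw:
            # the keyword itself is stored as the value returned on a match
            automaton.add_word(kw, kw)
automaton.make_automaton()

def find_keywords(text):
    # single pass over the text, reporting every keyword occurrence
    found = set(kw for _, kw in automaton.iter(str(text).lower()))
    # keep only the most specific hits: drop keywords that are substrings of
    # another matched keyword (e.g. 'pancreatitis' inside
    # 'chronic calcific pancreatitis')
    return ";".join(sorted(k for k in found
                           if not any(k != o and k in o for o in found)))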

Jörn Hees

I would not use a plain Python loop over your data. There are other tools, e.g. pandas, which use efficient C code where needed.

Maybe this post helps: Pandas for large data files
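
A rough sketch along those lines, reading the csv in chunks and using pandas' built-in string methods (the file names, chunk size, and the combined pattern are assumptions):

import re
import pandas as pd

with open('lookup.txt') as f:
    keywords = sorted((line.strip() for line in f if line.strip()),
                      key=len, reverse=True)
pattern = "|".join(re.escape(k) for k in keywords)

chunks = []
# process the ~4 GB file in pieces instead of loading it all at once
for chunk in pd.read_csv('input.csv', chunksize=100000):
    # one findall call per column chunk instead of a Python loop per keyword
    hits = chunk['post_content'].astype(str).str.findall(pattern, flags=re.IGNORECASE)
    chunk['keywords'] = hits.apply(lambda ms: ";".join(sorted(set(m.lower() for m in ms))))
    # keep only rows where at least one keyword was found
    chunks.append(chunk[chunk['keywords'] != ""])

pd.concat(chunks).to_csv('output.csv', index=False)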

Moritz