1

I am having trouble figuring out how to extract specific text within a string. My dataset has been pulled from de-identified electronic health records, and contains a list of every medication that our patients have been prescribed. I am, however, only concerned with a specific list of medications, which I have in another table. Within each cell is the name of the medication, dose, and form (Tabs, Caps, etc.) [See image]. Much of this information is not important for my analysis though, and I only need to extract the medication names that match my list. It might also be useful to extract the first word from each string, as it is (in most cases) the name of the medication.

I have examined a number of different methods of pulling substrings, but haven't quite found something that meets my needs. Any help would be greatly appreciated.

Thanks.

Medication DescriptionMedications of Interest

        Data DRUGS;
    infile datalines flowover;
    length drug1-drug69 $20;
    array drug[69];
    input (drug1-drug69)($);
    datalines;
    AMITRIPTYLINE
    AMOXAPINE
    BUPROPION
    CITALOPRAM
    CLOMIPRAMINE
    DESIPRAMINE
    DOXEPIN
    ESCITALOPRAM
    FLUOXETINE
    FLUVOXAMINE
    IMIPRAMINE
    ISOCARBOXAZID
    MAPROTILINE
    MIRTAZAPINE
    NEFAZODONE
    NORTRIPTYLINE
    PAROXETINE
    PHENELZINE
    PROTRIPTYLINE
    SERTRALINE
    TRANYLCYPROMINE
    TRAZODONE
    TRIMIPRAMINE
    VENLAFAXINE
    AMITRIP
    ELEVIL
    ENDEP
    LEVATE
    ADISEN
    AMOLIFE
    AMOXAN
    AMOXAPINE
    DEFANYL
    OXAMINE
    OXCAP
    WELLBUTRIN
    BUPROBAN
    APLENZIN
    BUDEPRION
    ZYBAN
    CELEXA
    ANAFRANIL
    NORPRAMIN
    SILENOR
    PRUDOXIN
    ZONALON
    LEXAPRO
    PROZAC
    SARAFEM
    LUVOX
    TOFRANIL
    TOFRANIL-PM
    MARPLAN
    LUDIOMIL
    REMERON
    REMERONSOLTAB
    PAMELOR
    PAXIL
    PEXEVA
    BRISDELLE
    NARDIL
    VIVACTIL
    ZOLOFT
    PARNATE
    OLEPTRO
    SURMONTIL
    EFFEXOR
    DESVENLAFAXINE
    PRISTIQ
    ;;;;
    run;

    Data DM4_;
        if _n_=1 then set DRUGS;
        array drug[69];

        set DM4;
        do _i = 1 to countw(Description,' ().,');
        _med = scan(Description,_i,' ().,');
        _whichmed = whichc(_med, of drug[*]);
        if _whichmed > 0 then leave;
        end;
    run;

Data DM_Meds (drop = drug1-drug69 _i _med _whichmed);
  Set DM4_;
  IF _whichmed > 0 then anti = _med;
  else anti = ' ';

run;
diaferiaj
  • 315
  • 1
  • 5
  • 13
  • Would be a good idea to mention the 'number of methods' you've already tried so you don't get those again as an answer. – Joe Apr 19 '16 at 15:45
  • Ideally, I would use something like the method you describe in this Q: http://stackoverflow.com/questions/13862573/sas-prx-to-extract-substring-please) where I have my dictionary of drugs and use that to match within the 'Description' variable, and return the drug name substring while removing the dose and form. I have the drug name in another table, and am trying to use a combination of your previous answer along with this tutorial: http://www2.sas.com/proceedings/forum2007/223-2007.pdf – diaferiaj Apr 19 '16 at 16:20
  • Then likely my answer below is better, I suspect. I don't think the PRX solution is ideal here, it would be very messy. – Joe Apr 19 '16 at 16:22
  • Thank you. I will try out your suggestion. – diaferiaj Apr 19 '16 at 16:26
  • Should it be doing the above? i.e. adding 70 new variables. – diaferiaj Apr 19 '16 at 18:42
  • You can (and should) drop those 70 variables (`drop drug:` or similar) when you use this in production - I kept them so it was obvious what it was doing. – Joe Apr 19 '16 at 19:17

2 Answers2

1

This is a fairly common problem with a bunch of possible solutions depending on your needs.

The simplest answer is to create an array, assuming you have a smallish number of medicines. This isn't necessarily the fastest solution, but it would work fairly well and is simple to construct. Just get your drug list into a dataset, transpose it to horizontal (one row with lots of meds), then load it up this way. You iterate over the words in the name of the medicine and see if any of them are in the medicine list - if they are, then bingo, you have your drug! In real use of course drop the drug: variables afterwards.

This works a bit better than the inverse (searching each drug to see if it's in the medicine name) since usually there are more words in the drug list than in the medicine name. The hash solution might be faster, if you're comfortable with hashes (load the drug list into a hash table then use find() to do the same as what whichc is doing here).

data have;
  input @1 medname $50.;
  datalines;
PROVIGIL OR
ENSURE HIGH PROTEIN OR LIQD
BENADRYL 25 MG OR CAPS
ECOTRIN LOW STRENGTH 81 MG OR TBEC
SPIRONOLACTONE 25 MG PO TABS
NORVASC 5 MG OR TABS
FLUOXETINE HCL 25MG
IBUPROFEN 200MG
NEFAZODONE TABS OR CAPS 20MG
PAXIL (PAROXETINE HCL) 25MG
;;;;
run;

data drugs;
infile datalines flowover;
length drug1-drug19 $20;
array drug[19];
input (drug1-drug19) ($);
datalines;
AMITRIPTYLINE
AMOXAPINE
BUPROPION
CITALOPRAM
CLOMIPRAMINE
DESIPRAMINE
OXEPIN
ESCITALOPRAM
FLUOXETINE
FLUVOXAMINE
IMIPRAMINE
ISOCARBOXAZID
MAPROTILINE
MIRTAZAPINE
NEFAZODONE
NORTRIPTYLINE
PAROXETINE
PHENELZINE
PROTRIPTYLINE
;;;;
run;


data want;
  if _n_ = 1 then set drugs;
  array drug[19];

  set have;
  do _i = 1 to countw(medname,' ().,');
    _medword = scan(medname,_i,' ().,');
    _whichmed = whichc(_medword, of drug[*]);
    if _whichmed > 0 then leave;
  end;
run;
Joe
  • 62,789
  • 6
  • 49
  • 67
0

This should be an easy task for PROC SQL.

Let's say you have patient information in table A and drug names in table B (long format, not the wide format you gave). Here is the code filtering table A rows into table C where description in A contains drug name in B.

PROC SQL;
CREATE TABLE C AS SELECT DISTINCT * 
    FROM A LEFT JOIN B
    ON UPCASE(A.description) CONTAINS UPCASE(B.drug);
QUIT;
Feng Jiang
  • 1,776
  • 19
  • 25
  • "the execution of this query involves performing one or more cartesian product joins that can not be optimized." After which it returns the original table with an additional variable named 'Drugs'-- but it is empty. – diaferiaj Apr 19 '16 at 21:30