
I have a very large RDS file of articles (13GB on disk). Once loaded, the data frame is ~6GB in R's global environment.

Each article has an ID, a date, POS-tagged body text, a pattern (which is just two or three words with their POS tags), and some other metadata.

structure(list(an = c("1", "2", "3", "4", "5"), pub_date = structure(c(11166, 8906, 12243, 4263, 13077), class = "Date"), 
source_code = c("1", "2", "2", "3", "2"), word_count = c(99L, 
97L, 30L, 68L, 44L), POStagged = c("the_DT investment_NN firm_NN lehman_NN brothers_NNS holdings_NNS said_VBD yesterday_NN that_IN it_PRP would_MD begin_VB processing_VBG its_PRP$ own_JJ stock_NN trades_NNS by_IN early_RB next_JJ year_NN and_CC end_VB its_PRP$ existing_VBG tradeclearing_NN contract_NN with_IN the_DT bear_NN stearns_VBZ companies_NNS lehman_NN which_WDT is_VBZ the_DT last_JJ big_JJ securities_NNS firm_NN to_TO farm_VB out_RP its_PRP$ stock_NN trade_NN processing_NN said_VBD it_PRP would_MD save_VB million_CD to_TO million_CD annually_RB by_IN clearing_VBG its_PRP$ own_JJ trades_NNS a_DT bear_NN stearns_VBZ spokesman_NN said_VBD lehmans_NNS business_NN contributed_VBD less_JJR than_IN percent_NN to_TO bear_VB stearnss_NN clearing_NN operations_NNS", 
"six_CD days_NNS after_IN she_PRP was_VBD introduced_VBN as_IN womens_NNS basketball_NN coach_NN at_IN wisconsin_NN with_IN a_DT fouryear_JJ contract_NN nell_NN fortner_NN resigned_VBD saying_VBG she_PRP wants_VBZ to_TO return_VB to_TO louisiana_JJR tech_NN as_IN an_DT assistant_NN im_NN shocked_VBN said_VBD associate_JJ athletic_JJ director_NN cheryl_NN marra_NN east_JJ carolina_NN came_VBD from_IN behind_IN with_IN two_CD runs_NNS in_IN the_DT seventh_JJ inning_NN and_CC defeated_VBD george_NN mason_NN in_IN the_DT colonial_JJ athletic_JJ association_NN baseball_NN tournament_NN in_IN norfolk_NN johnny_NN beck_NN went_VBD the_DT distance_NN for_IN the_DT pirates_NNS boosting_VBG his_PRP$ record_NN to_TO the_DT patriots_NNS season_NN closed_VBD at_IN", 
"tomorrow_NN clouds_NNS and_CC sun_NN high_JJ low_JJ", "the_DT diversity_NN of_IN the_DT chicago_NN financial_JJ future_NN markets_NNS the_DT chicagoans_NNS say_VBP also_RB enhances_VBG their_PRP$ strength_NN traders_NNS and_CC arbitragers_NNS can_MD exploit_VB price_NN anomalies_NNS for_IN example_NN between_IN cd_NN and_CC treasurybill_NN futures_NNS still_RB nyfe_JJ supporters_NNS say_VBP their_PRP$ head_NN start_VB in_IN cd_NN futures_NNS and_CC technical_JJ advantages_NNS in_IN the_DT contract_NN traded_VBN on_IN the_DT nyfe_NN mean_VBP that_IN the_DT chicago_NN exchanges_NNS will_MD continue_VB to_TO play_VB catchup_NN", 
"williams_NNS industries_NNS inc_IN the_DT manufacturing_NN and_CC construction_NN company_NN provides_VBZ steel_NN products_NNS to_TO build_VB major_JJ infrastructure_NN it_PRP has_VBZ been_VBN involved_VBN with_IN area_NN landmark_NN projects_NNS including_VBG rfk_JJ stadium_NN left_VBD the_DT woodrow_JJ wilson_NN bridge_NN and_CC the_DT mixing_NN bowl_NN"
), phrases = c("begin processing", "wants to return", "high", 
"head start in", "major"), repeatPhraseCount = c(1L, 1L, 
1L, 1L, 1L), pattern = c("begin_V", "turn_V", "high_JJ", 
"start_V", "major_JJ"), code = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_), match = c(TRUE, 
TRUE, TRUE, TRUE, TRUE)), .Names = c("an", "pub_date", "source_code", "word_count", "POStagged", "phrases", "repeatPhraseCount", "pattern", 
"code", "match"), row.names = c("4864065", "827626", "6281115", 
"281713", "3857705"), class = "data.frame")

My goal is to detect (for each row) the presence of pattern in POStagged.

The pattern column is a fixed list that I constructed myself: 465 words/phrases, each with its POS tag.

I want the match to differentiate between uses of the same word, for example "doubt" as a verb versus "doubt" as a noun; essentially, to determine context.
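For instance, a minimal illustration of the kind of context-sensitive match I mean (the tagged sentences here are made up):

library(stringr)

## "doubt" tagged as a noun vs. as a verb
str_detect("there_EX is_VBZ no_DT doubt_NN about_IN it_PRP", "doubt_NN")
## [1] TRUE
str_detect("i_PRP doubt_VBP that_IN", "doubt_NN")
## [1] FALSE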

However, in some cases I have phrases rather than single words, and the end of a phrase may vary. For example, in the phrase "might not be able to make the deal", the tail "be able to make the deal" could be any verb phrase (e.g. "be able to conclude the deal"). My attempts have been varied and I am not sure I am going about this the right way:

might_MD not_RB _VP: this works and picks up "might not", but it is clearly wrong since the verb phrase after it is not picked up.

If I simply use fixed(), then str_detect works and execution is very fast. However, fixed() surely misses some cases (as described above), and I have no way to compare the results to be sure. Here is an example:

str_detect("might_MD not_RB be able to make the deal", "might_MD not_RB [A-Za-z]+(?:\\s+[A-Za-z]+){0,6}")
TRUE

str_detect("might_MD not_RB be able to make the deal", fixed("might_MD not_RB [A-Za-z]+(?:\\s+[A-Za-z]+){0,6}"))
FALSE

See also: https://stackoverflow.com/a/51406046/3290154
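One direction I am considering is to split my pattern list into literal patterns (safe for fixed()) and patterns that genuinely need the regex engine, paying the regex cost only where necessary. A rough sketch; the is_regex heuristic is my own and may be too naive:

library(stringr)

## flag patterns that contain regex metacharacters; everything else
## can be matched with the much faster fixed()
is_regex <- str_detect(df$pattern, "[\\[\\](){}|?*+\\\\^$.]")

df$match <- NA
df$match[!is_regex] <- str_detect(df$POStagged[!is_regex],
                                  fixed(df$pattern[!is_regex]))
df$match[is_regex]  <- str_detect(df$POStagged[is_regex],
                                  df$pattern[is_regex])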

My desired output is an additional column in my dataframe with a TRUE/FALSE result telling me if pattern is seen in POStagged or not.

## Attempt 1 - R fatally crashes
## this works on a smaller sample but crashes R on the full data frame
df$match <- str_detect(df$POStagged, df$pattern)

## Attempt 2
## This also crashes (using multidplyr; cluster setup lines omitted)
df %>%
    partition(source_code, cluster = cl) %>%
    mutate(match=str_detect(POStagged, pattern)) %>%
    filter(!(match==FALSE)) %>%
    filter(!is.na(match)) %>%
    collect()

## I get this error: Error in serialize(data, node$con) : error writing to connection

Based on my understanding, this is because of how multidplyr handles memory and loads data onto the workers (https://github.com/hadley/multidplyr/blob/master/vignettes/multidplyr.md). But since multidplyr uses the parallel package under the hood, if I extrapolate, I should still be OK: if I split my data into, say, 5 copies, that is 6GB × 5 = 30GB, plus any packages and so on.
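To test that extrapolation, I am considering doing the split myself with the parallel package, so that I control exactly how many copies exist at once. A sketch, splitting into 5 contiguous row-wise chunks (the chunk count is arbitrary):

library(parallel)
library(stringr)

cl <- makeCluster(5)
clusterEvalQ(cl, library(stringr))

## 5 contiguous row-wise chunks; each worker receives ~1/5 of the rows
chunks <- split(df, cut(seq_len(nrow(df)), 5, labels = FALSE))

res <- parLapply(cl, chunks, function(ch)
  str_detect(ch$POStagged, ch$pattern))
stopCluster(cl)

## chunks are contiguous and ordered, so unlist() restores row order
df$match <- unlist(res, use.names = FALSE)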

## Attempt 3 - I tried saving the RDS out to a csv/txt file so I could use the chunked package, but the resulting csv/txt was over 100GB.

## Attempt 4 - I tried a for loop, but I estimate it would take ~12 days to run
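For reference, the loop is essentially row-by-row, along the lines of this simplified sketch:

df$match <- NA
for (i in seq_len(nrow(df))) {
  df$match[i] <- str_detect(df$POStagged[i], df$pattern[i])
}

Processing blocks of rows per iteration instead of single rows might be a middle ground between this and Attempt 1, but I have not timed that.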

I read a little about the greediness of regular expressions, so I tried modifying my pattern column to make the quantifiers lazy by appending ?. However, going this route means I can't use fixed(), since fixed() treats the pattern as a literal string and all my matches come back FALSE. Any pointers in the right direction are much appreciated!

https://stringr.tidyverse.org/articles/regular-expressions.html

What do 'lazy' and 'greedy' mean in the context of regular expressions?
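For reference, a small illustration of the greedy/lazy difference, using str_extract so the matched text is visible:

library(stringr)

x <- "might_MD not_RB be able to make the deal"

str_extract(x, "might_MD not_RB .*")   ## greedy: consumes to the end of the string
## [1] "might_MD not_RB be able to make the deal"
str_extract(x, "might_MD not_RB .*?")  ## lazy: matches as little as possible
## [1] "might_MD not_RB "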

  • I'm trying to understand your goal based on your code, but I'm not sure I get it. Could you state it in words, please? It seems like you are trying to detect and flag all the rows of your data frame where (some? all?) of the space-separated strings in the `pattern` column occur in the `POStagged` column. Is this correct? And you're using `str_detect`... because you assume it will be faster than `grepl`? It would also help if you would share a few rows of data, (say, 5-10) with the desired results. Without seeing that, it's very hard to figure out if `fixed()` is a viable option. – Gregor Thomas Sep 25 '18 at 13:42
  • And why are you using `lapply` inside `preprocess` when you seem to be only giving it a string column as input? I'm not sure what you're running it on because you run it on `df$variable`, but your sample data doesn't contain a column named `variable`... is `df$variable` a list column? Otherwise the `lapply` seems like a huge inefficiency. When you share more sample data, please do it in a way that the column classes are clear - `dput()` is best for this as it gives a copy/pasteable version of the exact data structure. – Gregor Thomas Sep 25 '18 at 13:48
  • Thanks @Gregor - I've included some more information – Cola4ever Sep 27 '18 at 13:59
  • The new example helps a lot. Some questions remain: (1) I don't know what you mean by *"I don't want an exact match, so for instance, i would like to detect "likely" as well as "very likely""*. Neither "likely" nor "very likely" appear in your data - is that supposed to be an example of strings to match, or are you being vague about how likely a match is to actually be a match? How close does a match need to be? Can you give examples of non-exact matches that you would still like to catch? – Gregor Thomas Sep 27 '18 at 15:15
  • (2) The first three patterns in your example seem like single terms (I think?), but the fourth pattern is `"the_DT _JJS NP"`. Do you need to find that entire term, or are you looking for, say, all of `the_DT` and `_JJS` and `NP` anywhere, but not necessarily consecutively? (Is that what the `patternList`, which shows up in some of your code but not your data, is doing?) – Gregor Thomas Sep 27 '18 at 15:20
  • (3) As far as I can tell, the patterns in your sample data don't occur in the `POStagged` column, so the correct result is `FALSE`. Could you either edit the example data so that some of the results should be true, or explain why I'm wrong about the all false result? – Gregor Thomas Sep 27 '18 at 15:26
  • From where we are now, we can see that `str_detect` is a good choice because you need something vectorized over both the string and the pattern, and `grepl` is only vectorized over the string. However, any regex patterns over such large data have the potential to be quite slow. Once we have a clearer idea of your problem the easiest way to speed things would be to edit your data so we can use fixed patterns instead of regex (if possible). But I need you to be more specific before I can tell if that will be possible. – Gregor Thomas Sep 27 '18 at 15:36
  • Lastly, none of your patterns (at least in the data you shared) have quantifiers like `?`, `*`, `+`, so the bit about lazy vs greedy quantifiers seems entirely irrelevant. Adding in quantifiers (if unnecessary) will just slow things down. I'd suggest editing it out of your question to clean things up a bit. (Or, if you do need quantifiers, show why/where.) – Gregor Thomas Sep 27 '18 at 15:39
  • Thanks @Gregor - I will update the question and examples shortly and clarify the point on quantifiers - I made some updates since yesterday. – Cola4ever Sep 27 '18 at 15:44
  • Thanks, please do a relatively thorough overhaul. Your question is getting long enough it's hard to imagine many people reading through the whole thing. You don't need to call out updates at the bottom, just integrate them into the text. – Gregor Thomas Sep 27 '18 at 16:00

1 Answer


Maybe you can make faster progress and get better results if you use BERT instead (see "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing")? This is a totally different approach, of course, I know, sorry. Mentioning it just in case you are not aware of it.
