Questions tagged [agrep]

An approximate grep for fuzzy matching

agrep (approximate grep) is a proprietary fuzzy string searching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows. For each query it selects the best-suited algorithm from several of the fastest known built-in string searching algorithms, including Manber and Wu's bitap algorithm, which is based on Levenshtein distances. agrep is also the search engine in the indexer program GLIMPSE. agrep is free for private and non-commercial use only, and belongs to the University of Arizona.
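
As a rough illustration of what "approximate matching under Levenshtein distance" means, here is a minimal sketch. It deliberately uses plain dynamic programming rather than the bitap algorithm, and it is not agrep's actual implementation. The function names `best_substring_distance` and `fuzzy_contains` are hypothetical, chosen only for this example.

```python
def best_substring_distance(pattern: str, text: str) -> int:
    """Smallest Levenshtein distance between pattern and any substring of text."""
    # prev[i] = fewest edits to turn pattern[:i] into some substring of the
    # text that ends at the current position (initially: the empty text).
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        # A match may start anywhere in text, so the empty prefix costs 0.
        curr = [0]
        for i, pc in enumerate(pattern, start=1):
            cost = 0 if pc == ch else 1
            curr.append(min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in text
                curr[i - 1] + 1,     # pattern character missing from text
            ))
        best = min(best, curr[-1])
        prev = curr
    return best


def fuzzy_contains(pattern: str, text: str, max_errors: int) -> bool:
    """True if pattern occurs in text with at most max_errors edits."""
    return best_substring_distance(pattern, text) <= max_errors


if __name__ == "__main__":
    for candidate in ["tesr", "teqr", "toar"]:
        print(candidate, best_substring_distance("test", candidate))
```

The sketch runs in time proportional to `len(pattern) * len(text)`. Bitap-style algorithms aim to do the same job faster by packing the per-column state into machine words.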

89 questions
26
votes
2 answers

agrep: only return best match(es)

I'm using the 'agrep' function in R, which returns a vector of matches. I would like a function similar to agrep that only returns the best match, or best matches if there are ties. Currently, I am doing this using the 'sdist()' function from the…
Zach
  • 29,791
  • 35
  • 142
  • 201
8
votes
2 answers

Create a unique ID by fuzzy matching of names (via agrep using R)

Using R, I am trying to match on people's names in a dataset structured by year and city. Due to some spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy match names. A sample chunk of the dataset is structured as…
thomasB
  • 303
  • 3
  • 11
7
votes
2 answers

How can I get the precise common "max.distance" value for fuzzy string matching using agrep?

I am trying to figure out the best precision for fuzzy string matching between two string names using agrep. However, I will need to choose one precision "max.distance" to apply the same across all strings I am trying to match since the amount of…
Eric
  • 528
  • 1
  • 8
  • 26
6
votes
2 answers

Efficiently check if a string is an approximate substring of (approximately contained in) another string, up to a given error threshold?

Take two character strings in C or C++, s1 and s2. It is rather trivial to check if one contains the other exactly. The following will return true in case s2 is a substring of s1. In C: strstr(s1, s2) In C++: #include <string> str.find(str2) !=…
5
votes
1 answer

agrep in R - find *all* matches in a string (global flag)

I've got a string: string <- "I do not like green eggs and ham!" and a pattern pattern <- "(egs|ham)" I want to know how many times pattern matches string with fuzzy matching (agrep). gregexpr will do this for normal matching - I just want to know…
mathematical.coffee
  • 55,977
  • 11
  • 154
  • 194
4
votes
2 answers

R: agrep results quantifier

Is there a built-in way to quantify results of agrep function? E.g. in agrep("test", c("tesr", "teqr", "toar"), max = 2, v=T) [1] "tesr" "teqr" tesr is only 1 char permutation away from test, while teqr is 2, and toar is 3 and hence not found.…
Alexey Ferapontov
  • 5,029
  • 4
  • 22
  • 39
4
votes
0 answers

How does agrep matching work?

The agrep function gives some puzzling results and I'd like to understand its behavior better. For example: agrep("abcd",c("abc","abcde","abcef"),value=T,max.distance = 1) Returns: [1] "abc" "abcde" "abcef" But the distance between "abcd" and…
xyy
  • 547
  • 1
  • 5
  • 12
4
votes
2 answers

approximate string matching within single list - r

I have a list in a data frame of thousands of names in a long list. Many of the names have small differences in them which make them slightly different. I would like to find a way to match these names. For example: names <- c('jon smith','jon,…
Luke Macaulay
  • 393
  • 5
  • 14
4
votes
1 answer

R agrep: how to match with more than 1 substitution

I'm trying to match a string to a vector of strings: a <- c('abcde', 'abcdf', 'abcdg') agrep('abcdh', a, max.distance=list(substitutions=1)) # [1] 1 2 3 agrep('abchh', a, max.distance=list(substitutions=2)) # character(0) I didn't expect the…
esa606
  • 370
  • 3
  • 13
3
votes
2 answers

Multiple keyword (100s to 1000s) search (string-search algorithm) in PHP

I have this problem to solve in my PHP project where some keywords (from a few hundreds to a few thousands, lengths can vary) need to be searched in a string about 100-300 characters long, sometimes of lesser length 30-50 chars. I can preprocess the…
aditya
  • 143
  • 6
3
votes
1 answer

How to set the cost argument in the adist and agrep function?

I need some help to understand the arguments of these functions. I took the example from the help. ## To see the transformation counts for the Levenshtein distance: drop(attr(adist("kitten", "sitting", counts = TRUE), "counts")) # ins del sub # 1…
Mario GS
  • 859
  • 8
  • 22
3
votes
1 answer

R: agrep with vector pattern

I have a vector of patterns, and need to use agrep on them. The problem is that agrep seems to take only one pattern at a time. patt <- c("test","10 Barrel") lut <- c("1 Barrel","10 Barrel Brewing","Harpoon 100 Barrel…
Alexey Ferapontov
  • 5,029
  • 4
  • 22
  • 39
3
votes
1 answer

String matching records to count all instances in a dataframe

I am trying to extract all strings from rows in a dataframe that match certain criteria, for example how many words match 'corn' in each row. Here is the input. install.packages('stringr') library(stringr) dataset <- c("corn", "cornmeal", "corn…
user3570187
  • 1,743
  • 3
  • 17
  • 34
3
votes
2 answers

which R function to use for Text Auto-Correction?

I have a csv Document with 2 columns which contains Commodity Category and Commodity Name. Ex: Sl.No. Commodity Category Commodity Name 1 Stationary Pencil 2 Stationary Pen 3 Stationary Marker 4 Office…
Viamia
  • 83
  • 7
3
votes
2 answers

unexpected agrep() results related to max.distance in R

EDIT: This bug was found in 32-bit versions of R and was fixed in R version 2.9.2. This was tweeted to me by @leoniedu today and I don't have an answer for him so I thought I would post it here. I have read the documentation for agrep() (fuzzy string…
JD Long
  • 59,675
  • 58
  • 202
  • 294