16

Is there any library out there that can take a text (like an HTML document) and a list of strings (like the names of some products), find a pattern in the list of strings, and generate a regular expression that would extract all the strings in the text (the HTML document) that match the pattern it found?

For example, given the following html:

<table>
  <tr>
    <td>Product 1</td>
    <td>Product 2</td>
    <td>Product 3</td>
    <td>Product 4</td>
    <td>Product 5</td>
    <td>Product 6</td>
    <td>Product 7</td>
    <td>Product 8</td>
  </tr>
</table>

and the following list of strings:

['Product 1', 'Product 2', 'Product 3']

I'd like a function that would build a regex like the following:

'<td>(.*?)</td>'

and then extract everything from the HTML that matches the regex. In this case, the output would be:

['Product 1', 'Product 2', 'Product 3', 'Product 4', 'Product 5', 'Product 6', 'Product 7', 'Product 8']
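For reference, applying that target regex by hand with the standard `re` module already yields the desired output; the hard part is inventing the pattern, not applying it:

```python
import re

html = """<table>
  <tr>
    <td>Product 1</td>
    <td>Product 2</td>
    <td>Product 3</td>
    <td>Product 4</td>
    <td>Product 5</td>
    <td>Product 6</td>
    <td>Product 7</td>
    <td>Product 8</td>
  </tr>
</table>"""

# This is the regex the hypothetical library would have to come up with
# from the three samples alone.
matches = re.findall(r'<td>(.*?)</td>', html)
print(matches)  # ['Product 1', ..., 'Product 8']
```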

CLARIFICATION:

I'd like the function to look at the surroundings of the samples, not at the samples themselves. So, for example, if the html was:

<tr>
  <td>Word</td>
  <td>More words</td>
  <td>101</td>
  <td>-1-0-1-</td>
</tr>

and the samples ['Word', 'More words'] I'd like it to extract:

['Word', 'More words', '101', '-1-0-1-']
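A minimal sketch of the surroundings-based induction described here (the helper name `induce_regex` and the fixed context-window size are my own assumptions, not an existing library): grab a few characters of context on each side of every sample, keep only the context shared by all samples, escape it, and wrap it around `(.*?)` as lookarounds so adjacent matches don't swallow each other's context:

```python
import re

def induce_regex(text, samples, window=6):
    """Build '(?<=before)(.*?)(?=after)' from the context shared by all samples.

    `window` is a crude knob: how many characters of surrounding context
    to consider on each side of every sample.
    """
    befores, afters = [], []
    for s in samples:
        i = text.index(s)  # first occurrence only; a real tool would handle repeats
        befores.append(text[max(0, i - window):i])
        afters.append(text[i + len(s):i + len(s) + window])

    def common_prefix(strings):
        out = ''
        for chars in zip(*strings):
            if len(set(chars)) != 1:
                break
            out += chars[0]
        return out

    def common_suffix(strings):
        return common_prefix([s[::-1] for s in strings])[::-1]

    before, after = common_suffix(befores), common_prefix(afters)
    # Lookarounds keep the contexts out of the match, so matches can sit
    # right next to each other without overlapping.
    return '(?<=' + re.escape(before) + ')(.*?)(?=' + re.escape(after) + ')'

html = "<tr>\n  <td>Word</td>\n  <td>More words</td>\n  <td>101</td>\n  <td>-1-0-1-</td>\n</tr>"
pattern = induce_regex(html, ['Word', 'More words'])
print(re.findall(pattern, html))  # ['Word', 'More words', '101', '-1-0-1-']
```

This is only a toy: it ignores repeated samples, picks the window size by hand, and has no notion of generalizing the contexts beyond a literal string, but it shows the shape of the approach.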
Ionut Hulub
    Why shouldn't it build the regex `'Product [1-3]'`? – mgilson Jul 19 '13 at 15:37
  • 5
    Related: http://stackoverflow.com/questions/616292/is-it-possible-for-a-computer-to-learn-a-regular-expression-by-user-provided-e – Gustav Larsson Jul 19 '13 at 15:37
  • @mgilson It should try to generalize as much as possible while not matching more than the examples (list of strings) do... – Ionut Hulub Jul 19 '13 at 15:39
  • 1
    @mgilson I'm guessing there should be some sense of regularization in the learning, so that \d+ has a lower cost than \d, which has a lower cost than [1-3]. That way it would generalize better. – Gustav Larsson Jul 19 '13 at 15:40
  • 3
    @IonutHulub Wouldn't it be even more "general" if the resulting regex also consumed the surrounding whitespace: `\s*(.*?)\s*`? What's the principle telling the algorithm to stop at the boundary of the `td` tags? And are you really interested in a generic regex or in discovering where the strings are located within the hierarchy of an HTML- or XML-style document? – FMc Jul 19 '13 at 15:48
  • @FMc either would work just fine. I want to extract info from HTML (usually, but not always) and I spend a lot of time creating regexes. I'd like something that can do that for me using only some examples of what the output should look like. – Ionut Hulub Jul 19 '13 at 16:00
  • Simple cases like the above might be doable, but the corner cases approach infinity. Can you place more constraints on the type of input you'll be parsing? – Chip Camden Jul 21 '13 at 20:30
  • 8
    This sounds like an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). You should probably open a StackOverflow question asking about the problem you have, rather than the tool you think would be useful to solve your problem. – user2357112 Jul 21 '13 at 21:50
  • Why do you want to match `(.*?)` and not `Product\s\d`? It seems like you're edging towards `.*`, which is kinda pointless... – Ben Jul 21 '13 at 21:52
  • 3
    Deducing XPath/CSS selectors would make much more sense than deducing regexes here – Eric Jul 21 '13 at 21:57
  • 1
    If something could produce regex which would identify the strings you want, wouldn't it first have to identify the strings you want, therefore making the regex unnecessary? – Crowman Jul 21 '13 at 23:59
  • 1
    It is obligatory to mention that you can't parse XML or HTML in general with regular expressions. General XML is a context free language, and HTML is at least that (maybe higher with all the non-strictness about the tag usage). So regular expressions are insufficient. – jpmc26 Jul 22 '13 at 05:25
  • 2
    What is the actual problem you are trying to solve? – Burhan Khalid Jul 22 '13 at 08:06
  • This answer to a previous question would appear to be relevant: http://stackoverflow.com/a/1732454/482420, AFAIK it is equally applicable to XML – DaveP Jul 24 '13 at 06:49
  • Sounds like you need a lexer/parser. Ever heard of PyParsing? I've never heard of a regex generator like you're asking for before. If your library can take a string that contains the surrounding elements for the regex, PyParsing or PLY would be your best bet. – kirbyfan64sos Jul 24 '13 at 19:11

7 Answers

10

Your requirement is at the same time very specific and very general.

I don't think you would ever find any library for your purpose unless you write your own.

On the other hand, if you spend too much time writing regexes, you could use a GUI tool to help you build them, such as RegexMagic: http://www.regular-expressions.info/regexmagic.html

However, if you only need to extract data from HTML documents, you should consider using an HTML parser instead; it will make things a lot easier.

I recommend BeautifulSoup for parsing HTML documents in Python: https://pypi.python.org/pypi/beautifulsoup4/4.2.1
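For instance, the table in the question collapses to a few lines with BeautifulSoup (assuming the `beautifulsoup4` package is installed):

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>Product 1</td><td>Product 2</td><td>Product 3</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
# Every <td>'s text, no regex required.
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)  # ['Product 1', 'Product 2', 'Product 3']
```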

Benjamin Toueg
  • Regexmagic is not good for my needs. I have started working of a library and I use beautifulsoup4 to parse the html. I will post it here when I have a stable version. – Ionut Hulub Jul 22 '13 at 08:12
  • This seems to be a [grammar induction](https://en.wikipedia.org/wiki/Grammar_induction) problem, since it consists of generating a regular expression that matches every string in a list. – Anderson Green Jul 04 '22 at 22:42
7

I'm pretty sure the answer to this question in the general case (without being pedantic) is no. The problem is that an arbitrary text, together with an arbitrary set of substrings of that text, does not rigorously define a single regular expression.

As a couple of people have mentioned, a function could simply return .* for every set of inputs. Or it could return, for the input strings ['desired', 'input', 'strings'], the regex

'(desired)+|(input)+|(strings)+'

Or plenty of other trivially correct but wholly useless results.
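Both degenerate answers really do accept every sample, which is what makes them "correct"; a quick check with the standard `re` module shows they just generalize in opposite, equally useless directions:

```python
import re

samples = ['desired', 'input', 'strings']

# The catch-all and the literal alternation both match every sample...
for pattern in [r'.*', r'(desired)+|(input)+|(strings)+']:
    assert all(re.fullmatch(pattern, s) for s in samples)

# ...but one matches everything and the other matches nothing new:
print(bool(re.fullmatch(r'.*', 'anything at all')))                           # True
print(bool(re.fullmatch(r'(desired)+|(input)+|(strings)+', 'anything at all')))  # False
```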

The issue you're facing is that in order to build a regex, you need to rigorously define it. And to do that, you need to describe the desired expression using language as expressive as the regex language you're working in... a string plus a list of substrings is not sufficient (just look at all the options a tool like RegexMagic needs to compute regular expressions in a limited environment!). In practical terms, this means that you need the regular expression you want, in order to compute it efficiently.


Of course, you could always go the million-monkeys route and attempt to evolve an appropriate regex somehow, but you're still going to have the problem of requiring a huge sample input of text + expected output in order to get a viable expression. Plus it'll take ages to run and probably be bloated six ways from Sunday with useless detritus. You'd likely be better off writing it yourself.

Henry Keiter
  • Perhaps it was my bad for not being clear enough, but I don't want the function to construct a regex by looking at the samples, but by looking at the surroundings of the samples. So the final regex will always be of the form: `before_regex` + `(.*?)` + `after_regex`, where `(.*?)` would catch the samples. – Ionut Hulub Jul 24 '13 at 17:06
  • 1
    @IonutHulub Requesting a longer regex doesn't change the basic problem, which is that the input you're hoping to provide is simply not expressive enough to describe a regex. I can come up with any number of regular expressions of the form `before+(group)+after`, all of which are "correct" and none of which are actually useful. `a*(sample)b*`, `b*(sample)a*`, etc... The inputs you want to use are just not sufficient to the task at hand. – Henry Keiter Jul 24 '13 at 17:26
  • @HenryKeiter Do not include the samples in the regex. Only look at the surroundings. In the example I provided, the function should notice that all 3 samples have `<td>` before them and `</td>` after, so it should construct a regex like `<td>(.*?)</td>`. This is doable... I could make a prototype in ~2 hours. I'm working on some other projects for work right now but I'll post the library here when I finish it. – Ionut Hulub Jul 25 '13 at 10:33
  • @IonutHulub **How** should the function notice that all samples have `<td>` before and `</td>` after, as opposed to `[fubar]*` before and after? Or ` ` before and `\n ` after? And as mgilson says in the comments, why shouldn't it return the regex `'Product [1-3]'`? After all, *that* exactly replicates your desired matches, whereas to get the regex `<td>(.*?)</td>`, the function must somehow extrapolate that you wanted to match more than just the `['Product 1', 'Product 2', 'Product 3']` you provided. – Henry Keiter Jul 24 '13 at 17:26
  • @IonutHulub From RegexMagic's [page on this question](http://www.regexmagic.com/autogenerate.html) (admittedly a somewhat biased source, but that doesn't change the facts): "The only way for the computer to build a predictable regular expression is to require you to list all possible matches. Then it could generate a search pattern that matches exactly those matches, and nothing else. Usually, providing an exhaustive list of matches is exactly what we're trying to avoid." – Henry Keiter Jul 25 '13 at 14:12
  • @HenryKeiter I made it pretty clear, I think. The function should not return `'Product [1-3]'` because it shouldn't make the regex based on the samples, but based on the surroundings of the samples. Also, it cannot return `[fubar]*`, because if you apply that regex it won't find the samples in the text. If it returns `\n` after instead of `</td>`, that's OK, because both extract the samples from the text. – Ionut Hulub Jul 25 '13 at 14:26
4

I had a similar problem. Pyparsing is a great tool to do exactly what you describe.

https://github.com/pyparsing/pyparsing

It allows you to build expressions much like a regex, but much more flexible. The site has some good examples.

Here is a quick script for the problem you posed above:

from pyparsing import makeHTMLTags, SkipTo

text_string = """<table>
<tr>
     <td>Product 1</td>
     <td>Product 2</td>
     <td>Product 3</td>
     <td>Product 4</td>
     <td>Product 5</td>
     <td>Product 6</td>
     <td>Product 7</td>
     <td>Product 8</td>
</tr>
</table>"""

# An opening <td>, everything up to the matching </td>, then the closing tag.
anchorStart, anchorEnd = makeHTMLTags("td")
table_cell = anchorStart + SkipTo(anchorEnd).setResultsName("contents") + anchorEnd

# scanString walks the whole text, so there is no need to split it into lines.
results = [''.join(tokens.contents) for tokens, start, end in table_cell.scanString(text_string)]

for cell in results:
    print(cell)
MunkeyWrench
2

Try this:

https://github.com/noprompt/frak

It's written in Clojure, and there's no guarantee that what it outputs is the most concise expression, but it seems to have some potential.

Simon
  • You could also modify the candidate elimination algorithm, developed by Tom Mitchell, to perform regex learning: http://artint.info/html/ArtInt_193.html. You would start with regular expressions that match sentences word for word, and then generalize them by removing words. – Simon Nov 25 '13 at 23:02
0

Perhaps it would be better to use a Python HTML parser that supports XPath (see this related question), look for the bits of interest in the HTML, and then record their XPaths - or at least the ones shared by more than one of the examples?
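A sketch of the idea using only the standard library (ElementTree understands just a limited XPath subset and requires well-formed markup, so real-world HTML would need lxml or BeautifulSoup instead):

```python
import xml.etree.ElementTree as ET

html = "<table><tr><td>Product 1</td><td>Product 2</td></tr></table>"
root = ET.fromstring(html)  # works only because this fragment is well-formed XML

# './/td' is the (limited) XPath shared by all the example strings.
texts = [td.text for td in root.findall('.//td')]
print(texts)  # ['Product 1', 'Product 2']
```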

m01
-1
const table = document.querySelector("table");
const rows = table.querySelectorAll("tr");

let array = [];

for (const row of rows) {
  const cells = row.querySelectorAll("td");
  let rowArray = [];
  for (const cell of cells) {
    rowArray.push(cell.textContent);
  }
  array.push(rowArray);
}

console.log(array);
Joel Mora
-2

Rather than generating a regex, how about using a more general one? If your data is constrained to the inner text of an element that does not itself contain elements, then this regex, used with re.findall, will yield a list of tuples where each tuple is (tagname, text):

r'<(?P<tag>[^>]*)>([^<>]+?)</(?P=tag)>'

You could then extract just the text from each tuple easily.
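Something like this, for instance (the sample HTML is my own):

```python
import re

pattern = r'<(?P<tag>[^>]*)>([^<>]+?)</(?P=tag)>'
html = "<tr><td>Word</td><td>101</td></tr>"

pairs = re.findall(pattern, html)
print(pairs)  # [('td', 'Word'), ('td', '101')]

# Keep just the inner text from each (tagname, text) tuple.
texts = [text for tag, text in pairs]
print(texts)  # ['Word', '101']
```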

Chip Camden
  • NB this regex uses python-specific syntax for named groups, but they're available in most other flavors using different syntax. – Chip Camden Jul 21 '13 at 20:49