
I want to find names that are referenced in text files. An author can have an arbitrary number of names and titles. A match is only found if all names match (e.g. a person named "John Doe" is not matched in a text that only contains "John").

The way I have solved it now is to split each name into tokens and store the first token in a HashSet, with the lower-case string as the key. Each token in turn holds a set of the tokens that may follow it in a name, and so on.

This results in a lot of HashSet objects that add overhead. Is there a better way of handling this? I would prefer a library if possible, but anything will help.

I'm open to switching to Python if there are good solutions there.

  • Do you have a small sample text file to show us please? – ilango Oct 15 '12 at 21:55
  • Not where I'm at now. But think in terms of Amazon. I will actually use a similar source for the lookup values (authors). The data material to be matched are book reviews etc. that contains lots of text that I'm not interested in. – user1175332 Oct 15 '12 at 22:05
  • To clarify, do you mean you want a datastructure to store Mr John Smith, Mr John Doe, Dr John Smith, Dr John Doe, etc, efficiently? – DNA Oct 15 '12 at 22:05
  • @DNA: Yes. This will be a datastructure in memory. I will read lots of text from file and do a match against it. Currently I have an unholy combination of HashSets within HashSets but someone must have done something better – user1175332 Oct 15 '12 at 22:06

2 Answers


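Can you just use a regular expression? Names may be split across a line break in the text files; `\s+` matches any whitespace, including newlines, so the pattern below handles that case without extra flags (Pattern.MULTILINE only changes how `^` and `$` behave).

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    Pattern p = Pattern.compile("John\\s+Doe"); // \s+ also spans the line break
    Matcher m = p.matcher("I am looking for John \nDoe, I am.");
    System.out.println(m.find()); // true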
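
With many names, one option is to build a single alternation from the list instead of compiling one pattern per name. The following is a minimal sketch, not a tuned solution: the class `NameRegex` and its methods are hypothetical names introduced here, and it assumes that names are plain strings whose parts are separated by whitespace and that case-insensitive matching is wanted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: combines many names into one regular expression.
class NameRegex {
    // Builds an alternation of the form \b(?:John\s+Doe|Stan\s+Lee)\b.
    // Each name part is quoted so that characters such as '.' are taken literally.
    static Pattern compile(List<String> names) {
        List<String> alternatives = new ArrayList<>();
        for (String name : names) {
            List<String> quoted = new ArrayList<>();
            for (String part : name.trim().split("\\s+")) {
                quoted.add(Pattern.quote(part));
            }
            // Any run of whitespace, including line breaks, may separate the parts.
            alternatives.add(String.join("\\s+", quoted));
        }
        return Pattern.compile("\\b(?:" + String.join("|", alternatives) + ")\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Collects the matched text of every occurrence, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

Compiling the pattern once and reusing it for every file avoids repeated compilation. How well a very large alternation performs depends on the regex engine and the name list, so it is worth measuring before relying on it.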

You can also do this with command-line utilities such as pcregrep - see this related question.

Update: To address the question of storing names, a memory-efficient structure for storing related strings is a Trie, which might be of use. There are probably lots of free implementations, though there isn't one in the Java standard libraries as far as I know. See also this question and also this one for some suggestions.
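
To make the Trie idea concrete for multi-word names, here is a minimal sketch of a token-level trie, where each node maps the next lower-cased name token to a child node. The class `NameTrie`, its methods, and the tokenizer (split on anything that is not a letter) are assumptions made for illustration, not an existing library API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical token-level trie for matching complete multi-word names in text.
class NameTrie {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        String fullName; // non-null only on a node where a complete name ends
    }

    private final Node root = new Node();

    // Assumption: tokens are runs of letters; real names may need a smarter tokenizer.
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Inserts a name; names sharing leading tokens ("John Doe", "John Smith") share nodes.
    void add(String name) {
        Node node = root;
        for (String token : tokenize(name)) {
            node = node.children.computeIfAbsent(token, k -> new Node());
        }
        if (node != root) {
            node.fullName = name;
        }
    }

    // Returns every stored name whose tokens all appear consecutively in the text.
    List<String> findAll(String text) {
        List<String> tokens = tokenize(text);
        List<String> found = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            Node node = root;
            for (int i = start; i < tokens.size(); i++) {
                node = node.children.get(tokens.get(i));
                if (node == null) {
                    break; // no stored name continues with this token
                }
                if (node.fullName != null) {
                    found.add(node.fullName);
                }
            }
        }
        return found;
    }
}
```

After adding "John Doe" and "Stan Lee", a call such as `findAll("Stan Lee met John")` would report only "Stan Lee", because "John" alone never reaches a node that ends a name. Structurally this is close to the nested sets described in the question, with one map per node, so the gain is mainly in clarity and shared prefixes; a compact trie library may reduce the memory overhead further.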

DNA

As far as I understood your problem, you have to store arbitrary lists of names per author, and efficiently match them.

I assume you have solved the problem of parsing the names, removing non-essential / optional parts like 'Dr', and preserving particles like 'von' and 'de'. Your normalized name must be a sequence of strings in fixed case (lower case is OK, though I'd stick with upper case or title case).

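Now, a List<String> could serve as the key to a HashMap containing the other details, since its equals() and hashCode() depend on its contents. It is mutable, though, so a key that is changed after insertion can no longer be found. A String[] won't work at all: arrays inherit the identity-based equals() and hashCode() from Object, so two arrays holding the same parts count as different keys.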

So I'd come up with something like this:

    import java.util.Arrays;

    final class AuthorName {
        private final String[] parts;

        public AuthorName(String... nameParts) {
            assert nameParts.length > 0;
            parts = nameParts.clone(); // defensive copy keeps the key immutable
        }

        @Override
        public int hashCode() {
            // hashCode() that only depends on the name parts, in order
            return Arrays.hashCode(parts);
        }

        @Override
        public boolean equals(Object other) {
            // needed alongside hashCode() so a HashMap can find an equal key
            return other instanceof AuthorName
                && Arrays.equals(parts, ((AuthorName) other).parts);
        }
    }

    Map<AuthorName, ...> authors = new HashMap<AuthorName, ...>();
    authors.put(new AuthorName("John", "Doe"), ...);
    assert authors.get(new AuthorName("John", "Doe")) != null;

This does not address many possible problems, such as 'Joe Random User', 'Joe R User', and 'J. R. User' being the same person. This should be addressed on a different level.

If you stated your case in more detail, with an example or two, answers could be better.

You might also be interested in the way libraries normalize author names. People use elaborate schemes to match names.

9000
  • I have a set of names that I want to match against a text. Let's say one of those texts is a Wikipedia article: http://en.wikipedia.org/wiki/Spiderman. Parsing that article I find a match for "Peter Parker" and "Stan Lee". – user1175332 Oct 15 '12 at 22:41