parse a document with million words

Question

I have implemented some code to find the anagrams word in the txt sample.txt file and output them on the console. The txt document contains String (word) in each of line.

Is that the right Approach to use if I want to find the anagram words in txt.file with Million or 20 Billion of words? If not which Technologie should I use in this case?

I appreciate any help.

Sample

abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi

oupt

abac aabc abca
xyz zyx yxz

Code

package org.reader;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Test {
    // To store the anagram words
    static List<String> match = new ArrayList<String>();
    // Flag to check whether the checkWorld1InMatch() was invoked.
    static boolean flagCheckWord1InMatch;

    public static void main(String[] args) {
        String fileName = "G:\\test\\sample2.txt";
        StringBuilder sb = new StringBuilder();
        // In case of matching, this flag is used to append the first word to
        // the StringBuilder once.
        boolean flag = true;

        BufferedReader br = null;
        try {
            // convert the data in the sample.txt file to list
            List<String> list = Files.readAllLines(Paths.get(fileName));

            for (int i = 0; i < list.size(); i++) {

                flagCheckWord1InMatch = true;
                String word1 = list.get(i);

                for (int j = i + 1; j < list.size(); j++) {

                    String word2 = list.get(j);

                    boolean isExist = false;

                    if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
                        isExist = checkWord1InMatch(word1);

                    }

                    if (isExist) {
                        // A word with the same characters was checked before
                        // and there is no need to check it again. Therefore, we
                        // jump to the next word in the list.
                        // flagCheckWord1InMatch = true;
                        break;
                    } else {
                        boolean result = isAnagram(word1, word2);
                        if (result) {

                            if (flag) {
                                sb.append(word1 + " ");
                                flag = false;
                            }

                            sb.append(word2 + " ");

                        }
                        if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
                            match.add(sb.toString().trim());
                            sb.setLength(0);
                            flag = true;

                        }

                    }

                }
            }

        } catch (

        IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }

        for (String item : match) {
            System.out.println(item);
        }

        // System.out.println("Sihwail");

    }

    private static boolean checkWord1InMatch(String word1) {
        flagCheckWord1InMatch = false;
        boolean isAvailable = false;
        for (String item : match) {
            String[] content = item.split(" ");
            for (String word : content) {
                if (word1.equals(word)) {
                    isAvailable = true;
                    break;

                }
            }
        }
        return isAvailable;
    }

    public static boolean isAnagram(String firstWord, String secondWord) {
        char[] word1 = firstWord.toCharArray();
        char[] word2 = secondWord.toCharArray();
        Arrays.sort(word1);
        Arrays.sort(word2);
        return Arrays.equals(word1, word2);
    }

}

A db, with some [REVERSE](https://msdn.microsoft.com/en-us/library/ms180040.aspx) function is a strategy — , Jun 29 '16 at 11:38
Since you're comparing each entry with possibly every other entry, this approach is pretty slow. Don't compare them directly, instead prefer a rather implicit comparison by grouping. Use a `Map` where the key is the representative of a certain anagram group, in your case the presorted String (`aabc` represents the group of `aabc, abac, caba` and so on). Then the value is either a list/set of each item for that group or a file handler to write the values, to avoid keeping them in the memory. — Tom, Jun 29 '16 at 11:40

MrSmith42 · Accepted Answer · 2016-11-07T07:27:08.863

6

For 20 billion words you will not be to able to hold all of them in RAM so you need an approach to process them in chunks.

20,000,000,000 words. Java needs quite a lot of memory to store strings so you can count 2 bytes per character and at least 38 bytes overhead.

This means 20,000,000,000 words of one character would need 800,000,000,000 bytes or 800 GB, which is more than any computer I know has.

Your file will contain much less than 20,000,000,000 different words, so you might avoid the memory problem if you store every word only once (e.g. in a Set).

edited Nov 07 '16 at 07:27

answered Jun 29 '16 at 11:36

MrSmith42

9,961
6
38
49

MrSmith is right, if you are not sure about the actual amount of words you want to process, it is best to read the file in chunks. You could, for example, read the file line by line. **1st** instantiate a scanner: `Scanner scanner = new Scanner(fileName);` **2nd** replace your for-loop with `while(scanner.hasNextLine()){` and **then** read each word with `String word = scanner.nextLine();` **but** with this, you'll have no list of words to compare to. So you might have to find another way to check if a word is an anagram or not. – GameDroids Jun 29 '16 at 11:44
To deal with billion of words, would it be an alternative to store the data in database on Server and query them with MySQL? – The Time Jun 29 '16 at 11:51
A database or use the file-system. E.g you could split up your file into files in which each has all words of one specific length. Anagrams can only exists within words of the same length. If the files contain still too many words to process them in RAM, you can think about other ways to split the problem in tinier parts. – MrSmith42 Jun 29 '16 at 12:06

Joop Eggen · Answer 2 · 2016-06-29T11:50:56.270

First for a smaller number.

As it is better to use a more powerful data structure, do not read all lines in core, but read line-wise.

Map<String, Set<String>> mapSortedToWords = new HashMap<>();

Path path = Paths.get(fileName);
try (BufferedReader in = Files.newBufferedReader(Path, StandardCharsets.UTF_8)) {
    for (;;) {
        String word = in.readLine();
        if (word == null) {
            break;
        }
        String key = sorted(word);
        Set<String> words = mapSortedToWords.get(key);
        if (words == null) {
            words = new TreeSet<String>();
            mapSortedToWords.put(key, words);
        }
        words.add(word);
    }
}
for (Set<String> anagrams : mapSortedToWords.values()) {
    if (anagrams.size() > 1) {
        ... anagrams
    }
}

static String sorted(String word) {
    char[] letters = word.toCharArray();
    Arrays.sort(letters);
    return new String(letters);
}

This stores in the map a set of words. Comparable with abac aabc abca.

For a large number a database where you store (sortedLetters, word) would be better. An embedded database like Derby or H2 poses no installation problems.

As written in my comment, OP could also use a filehandler as the map value, to write each item to a file, instead of keeping it in the ram (if he really reads huge files). But a `Set` is fine for smaller files. And I like, that we had the same idea :D. — Tom, Jun 29 '16 at 11:49

score 2 · Answer 3 · edited May 23 '17 at 10:30

For the kind of file size that you specify ( 20 billion words), obviously there are two main problems with your code,

List<String> list = Files.readAllLines(Paths.get(fileName));

AND

for (int i = 0; i < list.size(); i++)

These two lines in your programs basically question,

Do you have enough memory to read full file in one go?
Is it OK to iterate 20 billion times?

For most systems, answer for both above questions would be NO.

So your target is to cut down memory foot print and reduce the number of iterations.

So you need to read your files chunk by chunk and use some kind of search data structures ( like Trie ) to store your words.

You will find numerous questions on SO for both of above topics like,

Fastest way to incrementally read a large file

Finding anagrams for a given word

Above algorithm says that you have to first create a dictionary for your words.

Anyway, I believe there is no ready made answer for you. Take a file with one billion words ( that is a very difficult task in itself ) and see what works and what doesn't but your current code will obviously not work.

Hope it helps !!

erolkaya84 · Answer 4 · 2016-06-29T11:59:40.450

0

Update

You can use a map for finding the anagrams like below. For each word you have you may sort its chars and obtain a sorted String. So, this would be the key of your anagrams map. And values of this key will be the other anagram words.

public void findAnagrams(String[] yourWords) {
    Map<String, List<String>> anagrams = new HashMap<String, List<String>>();
    for (String word : yourWords) {
        String sortedWord = sortedString(word);
        List<String> values = anagrams.get(sortedWord);
        if (values == null) 
            values = new LinkedList<>();

        values.add(word);
        anagrams.put(sortedWord, values);
    }

    System.out.println(anagrams);
}

private static String sortedString(String originalWord) {

    char[] chars = originalWord.toCharArray();
    Arrays.sort(chars);
    String sorted = new String(chars);
    return sorted;
}

edited Jun 29 '16 at 11:59

answered Jun 29 '16 at 11:39

erolkaya84

1,769
21
28

What should the integer be? Wouldn't a String be more intuitive? – Tom Jun 29 '16 at 11:41
What would that String be? If there are "acab", "abac", "baac" and "caab" which one would you choose as a key? I believe an integer as a key might be more proper. In this case anagrams map will look like {7, ["acab", "abac", "baac","caab" ] } – erolkaya84 Jun 29 '16 at 11:44
*"which one would you choose"* obviously "aabc", since this is the pre-sorted representative for an anagram group of every combination with two "a", one "b" and one "c" and since OP sorts the char arrays anyway. And what is "7" in your example? Why isn't it "8" or "42"? – Tom Jun 29 '16 at 11:46
1

hmm I see what you mean now. Thanks. I will update my answer soon. – erolkaya84 Jun 29 '16 at 11:52
1

The problem with `int` is, if you calculate the key with the value of each char (for example), then it is possible to get equal keys with different strings (like "ac" and "bb", both have the added value "196"). You would need some kind of collision free hashcode. A String would be easier, I guess. – Tom Jun 29 '16 at 11:57

score 0 · Answer 5 · answered Jun 29 '16 at 11:49

Use a stream to read the file. That way you are only storing one word at once.

FileReader file = new FileReader("file.txt"); //filestream

String word;

while(file.ready()) //return true if there a bytes left in the stream
{
    char c = file.read(); //reads one character
    if(c != '\n') 
    {
        word+=c;
    }
    else {
    process(word); // do whatever you want
    word = "";
    }
}

parse a document with million words

5 Answers5