0

I have a Spring-based Java webapp, and my problem is this:

I have a 34 MB file with 2.7 million lines. Each line is just a single word, one after another:

abc
abcdfg
xyz
etc

I need to pick 15 random, unique lines from this file, none of which are next to each other, and do it reasonably quickly. I know that Apache Lucene can be used to search files this big. Do you know whether Lucene can fetch these random lines for me? Or maybe you have another idea that could help me solve this problem.

I would really appreciate any help

Thanks in advance

EDIT:

Or maybe I should just put this file into a database (PostgreSQL)?

Mariusz Grodek
  • 3
    If it doesn't have to be perfect, you could seek to a random position within the file, read until the next beginning-of-line (wrapping around to the beginning if the end is reached), and return the next line. This will, over time, accumulate a bias towards lines after longer lines. You could correct this bias by padding all lines to the same length with whitespace. – Wug Oct 24 '12 at 13:17
  • 2
    If you want to select some lines at random, then Lucene cannot help you, as it is a full-text indexing/searching library (http://en.wikipedia.org/wiki/Lucene). – Vikdor Oct 24 '12 at 13:18
  • Sorry, maybe you misunderstood me: I need 15 unique lines that are not next to each other. – Mariusz Grodek Oct 24 '12 at 13:20
  • 1
    Take a look at this answer: http://stackoverflow.com/a/2218361/1766873 –  Oct 24 '12 at 13:36
  • I will take a closer look at this. – Mariusz Grodek Oct 24 '12 at 13:54
  • Thank you, Oleg, your hint helped me! :) – Mariusz Grodek Dec 10 '12 at 13:22
  • Wug, your comment also helped me. I am reading lines as you said, and then I check the probability the way Oleg proposed. In the evening I will show you my approach in detail :) – Mariusz Grodek Dec 10 '12 at 13:23
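
The seek-based idea from Wug's comment might look roughly like the sketch below. It is only an illustration, not the code the asker ended up with: the class and method names are hypothetical, it assumes single-byte (ASCII) words, and it assumes the file has far more lines than are requested. It also keeps the bias Wug mentions (lines that follow long lines are picked more often) and does not include the probability check from the linked answer.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

class RandomSeekSampler {

    // Hypothetical helper: picks `count` distinct, non-adjacent lines by
    // seeking to random byte positions. Assumes single-byte (ASCII) text and
    // a file with far more than `count` lines, otherwise the loop may not end.
    static List<String> sample(String path, int count, Random rnd) throws IOException {
        List<String> result = new ArrayList<>();
        Set<Long> starts = new HashSet<>(); // start offsets of lines already picked
        Set<Long> ends = new HashSet<>();   // offsets just past lines already picked
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long length = file.length();
            while (result.size() < count) {
                file.seek((long) (rnd.nextDouble() * length));
                file.readLine();                  // discard the line we landed in
                long start = file.getFilePointer();
                if (start >= length) {            // past the last line: wrap to the first
                    start = 0;
                    file.seek(0);
                }
                String line = file.readLine();
                long end = file.getFilePointer(); // this is where the next line starts
                // Reject the same line, the line right after a picked one,
                // and the line right before a picked one.
                if (line == null || starts.contains(start)
                        || ends.contains(start) || starts.contains(end)) {
                    continue;
                }
                starts.add(start);
                ends.add(end);
                result.add(line);
            }
        }
        return result;
    }
}
```

A call such as `RandomSeekSampler.sample("MyFile.txt", 15, new Random())` reads only a handful of bytes per request, so many concurrent users would not each scan the whole 34 MB file.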

2 Answers

1

Lucene would not work for you.

Instead just generate random numbers (make sure they are not next to each other) and then read those lines from the text file.

Here is the code that does it:

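  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Random;
  import java.util.Set;

  public class RandomLines
  {
    public static void main(String[] args) throws IOException
    {
      BufferedReader reader = new BufferedReader(new FileReader(
          "MyFile.txt"));
      try
      {
        // Placeholder: set this to the number of lines in your file
        // (about 2.7 million according to the question).
        final int MAX_NUM = 2700000;
        Set<Integer> randomLines = new HashSet<Integer>();
        Random rnd = new Random(System.currentTimeMillis());
        // Keep drawing until we really have 15 line numbers; a fixed
        // 15-iteration loop would come up short after a rejected draw.
        while (randomLines.size() < 15)
        {
          int aNum = rnd.nextInt(MAX_NUM);
          // to make sure no lines next to each other...
          if (!randomLines.contains(aNum) && !randomLines.contains(aNum+1) && !randomLines.contains(aNum-1))
          {
            randomLines.add(aNum);
          }
        }
        // Single pass over the file, keeping only the chosen line numbers.
        List<String> result = new ArrayList<String>();
        String aLine;
        int lineNo = 0;
        while ((aLine = reader.readLine()) != null)
        {
          if (randomLines.contains(lineNo))
          {
            result.add(aLine);
          }
          lineNo++;
        }
        System.out.println("Result: " + result);
      }
      finally
      {
        reader.close();
      }
    }
  }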
user1697575
  • Is this an efficient way? What about a scenario where 200 users do this at the same time? The while loop goes through every line of the file up to the requested one, so if that is line 2.6 million and 200 users do the same, couldn't my webapp run out of memory very quickly? – Mariusz Grodek Oct 24 '12 at 13:53
  • Yes, it's not that efficient. What you can also do is calculate the byte offset of every line up front, e.g. line 10 starts at byte 100, line 11 starts at byte 105, and so on. Then, when a user wants 15 random lines, instead of iterating over the file you simply look up the offsets of the desired line numbers and read those lines directly from the file, e.g. RandomAccessFile reader = new RandomAccessFile("aaaa.txt", "r"); reader.seek(offset-of-the-desired-line); reader.readLine(); – user1697575 Oct 24 '12 at 13:58
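
The offset table described in the last comment could be built along these lines. This is a minimal sketch under stated assumptions, not the answerer's actual code: the `LineOffsetIndex` class and its methods are hypothetical names, lines are assumed to end in `\n` (or `\r\n`), and the text is assumed to be single-byte so that `RandomAccessFile.readLine()` decodes it correctly.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Arrays;

class LineOffsetIndex {
    private final String path;
    private final long[] offsets; // offsets[i] = byte position where line i starts

    // Scans the file once and records where every line starts.
    LineOffsetIndex(String path) throws IOException {
        this.path = path;
        long[] buf = new long[1024];
        int count = 0;
        long pos = 0;
        boolean atLineStart = true;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    // Grow the array when it is full, then record this line's start.
                    if (count == buf.length) buf = Arrays.copyOf(buf, count * 2);
                    buf[count++] = pos;
                    atLineStart = false;
                }
                if (b == '\n') atLineStart = true;
                pos++;
            }
        }
        offsets = Arrays.copyOf(buf, count);
    }

    int lineCount() {
        return offsets.length;
    }

    // Jumps straight to the requested line (0-based) instead of reading from the top.
    String readLine(int lineNo) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offsets[lineNo]);
            return file.readLine();
        }
    }
}
```

The index would be built once at startup and shared by all requests; each request then picks its 15 line numbers below `lineCount()` and calls `readLine` for each. For 2.7 million lines the `long[]` takes roughly 22 MB, and an `int[]` would halve that since a 34 MB file fits in the `int` range.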
0

I would suggest using MongoDB (it is not as reliable as an RDBMS, but it is extremely quick).

http://www.mongodb.org/display/DOCS/Quickstart I would parse the text file into Mongo documents and then retrieve 3 random docs from Mongo, which would give 3 random phrases.

1) In Java, read the text file and save each line as a separate doc in Mongo, or run commands like these directly in the mongo shell:

> doc = { phrase : 'uniquephrase'}
> db.posts.insert(doc); 

2) In your Java code, connect to Mongo, get the collection size, pick 3 random numbers below that size, then serve those 3 docs... (or anything else)

Marcin Wasiluk