large scale searching and sorting, operation in java for purpose of elimination: the puzzling case of

Question

I have a big list of the following form; for the purposes of this question we'll refer to it as Kraftwerk:

匹

屯

牙

友

I have another list of the following form, henceforth referred to as KomputerLove:

兪
yú
部首：入　
首尾分解: 人折

罙
shēn
部首：冖　
首尾分解: 冖木

叇
dài
部首：厶　
首尾分解: 云逮

Using Kraftwerk as a reference, I want to whittle down KomputerLove: if the main index of a KomputerLove entry (the first line of its block; in this example the indices are ['兪','罙','叇']) does not appear in Kraftwerk, we eliminate it.

I don't have much experience with this kind of searching and sorting operation. What would be the best way to accomplish this? Note that both Kraftwerk & KomputerLove are in reality fairly large, on the order of thousands of indices.

Those lists are stored just as you see them in .txt files.

Thousands is not that large... Rather, it is small by today's "big data" standards. But you don't tell how those are stored in the first place, so it is difficult to provide a useful answer. — fge, Feb 13 '15 at 10:36
Well then, for starters, you may want to store them in a dedicated medium instead; and this medium will depend on your "business requirements". You describe one scenario here but I doubt it is the only one, right? — fge, Feb 13 '15 at 10:38
mmm, this is sort of just like a one-time operation, i think just some kind of java function that could do it would do the trick, i can sort of see it, like reading in both lists and searching thru, if it doesn't find then eliminate, but i guess there might be a better way, i dont know. — , Feb 13 '15 at 10:39
whats a dedicated medium, like mysql database or mongo db or something? — , Feb 13 '15 at 10:40
Is the first line of a KompLove group always a single ideogram (one character)? — laune, Feb 13 '15 at 10:45
If this is limited to thousands of entries and speed is not a high priority, just read both files into strings, use a regex to find the indices in the second file, and String.replace to remove them from the first string. — Adrian Leonhard, Feb 13 '15 at 10:46
@AdrianLeonhard can you maybe write that into pseudocode or something, i think it sounds good but i can't think of how to execute it — , Feb 13 '15 at 10:51
Where do you need the result? Do you just want to eliminate those not found in KomputerLove, so in the end KomputerLove textfile has less elements? If that is so, it is probably best (shortest & fastest) to read Kraftwerk into memory and use a pipe to read/write KomputerLove textfile. — Sebastian, Feb 13 '15 at 11:01

score 0 · Answer 1 · answered Feb 13 '15 at 10:57

From what I understand of your question, kraftwerk isn't actually a list, but a set of strings, and komputerLove is some sort of composite data object (I presume each line of each block is a data field?), keyed by the first line of the block. 1000s of objects isn't particularly large so I'd start off with something simple like the following, and worry about performance if it's shown to be an issue:

Set<String> indexes = new HashSet<>(); // Add the indexes however you do at present
List<KomputerLoveObject> allObjects = new LinkedList<>(); // Add the objects however you do at present
List<KomputerLoveObject> filteredObjects = allObjects.stream()
                              // getIndex() is an assumed accessor returning the first line of the block
                              .filter(klo -> indexes.contains(klo.getIndex()))
                              .collect(Collectors.toList());

If you're not using Java 8 you can do it in the slightly more verbose way:

Set<String> indexes = new HashSet<>(); //Add the indexes however you do at present
List<KomputerLoveObject> allObjects = new LinkedList<>(); //Add the objects however you do at present
List<KomputerLoveObject> filteredObjects = new LinkedList<>();
for (KomputerLoveObject klo : allObjects) {
    if (indexes.contains(klo.getIndex())) { // getIndex(): assumed accessor for the block's first line
        filteredObjects.add(klo);
    }
}

If performance does prove to be an issue, move the filtering of komputerLove earlier, to the point where you're loading your files so you don't iterate over the whole data set twice, and also don't keep around two Lists. Depending on how you load these objects, you may be able to speed up the loading process as well.
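To make the snippets above runnable, `KomputerLoveObject` needs a concrete shape and a way to be built from the file contents. The following is a minimal sketch under assumptions that are not in the answer: the accessors `getIndex()` and `getFields()`, the static helpers `parseAll` and `keepListed`, and the convention that blocks are separated by blank lines with the index on the first line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical shape for the answer's KomputerLoveObject: the first line of
// a block is the index, the remaining lines are kept verbatim as fields.
final class KomputerLoveObject {
    private final String index;
    private final List<String> fields;

    KomputerLoveObject(List<String> blockLines) {
        this.index = blockLines.get(0).trim();
        this.fields = new ArrayList<>(blockLines.subList(1, blockLines.size()));
    }

    String getIndex() { return index; }

    List<String> getFields() { return fields; }

    // Groups the lines into blank-line-separated blocks, one object per block.
    static List<KomputerLoveObject> parseAll(List<String> lines) {
        List<KomputerLoveObject> result = new ArrayList<>();
        List<String> block = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                block.add(line);
            } else if (!block.isEmpty()) {
                result.add(new KomputerLoveObject(block));
                block = new ArrayList<>();
            }
        }
        if (!block.isEmpty()) { // the last block may not end with a blank line
            result.add(new KomputerLoveObject(block));
        }
        return result;
    }

    // Keeps only the objects whose index appears in the reference set.
    static List<KomputerLoveObject> keepListed(List<KomputerLoveObject> all,
                                               Set<String> indexes) {
        return all.stream()
                  .filter(klo -> indexes.contains(klo.getIndex()))
                  .collect(Collectors.toList());
    }
}
```

The input lines could come from `Files.readAllLines(Paths.get("komputerlove.txt"), StandardCharsets.UTF_8)`, assuming the file is UTF-8; naming the charset explicitly avoids depending on the platform default, which matters for files full of ideograms.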

I don't really know how to add them in, maybe I could just do some kind of BufferedReader br = new BufferedReader(new FileReader( input_location )); while ((text = br.readLine()) != null) and populate those structures like so? — , Feb 13 '15 at 11:03

Adrian Leonhard · Answer 2 · 2015-02-13T11:56:08.353

Some simple pseudocode:

public String idunno() {
    // for readFromFile see:
    // http://stackoverflow.com/questions/326390/how-to-create-a-java-string-from-the-contents-of-a-file
    String kraftwerk = readFromFile("kraftwerk.txt");
    String komputerLove = readFromFile("komputerlove.txt");

    Matcher m = Pattern.compile(regex).matcher(komputerLove);
    while(m.find()) {
        // removes the found ideogram from the first file;
        // replace() treats its argument literally, whereas replaceAll() expects a regex
        kraftwerk = kraftwerk.replace(m.group(1), "");
    }

    return kraftwerk;
}

EDIT: A possible regex is: public static String regex = "(.)((\\r\\n|\\r|\\n).+){3}"; This will match a single character followed by 3 non-empty lines, with the first character accessible with the first capture group.
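To see what the pattern captures, it can be run against a small sample. This is only a sketch: the class `KeyRegexDemo` and the method `keysOf` are hypothetical names, and ASCII placeholders stand in for the ideogram blocks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the answer.
final class KeyRegexDemo {

    // One character, followed by three non-empty lines.
    static final Pattern BLOCK = Pattern.compile("(.)((\\r\\n|\\r|\\n).+){3}");

    // Returns capture group 1 (the key character) of every match, in order.
    static List<String> keysOf(String text) {
        List<String> keys = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            keys.add(m.group(1));
        }
        return keys;
    }
}
```

For the input `"A\npin\nrad\ndec\n\nB\npin\nrad\ndec\n"`, `keysOf` returns the keys `A` and `B`. Because the pattern is not anchored to the start of a line, a two-character first line such as `AB` would yield `B`, so this approach relies on the first line of each block being a single character.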

Aside from readFromFile, you are missing the main points of the problem. — laune, Feb 13 '15 at 12:05

laune · Accepted Answer · 2015-02-13T12:08:07.707

This reads all the single ideograms into a Set. A pass through the file containing the line blocks copies those where the first line is not in the ideogram Set.

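import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Filter {
  Set<Character> keys = new HashSet<>();  // ideograms read from the reference file
  PrintWriter osw;

  // Writes the block if its first character is NOT in keys, then clears it.
  // Drop the '!' to keep only blocks whose first character IS in keys.
  void checkAndDump( List<String> lines ) throws Exception {
    if( lines.size() >= 1 &&
        ! keys.contains( lines.get(0).charAt(0) ) ){
      for( String s: lines ){
        osw.println( s );
      }
      osw.println();
    }
    lines.clear();
  }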

  void filter( String inpath, String outpath ) throws Exception {
    BufferedReader lr = new BufferedReader( new FileReader( inpath ) );
    osw = new PrintWriter( new FileOutputStream( outpath ) );
    String line;
    List<String> lines = new ArrayList<>();
    while( (line = lr.readLine()) != null ){
      if( line.length() > 0 ){
        lines.add( line );
      } else {
        checkAndDump( lines );
      }
    }
    checkAndDump( lines );
    osw.close();
    lr.close();
  }

  void fillSet( String path ) throws Exception {
    BufferedReader br = new BufferedReader( new FileReader( path ) );
    String line;
    while( (line = br.readLine()) != null ){
      if( line.length() > 0 ){
        keys.add( line.charAt(0) );
      }
    }
    br.close();
  }    

  public static void main( String[] args ) throws Exception {
    Filter f = new Filter();
    f.fillSet( "kraftwerk.txt" );
    f.filter( "love.txt", "lv.txt" );
  }
}

edited Feb 13 '15 at 12:08

answered Feb 13 '15 at 11:53

laune


Changed `Set<String>` to `Set<Character>`: only one ideogram according to OP. – laune Feb 13 '15 at 11:59
:/ I tried to run it but I got "could not find or load main class" – Feb 13 '15 at 12:39
How did you compile and execute? – laune Feb 13 '15 at 12:40
i put it into eclipse and added all the import statements, and added in the paths to the files, and then just javac Filter.java and java Filter – Feb 13 '15 at 12:42
Do you have Filter.class in the directory where you execute `java Filter`? – laune Feb 13 '15 at 12:44
yeah it generates the Filter.class after I run the javac Filter.java and then I do java Filter from the same directory, something really weird though is, although it says 'could not find or load main class' it generates an output file which is full of stuff. – Feb 13 '15 at 12:46
If you are on Linux: `grep main Filter.java`. This should filter a single line. - What is `echo $CLASSPATH`? - Is the "stuff" correct? – laune Feb 13 '15 at 12:49
I ran that command, this was the output: public static void main( String[] args ) throws Exception f.fillSet( "path.txt" ); – Feb 13 '15 at 12:51
there was nothing on my classpath. do i need to put the location of java to there? – Feb 13 '15 at 12:52
No. - The grep output is weird, too. This isn't all one line? – laune Feb 13 '15 at 12:53
Do you have an alias: "java" defined as "java main" or a shell script "java"? - The message "could not find or load main class" is not caused by my Java code. – laune Feb 13 '15 at 12:55
i guess it should be on one line but the path on my machine is super long so it got split to two lines. i think i can run it through eclipse, right now i'm trying to run that diff command and check the difference but i don't know a lot about diff, if i look at the output from diff how can i tell if something from KomputerLove has been deleted from the filtered result? – Feb 13 '15 at 12:56
Use `diff -u input.txt output.txt | less` or whatever names you have. – laune Feb 13 '15 at 12:57
so that puts a "-" symbol in front of everything that it removed? – Feb 13 '15 at 13:00
or everything that it didn't remove? how to understand that output? – Feb 13 '15 at 13:01
Yes, removed lines are prefixed by -. – laune Feb 13 '15 at 13:02
in that case I think that maybe the function might be working backwards, is that possible? – Feb 13 '15 at 13:04
or maybe my understanding of diff is backward, it should be showing, prefaced by a "-", the items that were removed from KomputerLove because they did not appear in Kraftwerk. it's showing a minus preceding all those that *do* appear in Kraftwerk, and all those with no "-" are not part of kraftwerk – Feb 13 '15 at 13:09
f.filter( "a.txt", "b.txt"): File a.txt is longer than b.txt. (Run `wc a.txt b.txt`). The lines you see prefixed by '-' are those in a.txt which do not appear in b.txt. – laune Feb 13 '15 at 13:09
but in fact, it now is such that, all the lines prefaced by "-" are actually all the ones that should be in 'b' – Feb 13 '15 at 13:18
Well, then I misunderstood your Q. Remove '!' from `! keys.contains(...`. (Your sentence "Using...eliminate it" is not very clear.) – laune Feb 13 '15 at 13:20
i think that thing with java is because I have it open in eclipse and elsewhere in sublime simultaneously, but i can run it through eclipse with no problem – Feb 13 '15 at 13:25
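Putting the outcome of this thread together: the filter should keep the blocks whose first line does appear in the reference file. A sketch of that variant follows; it is not laune's exact code. The class name `KeepListedFilter` and its method names are hypothetical, both files are assumed to be UTF-8, and the whole first line is compared as a `String` instead of `charAt(0)`, which also covers characters outside the Basic Multilingual Plane. The file names are the ones used in `main` above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: class and method names are hypothetical.
final class KeepListedFilter {

    // Every non-blank line of the reference file becomes one key.
    static Set<String> loadKeys(Path reference) throws IOException {
        Set<String> keys = new HashSet<>();
        for (String line : Files.readAllLines(reference, StandardCharsets.UTF_8)) {
            String key = line.trim();
            if (!key.isEmpty()) {
                keys.add(key);
            }
        }
        return keys;
    }

    // Copies to 'out' only the blocks of 'in' whose first line is a key.
    static void filter(Path in, Path out, Set<String> keys) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             PrintWriter writer = new PrintWriter(
                     Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            List<String> block = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    block.add(line);
                } else {
                    dumpIfListed(block, keys, writer);
                }
            }
            dumpIfListed(block, keys, writer); // last block may lack a trailing blank line
        }
    }

    // Writes the block followed by a blank line if its key is listed, then clears it.
    private static void dumpIfListed(List<String> block, Set<String> keys,
                                     PrintWriter writer) {
        if (!block.isEmpty() && keys.contains(block.get(0).trim())) {
            for (String s : block) {
                writer.println(s);
            }
            writer.println();
        }
        block.clear();
    }

    public static void main(String[] args) throws IOException {
        Set<String> keys = loadKeys(Paths.get("kraftwerk.txt"));
        filter(Paths.get("love.txt"), Paths.get("lv.txt"), keys);
    }
}
```

With this version, `diff -u love.txt lv.txt` should prefix with `-` exactly the blocks whose first line is missing from kraftwerk.txt.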
