Reading a File's Content and then doing analysis upon them

Question

The project I am currently working on has me reading a file and then analyzing the data inside it. Using FileReader, I have read each line of the file into an array. The file looks like the following:

01 02 03 04 05 06
02 03 04 05 06 07
03 04 05 06 07 08
04 05 06 07 08 09

These are not the exact numbers, but they make a good example. I am now trying to find out how many times, say, the number "04" appears in my data. I was thinking of putting all the data in a two-dimensional array by breaking each line apart, but I am not quite sure how to do this. Will I need a parser, or is it possible to use some type of string function (like split) to split this data apart and then store it in the array?

When dealing with such questions, I would urge you (as well as other programmers out there) to start by modeling the whole thing in code first. Once the code using sample text works, THEN add the disk & I/O capacities. — ControlAltDel, Dec 29 '16 at 18:26
No, I'm not specifically talking about sample data vs actual data. My advice is to skip the I/O plumbing implementation until you've confirmed that the logic of your program is otherwise correct — ControlAltDel, Dec 29 '16 at 19:18
Final statement: are you really sure you want to "program" this instead of using an Excel or LibreOffice spreadsheet?! — GhostCat, Dec 30 '16 at 08:52
@GhostCat Yes, I'm doing this as practice. Also why not both? — Eric, Dec 30 '16 at 15:17
If you know all the possible numbers before hand, then you need only a single dimensional integer or long array. When you encounter a number, increment the int/long at its predefined index and at the end you will have the counts in the array. — prajeesh kumar, Dec 31 '16 at 06:53
@GhostCat not a problem, though I found accepting an answer was a little difficult. I see you guys are trying to point me in the right direction without doing it for me though I still feel like I'm grasping at straws. — Eric, Jan 04 '17 at 17:40
I wish I could accept two answers as Patrick Parker and GhostCat were both very helpful in their answers. While GhostCat tried to guide me towards doing some of the work myself Patrick gave examples of multiple ways to do what I am trying to. — Eric, Jan 04 '17 at 17:46
Don't worry. I just gave some upvotes to Patrick, so he got compensated too ;-) — GhostCat, Jan 04 '17 at 17:50
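
The single-array counting suggested in the comments above could look roughly like the sketch below. It assumes every value is an integer from 0 to 99 (the sample data only shows two-digit numbers); the class name `ArrayCounter`, the constant `MAX_VALUE`, and the method `countAll` are made up for illustration.

```java
class ArrayCounter {
    // Assumption for illustration: every value is an integer from 0 to 99.
    static final int MAX_VALUE = 99;

    // Returns an array where counts[n] is how often the number n occurs in the text.
    static long[] countAll(String text) {
        long[] counts = new long[MAX_VALUE + 1];
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the input is blank
            }
            // "04" parses to 4, so the number itself serves as the index
            counts[Integer.parseInt(token)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample text from the question stands in for the file contents.
        String sample = "01 02 03 04 05 06\n02 03 04 05 06 07\n"
                      + "03 04 05 06 07 08\n04 05 06 07 08 09\n";
        long[] counts = countAll(sample);
        System.out.println("Count of 04 = " + counts[4]);
    }
}
```

A value outside the assumed range would throw an ArrayIndexOutOfBoundsException, which is one reason a Map is the more forgiving choice when the possible values are not known in advance.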

score 1 · Answer 1 · answered Dec 29 '16 at 18:21

If you only need to count the 04's, you REALLY don't need to store the whole file. You could, for example, read each line, check it for 04's, and add to a counter as you go. You could even read character by character, but that would be tedious for slight (if any) efficiency gains.

If the processing you need to do on the file is more complex, this approach may not be up to the task. But unless you specify what that is, I can't say if it is or not.
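
As a rough sketch of this approach, the code below keeps only a running counter and never stores the input. It uses a `Scanner` over sample text so that it runs without a file on disk; the class name `TokenCounter` and the method `countToken` are hypothetical, and a `Scanner` built from a file could be passed in the same way.

```java
import java.util.Scanner;

class TokenCounter {
    // Counts the whitespace-separated tokens equal to `target`.
    // Only the counter is kept in memory, not the tokens themselves.
    static long countToken(Scanner in, String target) {
        long count = 0;
        while (in.hasNext()) {
            if (in.next().equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample text from the question stands in for the file contents.
        String sample = "01 02 03 04 05 06\n02 03 04 05 06 07\n"
                      + "03 04 05 06 07 08\n04 05 06 07 08 09\n";
        try (Scanner in = new Scanner(sample)) {
            System.out.println("Count of 04 = " + countToken(in, "04"));
        }
    }
}
```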

answered Dec 29 '16 at 18:21

Scott Hunter


Eventually I would like to have a count of how many times EVERY number in the file appears. (I'm expecting for multiple occurrences of every number.) Then I will be checking the percent chance of a number having been in the same line with another number. – Eric Dec 29 '16 at 18:52
Same reasoning applies; you may need arrays or maps to store the counts, but not the file contents. – Scott Hunter Dec 29 '16 at 19:13
I gotcha, I understand it would be unneeded to store the whole file but I just figured since this is a small project for my own use it would not really matter. Though is storing the counts that much easier without storing the file contents? – Eric Dec 29 '16 at 19:29

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

You should use a Map to hold the count of occurrences, like so:

public static void main(String[] args) throws IOException {
    Pattern splitter = Pattern.compile("\\s+");
    try (Stream<String> stream = Files.lines(Paths.get("input.txt"))) {
        Map<String,Long> result = stream.flatMap(splitter::splitAsStream)
                .collect(Collectors.groupingBy(Function.identity(),
                        Collectors.counting()));
        System.out.println(result);
    }
}

Or load the data and parse it in multiple stages:

public static void main(String[] args) throws IOException {
    // 1. load the data array
    String[][] data;
    try(Stream<String> stream = Files.lines(Paths.get("numbers.txt"))) {
        data = stream.map(line -> line.split("\\s+")).toArray(String[][]::new);
    }
    System.out.format("Total lines = %d%n", data.length);

    // 2. count the occurrences of each word
    Map<String,Long> countDistinct = Arrays.stream(data).flatMap(Arrays::stream)
            .collect(Collectors.groupingBy(Function.identity(),
                    Collectors.counting()));
    System.out.println("Count of 04 = " + countDistinct.getOrDefault("04", 0L));

    // 3. calculate correlations 
    Map<String,Map<String,Long>> correlations;
    correlations = Arrays.stream(data).flatMap((String[] row) -> {
        Set<String> words = new HashSet<>(Arrays.asList(row));
        return words.stream().map(word -> new AbstractMap.SimpleEntry<>(word, words));
    }).collect(Collectors.toMap(kv -> kv.getKey(),
            kv -> kv.getValue().stream()
                    .collect(Collectors.toMap(Function.identity(), v -> 1L)),
            (map1, map2) -> {
                map2.entrySet().forEach(kv -> map1.merge(kv.getKey(), kv.getValue(), Long::sum));
                return map1;
            }));
    // Collections.emptyMap() is the type-safe form of the raw Collections.EMPTY_MAP
    System.out.format("Lines with 04 = %d%n",
        correlations.getOrDefault("04", Collections.emptyMap()).getOrDefault("04", 0L));
    System.out.format("Lines with both 04 and 07 = %d%n",
        correlations.getOrDefault("04", Collections.emptyMap()).getOrDefault("07", 0L));
}

EDIT:

Here is a (perhaps) easier to read version that doesn't use a Stream/functional approach:

public static void main(String[] args) throws IOException {
    long lineCount = 0;
    Map<String,Long> wordCount = new HashMap<>();
    Map<String,Map<String,Long>> correlations = new HashMap<>();
    try(Stream<String> stream = Files.lines(Paths.get("numbers.txt"))) {
        Iterable<String> lines = stream::iterator;
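        for(String line : lines) {
            lineCount++;
            // Start with a fresh set for every line; a shared set would carry
            // words from earlier lines into the correlations of later ones
            Set<String> lineWords = new HashSet<>();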
            for(String word : line.split("\\s+")) {
                lineWords.add(word);
                wordCount.merge(word, 1L, Long::sum);
            }
            for(String wordA : lineWords) {
                Map<String,Long> relate = correlations.computeIfAbsent(wordA,
                        key -> new HashMap<>());
                for(String wordB : lineWords) {
                    relate.merge(wordB, 1L, Long::sum);
                }
            }
        }
    }
    System.out.format("Total lines = %d%n", lineCount);
    System.out.println("Count of 04 = " + wordCount.getOrDefault("04", 0L));
    // Collections.emptyMap() is the type-safe form of the raw Collections.EMPTY_MAP
    System.out.format("Lines with 04 = %d%n",
        correlations.getOrDefault("04", Collections.emptyMap()).getOrDefault("04", 0L));
    System.out.format("Lines with both 04 and 07 = %d%n",
        correlations.getOrDefault("04", Collections.emptyMap()).getOrDefault("07", 0L));
}

Output:

Total lines = 4
Count of 04 = 4
Lines with 04 = 4
Lines with both 04 and 07 = 3
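
The "percent chance" mentioned in the comments on the first answer is not computed by the code in this answer. One possible reading is "of the lines that contain A, what percentage also contain B", which is the ratio of the last two figures above. The sketch below works under that assumption; the class name `CoOccurrence` and the method `percentWith` are hypothetical, and sample lines stand in for the file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CoOccurrence {
    // Of the lines that contain `a`, returns the percentage that also contain `b`.
    // Returns 0.0 when no line contains `a` (a choice made for this sketch).
    static double percentWith(List<String> lines, String a, String b) {
        long linesWithA = 0;
        long linesWithBoth = 0;
        for (String line : lines) {
            // A set per line, so repeated words on one line count only once
            Set<String> words = new HashSet<>(Arrays.asList(line.trim().split("\\s+")));
            if (words.contains(a)) {
                linesWithA++;
                if (words.contains(b)) {
                    linesWithBoth++;
                }
            }
        }
        return linesWithA == 0 ? 0.0 : 100.0 * linesWithBoth / linesWithA;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "01 02 03 04 05 06",
                "02 03 04 05 06 07",
                "03 04 05 06 07 08",
                "04 05 06 07 08 09");
        // 04 is on 4 lines and 3 of those also contain 07, so this prints 75.0
        System.out.println(percentWith(lines, "04", "07"));
    }
}
```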

Interesting, but I have no clue what a lot of this does and I'd like to understand it better. — Eric, Dec 29 '16 at 18:57
@Eric of course, I hope you will code it in a simpler way first. But my point that you should use a Map still stands. I advise you to read on the documentation of the Map class. — Patrick Parker, Dec 29 '16 at 19:11
Thank you very much, I will do so before I ask any more questions about your response. — Eric, Dec 29 '16 at 19:31
@Eric I have modified my answer based on your updated requirements. Please remember to upvote any answers you find helpful and accept the answer which is correct, if any. — Patrick Parker, Dec 30 '16 at 13:24
I guess this helps a lot since you nearly completed the project. But I wanted to learn as I did it and I don't know how your code works. — Eric, Dec 30 '16 at 15:31
@Eric I updated my answer to include an easier to read approach. unfortunately Stack Overflow isn't an appropriate venue for a lengthy tutorial. however if you have more specific problems with a part, i.e. a single step along the way, then post that as a separate question about your problem. Also don't forget to accept the correct answer, if any. — Patrick Parker, Dec 31 '16 at 06:42
I appreciate all the help, though I understand your third example a lot more I still have many questions. Is there any way you suggest figuring out the answers to my questions or getting a short tutorial on how certain things here work? — Eric, Jan 04 '17 at 17:36
@Eric I suggest you go to the ##java channel on freenode IRC, (read the FAQ of course) then post questions in the chat as needed. — Patrick Parker, Jan 04 '17 at 17:45
As Eric can't do that ... but he is so happy with your great answer; I had some upvotes coming your way ... — GhostCat, Jan 04 '17 at 17:50
@GhostCat thanks but it's moreso the fact the questions aren't being marked answered that annoys me. At last it's been marked I see now. — Patrick Parker, Jan 04 '17 at 19:17

score 0 · Answer 3 · edited May 23 '17 at 12:01

Edit: I overlooked that you've already read the file into an array. In that case, you can skip straight to checking each entry in the array for the substring.

Assuming you are using a text file or similar for the input, you can read the file line-by-line and count the number of "04" in each line as you read it. You could use a buffered reader like this:

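// "input.txt" is a placeholder file name
try (BufferedReader br = new BufferedReader(new FileReader("input.txt"))) {
    String line;
    while ((line = br.readLine()) != null) {
        // process each line
    }
}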

To count the occurrences of your desired string, you can refer to this other answer:

Occurrences of substring in a string
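
One common way to count substring occurrences is a loop over `String.indexOf`, sketched below. This is not necessarily the method from the linked answer, and the class name `SubstringCounter` and the method `countOccurrences` are made up for illustration. The sketch counts non-overlapping matches.

```java
class SubstringCounter {
    // Counts non-overlapping occurrences of `target` inside `text`.
    static int countOccurrences(String text, String target) {
        if (target.isEmpty()) {
            return 0; // avoid an endless loop on an empty search string
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(target, from)) != -1) {
            count++;
            from += target.length(); // continue searching after this match
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("02 03 04 05 06 07", "04")); // prints 1
    }
}
```

Note that a plain substring search also matches inside longer tokens: "04" would be found in "104" or "040". If the file can contain such values, comparing whole tokens after splitting on whitespace is safer.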

answered Dec 29 '16 at 18:31

Alex

I'll take a look at the link you posted, but at the moment I am using a File Reader + BufferedReader to read in the text and place it in an array. Though I feel like breaking apart each line into Multi Dimensional array would work better, I just don't know how at the moment. – Eric Dec 29 '16 at 19:00
With the additional info about the requirements you provided, yes, it could be better to store the input. You could create a multi-dimensional array like you mentioned, where each entry in the outer array is an array of strings created using the Java split method. You would pass the space character as the delimiter. – Alex Dec 29 '16 at 19:25
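
The two-dimensional array described in the last comment could be built roughly as follows. The sketch assumes the lines are already in memory as a list, as in the question; the class name `GridBuilder` and the method `toGrid` are hypothetical. It splits on the regular expression `\\s+` (one or more whitespace characters) rather than a single space, so repeated spaces or tabs do not produce empty entries.

```java
import java.util.Arrays;
import java.util.List;

class GridBuilder {
    // Turns each line into one row of tokens; rows may differ in length.
    static String[][] toGrid(List<String> lines) {
        String[][] grid = new String[lines.size()][];
        for (int row = 0; row < lines.size(); row++) {
            grid[row] = lines.get(row).trim().split("\\s+");
        }
        return grid;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "01 02 03 04 05 06",
                "02 03 04 05 06 07",
                "03 04 05 06 07 08",
                "04 05 06 07 08 09");
        String[][] grid = toGrid(lines);
        // Row 1, column 2 (both zero-based) holds "04"
        System.out.println(grid[1][2]);
    }
}
```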

GhostCat · Accepted Answer · 2016-12-29T19:17:35.310

You are "premature" in your design ideas; for example about using a 2D array here.

You see, you really have to get a better understanding of your requirements before you start thinking about design/implementation choices.

Example: when you only care about measuring how often some number shows up overall, using a 2D array won't do any good. Instead, you could just put all numbers into one long List<Integer> and then use some of the fancy Java 8 stream operations on that, for example.
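
A minimal sketch of that idea, with sample lines standing in for the file: all numbers go into one List<Integer>, and a stream filter does the counting. The class name `ListCounter` and the methods `parseAll` and `countOf` are made up for illustration. Parsing to Integer drops the leading zero, so "04" is stored as 4.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ListCounter {
    // Flattens all lines into one list of numbers.
    static List<Integer> parseAll(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(token -> !token.isEmpty()) // skip blank lines
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    // Counts how often `wanted` appears in the list.
    static long countOf(List<Integer> numbers, int wanted) {
        return numbers.stream().filter(n -> n == wanted).count();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "01 02 03 04 05 06",
                "02 03 04 05 06 07",
                "03 04 05 06 07 08",
                "04 05 06 07 08 09");
        List<Integer> numbers = parseAll(lines);
        System.out.println("Count of 04 = " + countOf(numbers, 4));
    }
}
```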

But if that was just one example out of many, then other ways to manage your data in memory might be more efficient.

Beyond that: if you find that the things you will be doing with this data go beyond simple calculations, Java is probably not the best choice. Languages like R are specifically designed to do just that: crunch large amounts of data and give you "instant" access to a wide range of statistical operations.

And to answer your idea about counting the occurrences of all the different numbers; that is really simple: you use a Map<Integer, Integer> here; like in:

Map<Integer, Integer> numbersWithCount = new HashMap<>();

Now you loop over your data, and for each data point:

int currentNumber = ...; // next number from your input data

int counterForNum;
if (numbersWithCount.containsKey(currentNumber)) {
    // currentNumber was seen before: increase its stored counter
    counterForNum = numbersWithCount.get(currentNumber) + 1;
} else {
    // currentNumber found the first time
    counterForNum = 1;
}
// store the new counter under the number itself
numbersWithCount.put(currentNumber, counterForNum);

In other words: you just iterate over all incoming numbers, and for each of those, you either create a new counter; or increase one that was already stored.

And if you use a TreeMap instead of a HashMap, you even get your keys sorted. Many possibilities there ...
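
To illustrate the TreeMap remark, here is a small sketch that counts numbers and prints them in ascending key order. It uses `Map.merge` (available since Java 8) as a shorter equivalent of the containsKey/get/put sequence; the class name `SortedCounter` and the method `countSorted` are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class SortedCounter {
    // Counts each number; the TreeMap keeps its keys in ascending order.
    static Map<Integer, Integer> countSorted(int[] numbers) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int number : numbers) {
            // Inserts 1 for a new key, otherwise adds 1 to the stored count
            counts.merge(number, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] numbers = {4, 1, 4, 2};
        System.out.println(countSorted(numbers)); // prints {1=1, 2=1, 4=2}
    }
}
```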

edited Dec 29 '16 at 19:17

answered Dec 29 '16 at 18:44

GhostCat

I see, well Scott also mentioned that with the current requirements I might be overthinking it. So I'll copy my response to him here: Eventually I would like to have a count of how many times EVERY number in the file appears. (I'm expecting multiple occurrences of every number.) Then I will be checking the percent chance of a number having been in the same line with another number. Do you have any suggestions for something better, or do you still think a list would work fine? – Eric Dec 29 '16 at 18:56
@Eric I updated my answer. Your idea from the comment can be nicely resolved using a Map; but if you have more ideas in mind, I would actually suggest not using Java, but something like R, or maybe Python with NumPy. – GhostCat Dec 29 '16 at 19:18
I see, well I have no experience at the moment with the other languages suggested, and I am trying to practice my Java while also doing this project. (This is all learning and fun for me, not for work or school.) I think I will have to look up Maps again to see how they work and get a correct idea of how the data is stored inside them. With an Array[][] you get to choose the column and row the data is stored in, but I'm not sure how it works with a Map. Thank you for all your help so far, by the way! – Eric Dec 29 '16 at 19:35
As said, it depends on what you need. A map by itself has no awareness of "order of inserts", and it is a flat data structure, just **mapping** a key to a value. The information about the two-dimensional layout of the input data is lost there. But on the other hand, it makes a lot of computations very easy. You might even end up using many different ways to represent your data! – GhostCat Dec 29 '16 at 19:50
Thank you for the information, you have given me a direction to go towards though my actual project is a little more complex than the example I used to ask the question so I may end up creating another post with more questions down the line. – Eric Jan 04 '17 at 17:35
