
I really need help finding the fastest way to count how many times each element of a 500,000-element one-dimensional array occurs inside the DATA file. In effect it is an occurrence count of every element of the huge array.

I need every line of the ArrayDataFile searched for in the DATA file. I generate the array with declare, then readarray the ArrayDataFile, and loop over it with a for loop that greps the DATA file in my Documents folder. Is this the best code for the job? Each array element is searched for in the DATA file, but the list is 500,000 items long. The listing below is an exact replica of the real file contents, cut off around line 40 or so; the real file to be used is 600,000 lines long. I need to optimize this search so that the DATA file can be searched as fast as possible on my outdated hardware:

DATA file is

1321064615465465465465465446513213321378787542119 #### the actual real life string is a lot longer than this space provides!!

The ArrayDataFile (all of its elements are unique strings) is

1212121
11
2
00
215
0
5845
88
1
6
5
133
86 ##### . . . etc etc on to 500 000 items

The bash script I have been able to put together to accomplish this:

#!/bin/bash
# Read the 500,000 patterns into an array (-t strips the trailing newline from each line).
declare -a Array
readarray -t Array < ArrayDataFile
# For each pattern, count its non-overlapping occurrences in the DATA string.
for each in "${Array[@]}"
do
    LC_ALL=C fgrep -o "$each" '/home/USER/Documents/DATA' | wc -l >> GrepResultsOutputFile
done

To quickly search for a 500,000-element one-dimensional array taken from the lines of the ArrayDataFile, what is the absolute best way to optimize the code for search speed? I need the output written one count per line into the GrepResultsOutputFile. It does not have to be the same code; any method that is the fastest will do, be it sed, awk, grep, or anything else.
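
For example, is something along the lines of a single awk process what I should be aiming for? The following is only a rough, untested sketch: it assumes the DATA file really is one single line and that the patterns are plain digit strings with no regex special characters (so gsub treats them literally), and it still scans the string once per pattern, just without spawning 500,000 grep processes.

LC_ALL=C awk '
    # First file: DATA is one huge line; remember it.
    NR == FNR { data = $0; next }
    # Second file: every line of ArrayDataFile is one fixed pattern.
    {
        tmp = data
        # gsub returns the number of non-overlapping substitutions,
        # i.e. how many times the pattern occurs in the big string.
        print gsub($0, "", tmp)
    }
' /home/USER/Documents/DATA ArrayDataFile > GrepResultsOutputFile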

Is bash even the best way at all? I have heard it is slow.

The DATA file is just one huge string of numbers, 21313541321 and so on, as I have now clarified. The ArrayDataFile is the one with the 500,000 items; its lines are read into an array with readarray so that the DATA file can be searched for them one by one, with the results going line by line into a new file.

My specific question is about searching a LARGE STRING, not an indexed file or a file sorted line by line, and I do not want the lines on which my array elements from the ArrayDataFile were found, or anything like that. What I want is to count, for every array element taken from the ArrayDataFile, how many times it occurs in that large string of data, and to print each count on the same line as the corresponding element sits in the ArrayDataFile, so that I can keep everything together and do further operations. The only operation that really takes long is the actual searching of the DATA file with the code provided in this post. I could not use the solutions from the linked posts for my query, and my issue is not resolved by those answers; at least I have not been able to extrapolate a working solution for my sample code from those specific posts.
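
To make the desired per-line output concrete, here is a tiny made-up example (the real DATA string and pattern list are of course vastly larger):

DATA (one long line):            12112
ArrayDataFile:                   1
                                 2
                                 11
GrepResultsOutputFile (wanted):  3        ("1" occurs 3 times in 12112)
                                 2        ("2" occurs 2 times)
                                 1        ("11" occurs 1 time, counting non-overlapping matches as grep -o does)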

user289944
  • Do you need the lines to be stored in an array? – PesaThe Dec 14 '17 at 18:08
  • If you are asking about GrepResultsOutputFile, then yes, the search result output needs to keep the same indexing because I'm using indexes to keep track of everything. THX for reading!! – user289944 Dec 14 '17 at 18:12
  • For long processing, I have used perl, which gave much quicker processing in most situations – Nic3500 Dec 14 '17 at 18:16
  • And do you need to have the lines from `arraydatafile` stored in an array for some reason, or do you just need to get the `grepresult` file? – PesaThe Dec 14 '17 at 18:29
  • Would be nice, especially if it does not make the search any slower. – user289944 Dec 14 '17 at 18:36
  • Does PERL have a faster GREP or something? – user289944 Dec 14 '17 at 18:39
  • Picking the right data structure for the job makes a huge difference, just as much as the right tool for the job. If your file is unsorted, searching through it means a tool -- no matter what language it's in -- has to read from one end to the other (worst-case). If your file is sorted, that allows a bisection algorithm to seek through quickly; it doesn't make *every* tool faster, but it means *the right tool* will be far faster. – Charles Duffy Dec 14 '17 at 19:21
  • ...this is also why sometimes a SQLite database makes much more sense than a flat text file -- the database's indexes are built explicitly to make them fast and efficient to search. – Charles Duffy Dec 14 '17 at 19:22
  • ...so, you can use `join` or `comm` to find commonalities in two sorted files much more memory-efficiently than any kind of perl or grep approach that doesn't take advantage of ordering guarantees. – Charles Duffy Dec 14 '17 at 19:23
  • It looks like you're just doing `grep -f ArrayDataFile /home/$user/Documents/DATA` – William Pursell Dec 14 '17 at 19:24
  • @WilliamPursell, ...well, that would be the *efficient* way to write it (only one pass through the data file, even if it means needing enough memory to store `ArrayDataFile` in RAM). – Charles Duffy Dec 14 '17 at 19:25
  • 500K lines like the sample file are a few MB. Also, I see that the OP wants an "alternative" `grep -c -f` with `-c` applied per pattern, but `-cf` gives only the total, so we have to search for each pattern separately. `uniq` can make the target file smaller; I don't see how sorting alone could help. – thanasisp Dec 14 '17 at 19:44
  • @Charles Duffy I don't think that `grep -f file` posts are duplicates of this one. Here I see a request for the count per pattern in the file. – thanasisp Dec 14 '17 at 20:48
  • @thanasisp, quite plausibly so. An on-point title and terser explanation that makes the OP's actual intent clear at a glance would do this question a great deal of good -- as currently posed, I doubt folks clicking through from the title (thinking that this question's answers are likely to help them with their own problem) expect answers that will help them generate match counts, for example. – Charles Duffy Dec 14 '17 at 21:01
  • @user289944, what does it mean for an array element to "happen"? I find the introductory paragraph utterly unclear as to your intent. Most people who say they want to "search for" "an array" in a data file want to find matches in common between those two sources. StackOverflow questions aren't just for your benefit as the person asking the question -- they're for the benefit of everyone else using the knowledgebase who may have a similar problem. That means making it clear in the blurb (title and introductory summary quote) *exactly* what's asked, and what answers are expected to address. – Charles Duffy Dec 14 '17 at 21:03
  • (Also, "an array" in bash is a specific, dedicated data structure -- a very different thing than just a big string). – Charles Duffy Dec 14 '17 at 21:06
  • Yes, the description is not clear enough, and now I just learned that the DATA file has no newlines, which is a big change to the existing question. – thanasisp Dec 14 '17 at 21:06
  • Please feel free to @-notify me when you've made clarifying updates -- I'll be happy to either re-open the question or modify the duplicate list at that time. – Charles Duffy Dec 14 '17 at 21:08
  • @user289944 I think you should search the [tag:algorithm] tag, or ask a new question there, regardless of programming language. A C/Perl/awk program can be faster than running 500K grep processes, but a blind search will still be slow in your case; it's O(k·n). Perhaps an algorithm focused on the data structure used to store the big string can do it. Python with libraries for ML, NLP, etc. could be useful. – thanasisp Dec 15 '17 at 09:32
  • How much faster would SQLite be compared to AWK for this job? – user289944 Dec 15 '17 at 20:34
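
For reference, the `grep -f` approach William Pursell mentions could be adapted to per-pattern counts in a single pass over DATA, roughly as sketched below. This is untested; `counts.tmp` is just a hypothetical temporary file, and per thanasisp's caveat, with `-o` and many fixed patterns grep reports only one non-overlapping match per position, so the counts can differ from searching each pattern on its own when patterns overlap or are substrings of one another.

# One pass over DATA with all 500,000 fixed patterns at once,
# then map the per-pattern counts back onto the ArrayDataFile line order.
LC_ALL=C grep -oF -f ArrayDataFile /home/USER/Documents/DATA | sort | uniq -c > counts.tmp
awk 'NR == FNR { cnt[$2] = $1; next }                 # counts.tmp lines look like "<count> <pattern>"
     { if ($0 in cnt) print cnt[$0]; else print 0 }   # one count per ArrayDataFile line, 0 if never matched
' counts.tmp ArrayDataFile > GrepResultsOutputFile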

0 Answers