
I am new to processing large amounts of data and thought someone here might be able to help. The structure is as follows:

I have one parent folder called "100". Within this parent folder I have 10 subfolders, labeled PKA1, PKA2, etc., up to PKA10.

Within EACH of these I have 30 sub-subdirectories (relative to the initial parent folder):

1eV, 2eV, 3eV, up to 30eV

In each one of these folders I have a file called PKA.dump.

I would like to copy the 20th row of each PKA.dump file and put it into an array for easy processing. I am skeptical that such a feat is possible; it seems very complicated to me. I joined just so I could ask this question, since I figured the people here would have some of the best ideas for solving this problem.

My hope is to end up with 30 arrays, each with 30 different rows of data.

EDIT: Here is my attempt at the code, which I have tried editing to match my needs. How can I specify the 20th row?

find foo -type f -name PKA.dump |
while read file; do
    line=$(echo $file | sed 's/.*PKA.dump\([0-9]*\)$/\1/')
    sed -n -e "$line {p; q}" $file
done
Jack
  • Sorry, judging by the negative votes, I guess it can't be done. Sorry for asking a dumb question. – Jack Apr 03 '15 at 19:59
  • It is definitely possible. But SO is not a place to get the code written. The idea is for you to try and we will help you when you run into problems... – Peter Bowers Apr 03 '15 at 20:00
  • You'll probably not get a complete answer here, because you showed not much own work and you aren't stuck anywhere - you simply didn't start at all. The problem is easy. Pick a programming language (you didn't specify any!), then divide the problem into small subproblems. Learn how to read a file. Learn how to read only the 20th row. Learn how to do that for 30 directories. Learn how to do that for 10 directories. Basically, it's easy, just attack one problem at a time and once you have it, wrap your solution in another layer. – Petr Janeček Apr 03 '15 at 20:01
  • Could someone suggest a programming language that might be the best suited for such a problem? – Jack Apr 03 '15 at 20:02
  • Why does this say it cannot find PKA.dump, even though running just the find command finds all 30 of them? `find PKA1 -type f -name PKA.dump | sed -n '20p' PKA.dump` – Jack Apr 03 '15 at 20:32

2 Answers


Here's a PowerShell script that should do what you need:

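Get-ChildItem -Recurse -Filter PKA.dump | Sort-Object { [regex]::Replace($_.FullName, '\d+', { $args[0].Value.PadLeft(10, '0') }) } |
    ForEach-Object { Get-Content $_.FullName | Select-Object -Index 19 } > output.txt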

The output.txt file should contain the 20th line from each file named PKA.dump in the directory structure the script is run from.

Also, here's a simple C# example:

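using System.Collections.Generic;
using System.IO;

// Collects the 20th line of every PKA.dump found under C:\100.
List<string> data = new List<string>();

foreach (string filePath in Directory.EnumerateFiles(@"C:\100", "PKA.dump", SearchOption.AllDirectories))
{
    string[] lines = File.ReadAllLines(filePath);
    if (lines.Length >= 20) // skip files that have fewer than 20 lines
    {
        data.Add(lines[19]); // zero-based index for the 20th line
    }
}

string[] endResult = data.ToArray();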
madannes
  • I've never used c# before. How would I compile this? – Jack Apr 03 '15 at 20:19
  • @Jack: For quick development testing, you could use something like http://dotnetfiddle.net. If you're really wanting to do development, you'd probably want to download an IDE like Visual Studio. But I suspect you're just trying to do this on your local machine to solve the problem one time, is that correct? If so, if you're running Windows you could script something like this in PowerShell. – madannes Apr 03 '15 at 20:24
  • I will need to do this in the future as well. If I download visual studio and put the csharp file with the 100 parent directory and execute, will this code work as is? – Jack Apr 03 '15 at 20:26
  • @Jack: Learning an IDE like Visual Studio is an exercise in itself. Sounds like you're just needing this solution, not software development as a skill. I'd look at Powershell, perhaps http://stackoverflow.com/a/14759794/3507333 and http://stackoverflow.com/a/18848848/3507333 can help you solve your problem. – madannes Apr 03 '15 at 20:45
  • That works really well! The only issue now is the fact that it does not show them in order. I would want the results to be in order from 1eV, 2eV etc. Can that be done too? – Jack Apr 03 '15 at 21:05
  • Just need to pipe in a sort, like `Sort-Object $_`. I'll update the script in the answer. – madannes Apr 06 '15 at 12:55

Assuming you want to solve this in Java:

To read and copy large amounts of data, use the classes in the java.nio packages, which are designed for high-volume I/O.

Use a List or Queue to store the lines copied from each PKA.dump. There is no need to create so many arrays.

Steps:

  1. Read file content using java.nio package classes

  2. Write the file content into suitable data structures e.g. list/queue

  3. Proceed with your final processing.
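
The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for the asker's data: the class name `PkaLineCollector` and its helper methods are made up for this example, the root folder is assumed to be `100` (or the first command-line argument), and the dump files are assumed to be readable as UTF-8 text. It walks the PKA folders and the energy folders in numeric order, so that 2eV comes before 10eV, and keeps one list of rows per PKA folder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class; the name and methods are illustrative only.
public class PkaLineCollector {

    // Joins all digits in a directory name into a number, e.g. "PKA10" -> 10, "3eV" -> 3.
    static int numberIn(Path dir) {
        String digits = dir.getFileName().toString().replaceAll("\\D+", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }

    // Lists the immediate subdirectories of dir in numeric order (1eV, 2eV, ..., 10eV).
    static List<Path> sortedSubdirs(Path dir) throws IOException {
        try (Stream<Path> children = Files.list(dir)) {
            return children.filter(Files::isDirectory)
                    .sorted(Comparator.comparingInt(PkaLineCollector::numberIn))
                    .collect(Collectors.toList());
        }
    }

    // Returns the requested 1-based line, or an empty string if the file is shorter.
    static String lineOf(Path file, int lineNumber) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.skip(lineNumber - 1).findFirst().orElse("");
        }
    }

    // Builds one inner list per PKA folder, with one entry per energy folder.
    static List<List<String>> collect(Path root, String fileName, int lineNumber) throws IOException {
        List<List<String>> result = new ArrayList<>();
        for (Path pka : sortedSubdirs(root)) {
            List<String> rows = new ArrayList<>();
            for (Path energy : sortedSubdirs(pka)) {
                Path dump = energy.resolve(fileName);
                if (Files.isRegularFile(dump)) {
                    rows.add(lineOf(dump, lineNumber));
                }
            }
            result.add(rows);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Assumed root folder; pass a different path as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "100");
        for (List<String> rows : collect(root, "PKA.dump", 20)) {
            rows.forEach(System.out::println);
        }
    }
}
```

`Files.lines` reads lazily, so only the first 20 lines of each file are actually read rather than the whole dump. The nested `List<List<String>>` stands in for the separate arrays mentioned in the question; `result.get(0)` would hold the rows for PKA1 in energy order.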

Ranjan Kumar