1

I have a total of 1000 txt files filled with data. I have copied all of them into a single txt file and loaded it into my Python code as:

import numpy
data = numpy.loadtxt(r'C:\data.txt')  # raw string so the backslash is not treated as an escape

This is fine up to this point. Now, what I need is to select every 5th file from those 1000 txt files (i.e. 200 files) and load their combined content into a single variable. I am confused about how to do this.

Need help.

khan
  • How is the file structured? If it's one file per line, you could do `data[::5]` to select every 5th line. – Blender Sep 13 '12 at 04:54
  • it's not necessarily one file per line. – khan Sep 13 '12 at 14:16
  • Why do you need the data from every fifth file in a separate variable? – Daniel Sep 13 '12 at 14:22
  • Well, it can be in a single variable as well. Even if I get each file in a separate variable, I can merge them back into a single array (after proper reshaping). – khan Sep 13 '12 at 15:05

3 Answers

1

Why not load the files one at a time (assuming the files are data-0000 through data-0999):

import numpy

datasets = []
for file_number in range(1000):
    datasets.append(numpy.loadtxt("c:\\data-%04d" % (file_number,)))

Then you can get every fifth file with `every_fifth_file = datasets[::5]`. See also: Explain Python's slice notation
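If you want the result in a single array rather than a list, you can stack the selected pieces. A minimal sketch, assuming each file loads to a 2-D array with the same number of columns (the `data-%04d` name pattern follows the loop above; adjust it to the real filenames):

import numpy as np

# Load each file into its own array (name pattern taken from the answer above).
datasets = [np.loadtxt("c:\\data-%04d" % i) for i in range(1000)]

# Take every fifth file and stack the pieces row-wise into one array.
# This assumes all selected files have the same number of columns.
every_fifth = np.vstack(datasets[::5])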

David Wolever
  • David. Thanks, the logic you specified is good. However, in one case, I have 736 files in a single repository where each file is named file (n), where n is 1 to 736. How should I proceed? – khan Sep 13 '12 at 14:48
  • Got it. Updating the question. Thanks, David, wim, Pierre. – khan Sep 13 '12 at 17:49
1

It is crucial to know whether the files all have the same number of lines. If they do, you can proceed as you are and use a slicing trick. If they don't, you will need to load the files separately to achieve what you want - the positions where the files were delimited have already been lost in the merge.

Personally, I think David's suggestion is better in either case. But if you want to push ahead with slicing the big data array up, read on...

>>> import numpy as np
>>> n = 2  # number of lines in each file
>>> N = 5  # number of files
>>> x = np.eye(n*N, dtype=int)  # fake example data
>>> x
array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
>>> np.vstack([x[n*i:n*(i+1)] for i in range(N)[::2]])  # every second file
array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
>>> np.vstack([x[n*i:n*(i+1)] for i in range(N)[1::3]])  # every third file, skipping the first
array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
wim
  • Every file has an equal number of lines (more precisely, every file has an equal number of data values). Alright, let me use David's code. I'll let you guys know if it works. Thanks a million. – khan Sep 13 '12 at 14:15
0

By putting all your 1000 files into a single one, you simplified the operation of loading the data in NumPy (good point), but you lost the information about how many lines were in each of the initial files (bad point).

If you know that all your files have the same number of lines, great! Using N files, with m lines in each file, your array should have a length of N*m. So, data[:m] has the lines of your first file, data[m:2*m] those of your second file, and so forth. Your fifth file is data[4*m:5*m], your tenth data[9*m:10*m]. Of course, you could do some simple index arithmetic to find the lines you want. But we can use the fact that the files have the same number of lines: let's reshape the array!

If data has a shape of (N*m,d), where d is the number of columns of each file, you could reshape with:

data_reshaped = data.reshape(N,m,d)

or even simpler:

data.shape = (N, m, d)

Now, data is 3D. You simply access every 5th entry with data[::5], which will give you an array of shape (N/5, m, d). Its first element is your first file; use data[4::5] instead if you want to start from the fifth file.

Note that this trick works only if the files have the same number of lines. If they don't then you're stuck with finding the lines you want from a list of the number of lines in each file.
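For concreteness, here is a minimal runnable sketch of this reshape-and-slice approach, using synthetic data and made-up values for N, m and d (in the question's case N = 1000 and, per the comments, m = 10234):

import numpy as np

N, m, d = 10, 3, 4  # made-up values: number of files, lines per file, columns per line
data = np.arange(N * m * d, dtype=float).reshape(N * m, d)  # stands in for np.loadtxt(r'C:\data.txt')

data3d = data.reshape(N, m, d)   # one 2-D block per original file
every_fifth = data3d[::5]        # shape (N/5, m, d): files 1, 6, 11, ...
# every_fifth = data3d[4::5]     # start from the 5th file instead
combined = every_fifth.reshape(-1, d)  # back to a single 2-D array, if that is what you need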

Pierre GM
  • I think you are assuming there is only one element per line, which is not necessarily the case with `np.loadtxt`. – wim Sep 13 '12 at 12:07
  • @wim: nope, I assume `d` elements per line (hence the output of `loadtxt` being 2D of shape `(N*m,d)`). There should be the same number of elements per line for all the individual files, else the OP wouldn't be able to load them into a single file w/o crashing. – Pierre GM Sep 13 '12 at 12:56
  • Right, I was skimming too much and see now you wrote the array has a length N*m, not a shape N*m. – wim Sep 13 '12 at 13:42
  • every file has 10,234 lines of data, with each data value separated by a space. – khan Sep 13 '12 at 14:13
  • @khan Please edit your question to make that precise. If you have the same number of data values on each line for all your files, you're good to go! Reshaping and slicing as I showed is the way to go. – Pierre GM Sep 13 '12 at 14:30