26

I have multiple CSV files with values like this in a folder:

The GroupID.csv is the filename. There are multiple files like this, but the value ranges are defined in the same XML file. I'm trying to group them How can I do that?

UPDATE1: Based on BobHaffner's comments, I've done this

import pandas as pd 
import glob path =r'path/to/files' 
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=None)
    df['file'] = os.path.basename('path/to/files/'+file_)
    list_.append(df)
frame = pd.concat(list_)
print frame

to get something like this:

I need to group the values based on the bins from the XML file. I'd truly appreciate any help.

pam
  • 1,175
  • 5
  • 15
  • 28
  • 1
    There are a few questions here. I would suggest trying to get the csvs into one dataframe as a good first step. Check out the glob module and pandas read_csv() and concat() – Bob Haffner Jul 31 '15 at 01:33
  • @BobHaffner I can do that. On concatenating, it'd lose track of the file name. Wouldn't it? – pam Jul 31 '15 at 03:13
  • You could create an additional field in each dataframe with the file name. That would work, right? – Bob Haffner Jul 31 '15 at 03:21
  • then concatenate them – Bob Haffner Jul 31 '15 at 03:22
  • Yes, I'll do that. @BobHaffner – pam Jul 31 '15 at 03:24
  • Haven't added the file name yet, but without it, I've done this `import pandas as pd import glob path =r'path/to/file' allFiles = glob.glob(path + "/*.csv") frame = pd.DataFrame() list_ = [] for file_ in allFiles: df = pd.read_csv(file_,index_col=None, header=None) list_.append(df) frame = pd.concat(list_) print frame` to get something like this: `0 1 0 person1 24 1 person2 17 2 person3 98 3 person4 6 0 person2 166 1 person3 125 2 person5 172` – pam Jul 31 '15 at 03:27
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/84756/discussion-between-bob-haffner-and-pam). – Bob Haffner Jul 31 '15 at 03:31

1 Answers1

60

In order to bucket your series, you should use the pd.cut() function, like this:

df['bin'] = pd.cut(df['1'], [0, 50, 100,200])

         0    1        file         bin
0  person1   24     age.csv     (0, 50]
1  person2   17     age.csv     (0, 50]
2  person3   98     age.csv   (50, 100]
3  person4    6     age.csv     (0, 50]
4  person2  166  Height.csv  (100, 200]
5  person3  125  Height.csv  (100, 200]
6  person5  172  Height.csv  (100, 200]

If you want to name the bins yourself, you can use the labels= argument, like this:

df['bin'] = pd.cut(df['1'], [0, 50, 100,200], labels=['0-50', '50-100', '100-200'])

         0    1        file      bin
0  person1   24     age.csv     0-50
1  person2   17     age.csv     0-50
2  person3   98     age.csv   50-100
3  person4    6     age.csv     0-50
4  person2  166  Height.csv  100-200
5  person3  125  Height.csv  100-200
6  person5  172  Height.csv  100-200
firelynx
  • 30,616
  • 9
  • 91
  • 101