
I want to read multiple files at once. I have data in two files, as below:

data:

123.22.21.11,sid
112.112.11.1,john
110.11.23.23,jenny
122.23.21.13,ankit  

data1:

145.123.11.1, Joaquin  

I tried a couple of the answers from this link. Below is my code:

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(" ", "/home/cloudera/Desktop/sample/*"))))  

When I run this code, it gives me the output below:

>>> df
   123.22.21.11 145.123.11.1 Joaquin    sid
0  112.112.11.1          NaN     NaN    NaN
1  110.11.23.23          NaN     NaN    NaN
2  122.23.21.13          NaN     NaN    NaN
0  112.112.11.1          NaN     NaN   john
1  110.11.23.23          NaN     NaN  jenny
2  122.23.21.13          NaN     NaN  ankit

But I need the output as below, with the values in separate columns:

123.22.21.11,sid
112.112.11.1,john
110.11.23.23,jenny
122.23.21.13,ankit
145.123.11.1,Joaquin  

So how can I do it?

whatsinthename

2 Answers


Your issue is that pd.read_csv() by default treats the first row of each file as the column names, and concat aligns columns on those names. You can pass the keyword argument header=None through map by wrapping pd.read_csv with functools.partial.

import glob
import os
import pandas as pd
from functools import partial
mapfunc = partial(pd.read_csv, header=None)
df = pd.concat(map(mapfunc, glob.glob("/home/cloudera/Desktop/sample/*")))

Output:

              0         1
0  123.22.21.11       sid
1  112.112.11.1      john
2  110.11.23.23     jenny
3  122.23.21.13     ankit
0  145.123.11.1   Joaquin

You can see information on partial here: Using map() function with keyword arguments
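As an aside, partial can bind more than one keyword argument at once. A small self-contained sketch (io.StringIO buffers stand in for the files on disk, and the names= column labels are illustrative, not from the original post):

```python
from functools import partial
from io import StringIO

import pandas as pd

# Bind header=None (and, optionally, column names) so map() passes them to every read
reader = partial(pd.read_csv, header=None, names=["IP_Address", "Name"])

# In-memory buffers stand in for the two files on disk
files = [StringIO("123.22.21.11,sid\n112.112.11.1,john"),
         StringIO("145.123.11.1,Joaquin")]

# Each buffer is read with the bound kwargs, then the frames are stacked
df = pd.concat(map(reader, files), ignore_index=True)
print(df)
```

Because both reads get the same names=, concat lines the columns up instead of padding with NaN.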

Edit, per request:

It's not really pretty, but you can iterate over the directory and use a counter variable to process interval files at a time.

# Initialize variables
fpath = "C:/Users/5188048/Desktop/example/"
interval = 5
filenames = []

# Loop through the files in the directory
for i, j in enumerate(os.listdir(fpath)):

    # Append the filename to the list initialized above
    filenames.append(j)

    # Every interval'th file, process the accumulated batch
    if (i + 1) % interval == 0:

        # Use the first file to initialize the dataframe
        temp_df = pd.read_csv(fpath + filenames[0], header=None)

        # Loop through the remaining files in the batch
        for file in filenames[1:]:

            # Concatenate each additional file onto the dataframe
            temp_df = pd.concat([temp_df, pd.read_csv(fpath + file, header=None)],
                                ignore_index=True)

        # Do your manipulation here, for example setting column names
        temp_df.columns = ['IP_Address', 'Name']

        # Generate the outfile name and path
        out_file = fpath + 'out_file_' + str((i + 1) // interval) + '.csv'

        # Write the batch to CSV
        temp_df.to_csv(out_file, index=False)

        # Reset the batch for the next interval
        filenames = []

# Note: if the number of files is not a multiple of interval, any files left
# in filenames after the loop still need one final pass of the same steps.
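The same batching idea can be written more compactly by slicing the file list. A runnable sketch (it builds its own scratch directory and sample data, so the paths, file names, and batch size are all illustrative):

```python
import os
import tempfile

import pandas as pd

# Create a scratch directory with 4 small headerless CSV files (illustrative data)
tmpdir = tempfile.mkdtemp()
for n in range(4):
    with open(os.path.join(tmpdir, f"data{n}.csv"), "w") as fh:
        fh.write(f"10.0.0.{n},user{n}\n")

interval = 2  # process this many files per batch
names = sorted(os.listdir(tmpdir))

# Slice the file list into batches and write one output file per batch
for batch_no, start in enumerate(range(0, len(names), interval), start=1):
    batch = names[start:start + interval]
    temp_df = pd.concat(
        [pd.read_csv(os.path.join(tmpdir, f), header=None) for f in batch],
        ignore_index=True,
    )
    # Apply per-batch manipulation here, e.g. naming the columns
    temp_df.columns = ["IP_Address", "Name"]
    temp_df.to_csv(os.path.join(tmpdir, f"out_file_{batch_no}.csv"), index=False)
```

Slicing with range(0, len(names), interval) also handles a trailing partial batch, so no files are dropped when the count isn't a multiple of interval.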
krewsayder
  • Thanks for the answer. I have a few more queries. How can I assign column names? And if there is a header present in the data itself, how can I use those headers as the column names? I have to apply a regex pattern to particular columns and then save the result. – whatsinthename Apr 17 '19 at 16:29
  • If there are columns that are standard across the files you can rename the columns afterwards. I would not try to do more than this in 1 line of code. If you're going to change column names, I would break this up into a function or series of lines which will read separate dataframes and combine them later after adjusting the column names. – krewsayder Apr 17 '19 at 16:33
  • Yeah, I figured out how to do it. Just one more thing: can you elaborate on that one-liner? It works, but I'd like to understand it. – whatsinthename Apr 17 '19 at 16:35
  • One more doubt: suppose I have 10 files, and I have to process 5 at a time (apply the regex, save the result to one file), then repeat for the next 5, and so on. How can I do that? If you don't mind, could you modify your code? I just want it to be a little more dynamic. Your answer helped, but I'm trying to explore different ideas. I guess a for loop will be involved. – whatsinthename Apr 17 '19 at 16:45
  • 1
    You can use a loop in os.listdir() to iterate over files. You're definitely not doing that in one line with conditions based on file name. Not within best practices. – krewsayder Apr 17 '19 at 17:12
  • Can you edit your answer? That would be helpful, since I am new to the pandas API. – whatsinthename Apr 17 '19 at 17:14
  • I would need more information or examples of what you want to do and desired outputs. – krewsayder Apr 17 '19 at 17:28
  • I need to read 5 files, do some transformation, save the result in one file, then read the next 5 files and repeat, until I reach the last file in the directory. It's just to avoid a single large output file; that's why I want to break it into parts. – whatsinthename Apr 17 '19 at 17:30

I think it'd be easier and more readable to split it into a few steps. You also want to explicitly tell pandas that there are no headers by passing header=None to pd.read_csv.

import glob
import pandas as pd

# Get the list of files
files = glob.glob("/home/cloudera/Desktop/sample/*")
# Read the files into a list of dataframes
df_list = [pd.read_csv(f, header=None) for f in files]
# Stack all dataframes into one (change the parameters as needed)
df = pd.concat(df_list, ignore_index=True, sort=False)
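The same three steps as a self-contained sketch (io.StringIO buffers stand in for the files on disk, using the data from the question):

```python
from io import StringIO

import pandas as pd

# In-memory stand-ins for the two files on disk
files = [StringIO("123.22.21.11,sid\n112.112.11.1,john\n"
                  "110.11.23.23,jenny\n122.23.21.13,ankit"),
         StringIO("145.123.11.1,Joaquin")]

# Read each file without headers, then stack with a fresh index
df_list = [pd.read_csv(f, header=None) for f in files]
df = pd.concat(df_list, ignore_index=True, sort=False)
print(df)
```

With header=None every row is kept as data, and ignore_index=True gives the combined frame a clean 0..4 index instead of the repeated per-file indices from the question.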
m13op22