
I would like to:

  • Read hundreds of tab-delimited files into pandas DataFrames
  • Decide whether to apply a function based on FileNo
  • Apply the function to every element in selected columns
  • Append and concatenate all DataFrames into a single frame

Sample file:

ID     FileNo    Name     A1     A2     A3
1      0         John     a-b    b-a    a-a
2      0         Carol    b-b    a-b    a-b
[...]
500    0         Steve    a-a    b-b    a-b
501    0         Jack     b-a    b-a    a-b

True dimensions of each file: 2000 x 15000

Function: reverse the string.

flip_over = lambda x: x[::-1]
or
my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
df['A1'] = df['A1'].map(my_dict)   # likewise for A2 and A3
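
For reference, a minimal, self-contained sketch of both variants applied to a single column (the three-row frame below is made up for illustration; only the column name A1 comes from the sample file):

import pandas as pd

df = pd.DataFrame({'A1': ['a-b', 'b-a', 'a-a']})

# Variant 1: reverse each string character by character
df['A1_lambda'] = df['A1'].map(lambda x: x[::-1])

# Variant 2: look the reversed value up in a dict
my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
df['A1_dict'] = df['A1'].map(my_dict)

# Both new columns now hold ['b-a', 'a-b', 'a-a']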

What I currently have:

import pandas as pd

whether_to_flip = [7, 15, 23, 36, 48, 85]
frames = []
base_path = "/home/user/file_"

for i in range(0, 100):
    path = base_path + str(i) + ".tsv"
    df = pd.read_csv(path, sep="\t", header=None)
    df['FileNo'] = str(i)
    if i in whether_to_flip:
        # flip only the A1-A3 columns (positions 3-5)
        for j in range(3, 6):
            df[j] = df[j].map(my_dict)
    frames.append(df)

combined = pd.concat(frames, axis=0, ignore_index=True)

This currently takes hours to read and process, and I hit the memory limit when I increase the number of files to read.

I would appreciate any help to improve this code. In particular,

  • Is this the best/fastest way to apply the function?
  • Is this the best/fastest way to append and concatenate many DataFrames?

Thank you.

user3131944

1 Answer


First, I suggest you work out how much of the time is spent reading the CSVs versus inverting the strings.
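
A rough way to check this is to time the two steps separately on a single file with time.perf_counter (a sketch; the path below is a placeholder for one of your files):

import time
import pandas as pd

my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}

t0 = time.perf_counter()
df = pd.read_csv("/home/user/file_7.tsv", sep="\t", header=None)   # placeholder path
t1 = time.perf_counter()
df = df.replace(my_dict)
t2 = time.perf_counter()
print("read: %.2fs, replace: %.2fs" % (t1 - t0, t2 - t1))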

I can see a couple of things that can speed up the program:

Avoid the loop over the columns

You can use replace with my_dict:

if i in whether_to_flip:
    df = df.replace(my_dict)
#   df = df.replace({'A1': my_dict, 'A2': my_dict, 'A3': my_dict})

I think this should give a considerable improvement in performance.

List comprehension to avoid .append

This makes the syntax a bit more cumbersome, but could give a small efficiency gain.

def do_path(i):
    return base_path + str(i) + ".tsv"

frames = [pd.read_csv(do_path(i), sep="\t", header=None).assign(FileNo=str(i))
          if i not in whether_to_flip
          else pd.read_csv(do_path(i), sep="\t", header=None).assign(FileNo=str(i)).replace(my_dict)
          for i in range(0, 100)]
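
Then, as in your original code, a single pd.concat at the end combines the list (concatenating once is much cheaper than growing a DataFrame incrementally), assuming the comprehension result is stored in frames:

combined = pd.concat(frames, ignore_index=True)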
FLab
  • Thanks for the comment! My worry with replace was that it might suffer from transitive replacement overwriting, e.g. with dict = {'a-b': 'b-a', 'b-a': 'a-b'}: when it reaches a cell and sees 'a-b', it would replace it with 'b-a'; but while still in the same cell, it might then apply the second key and revert the cell to its original value. That's why I thought I should name the columns explicitly and map column-wise rather than letting replace apply iteratively. Perhaps I am wrong about this? – user3131944 Apr 10 '17 at 09:08
  • Re: the appending part: I was wondering whether it is preferable to concatenate the files into one big file, read that into a pandas DataFrame, and then process it (instead of reading each file, processing, and concatenating). – user3131944 Apr 10 '17 at 09:10
  • Timing: when reading only 10 files, it takes 12.67 seconds (measured with time.clock()) without inversion, and 33.28 seconds with inversion. – user3131944 Apr 10 '17 at 09:14
  • Concerning replace, this will not be an issue. Try this: pd.Series([1,2,3]).replace({1 : 2, 2 : 3, 3 : 4}). Also, in the commented part of the code I show how to use replace with explicit column names – FLab Apr 10 '17 at 09:14
  • Strangely enough, when I use the original for loop with map over the columns instead of replace, it takes 27.65 seconds. :/ – user3131944 Apr 10 '17 at 09:19