I would like to:
- Read hundreds of tab-delimited files into pandas DataFrames
- Decide, based on the file number (FileNo), whether to apply a function
- Apply the function to every element in selected columns
- Append and concatenate all DataFrames into a single frame
Sample file:
ID   FileNo  Name   A1   A2   A3
1    0       John   a-b  b-a  a-a
2    0       Carol  b-b  a-b  a-b
[...]
500  0       Steve  a-a  b-b  a-b
501  0       Jack   b-a  b-a  a-b
The actual dimensions of each file are 2000 × 15000.
Function: reverse each string, either directly:
flip_over = lambda x: x[::-1]
or via a lookup dictionary and Series.map:
my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
df[col].map(my_dict)
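
To make the two variants concrete, here is a minimal self-contained sketch of how I would apply either one to a single column (the toy frame and its column names are just for illustration):

import pandas as pd

df = pd.DataFrame({'A1': ['a-b', 'b-b'], 'A2': ['b-a', 'a-b']})

# Variant 1: dictionary lookup; values missing from the dict become NaN
my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
flipped = df['A1'].map(my_dict)

# Variant 2: vectorized string reversal via the .str accessor,
# which avoids calling a Python lambda on every element
flipped = df['A1'].str[::-1]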
What I currently have:
import pandas as pd

whether_to_flip = [7, 15, 23, 36, 48, 85]  # FileNos whose A-columns must be flipped
frames = []
base_path = "/home/user/file_"

for i in range(100):
    path = base_path + str(i) + ".tsv"
    df = pd.read_csv(path, sep="\t", header=None)
    df['FileNo'] = str(i)  # tag every row with its source file
    if i in whether_to_flip:
        for j in range(3, 6):  # columns 3-5 hold A1-A3 (no header row is read)
            df[j] = df[j].map(my_dict)
    frames.append(df)

combined = pd.concat(frames, axis=0, ignore_index=True)
This currently takes hours to read and process, and I hit the memory limit when I increase the number of files to read.
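
For reference, here is a sketch of the direction I have been considering. The block-wise replace, the fixed CategoricalDtype, and the per-file dtype conversion are my own untested assumptions about what might cut the runtime and memory, not something I have verified:

import pandas as pd
from pandas.api.types import CategoricalDtype

my_dict = {'a-b': 'b-a', 'a-a': 'a-a', 'b-b': 'b-b', 'b-a': 'a-b'}
whether_to_flip = [7, 15, 23, 36, 48, 85]
base_path = "/home/user/file_"

# Assumption: the A-columns only ever hold these four short strings, so a
# fixed category list keeps the dtype identical across files when concatenating.
ab_dtype = CategoricalDtype(['a-a', 'a-b', 'b-a', 'b-b'])

def load_one(i):
    # Read one file and flip its A-columns if its FileNo requires it.
    df = pd.read_csv(base_path + str(i) + ".tsv", sep="\t", header=None)
    cols = [3, 4, 5]  # A1-A3 when the file is read without a header row
    if i in whether_to_flip:
        # One vectorized replace over the whole block instead of a Python
        # loop over columns; values missing from my_dict are left untouched.
        df[cols] = df[cols].replace(my_dict)
    for c in cols:
        df[c] = df[c].astype(ab_dtype)  # shrink memory per file, before concat
    df['FileNo'] = i
    return df

# pd.concat accepts any iterable, so the intermediate list is unnecessary,
# and a single concat at the end avoids repeated copying.
combined = pd.concat((load_one(i) for i in range(100)), ignore_index=True)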
I would appreciate any help to improve this code. In particular,
- Is this the best/fastest way to apply the function?
- Is this the best/fastest way to append and concatenate many DataFrames?
Thank you.