I have a set of files and want to merge them into a single CSV file (one matrix). But there are some problems:
- The files are not CSV files
- The files do not have the same columns
- I do not know the columns
- There are a lot of files (over 100,000)
Inside every file are the names of the columns that should get the value 1 in the matrix. Example:
#File Alpha
dog
house
car
#File Beta
dog
cat
#resulting matrix
,dog,house,car,cat
Alpha,1,1,1,0
Beta,1,0,0,1
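To make the target concrete, the matrix for this small example could be written down directly with pandas like this (just an illustration of the desired output; in the real data I do not know the column names in advance):
import pandas

# the tiny example above, built by hand, only to show the target structure
rows = {
    "Alpha": ["dog", "house", "car"],
    "Beta":  ["dog", "cat"],
}
columns = ["dog", "house", "car", "cat"]
df = pandas.DataFrame(
    [[1 if col in words else 0 for col in columns] for words in rows.values()],
    index=list(rows),
    columns=columns,
)
print(df)
#        dog  house  car  cat
# Alpha    1      1    1    0
# Beta     1      0    0    1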
After some research I came up with the following idea:
import os
import pandas

df = pandas.DataFrame()
df['dir'] = 0

directory_0 = os.fsencode("direc0")
for file in os.listdir(directory_0):
    filename = os.fsdecode(file)
    # create the row for this file and record which directory it came from
    df.loc[filename, 'dir'] = 0
    with open(os.path.join("direc0", filename), "r") as f:
        for line in f:
            if len(line[:-1]) > 0:
                if line[:-1] in df.columns:
                    df.loc[filename, line[:-1]] = 1
                else:
                    # new word: add a column and mark it for this file
                    df[line[:-1]] = 0
                    df.loc[filename, line[:-1]] = 1
df.fillna(0, inplace=True)
df.to_csv("dataset.csv")
Some comments:
- There are two directories. The 'dir' column holds the number of the directory (0 or 1). I left out the code for the second directory because it is identical (just with a different directory number).
- line[:-1] is used because each line ends with a "\n" that I do not want.
Problem: This code does the job, but it takes far too long.
#Runtime for the above code
500 files after 17 seconds
1,000 files after 50 seconds
1,500 files after 111 seconds
2,000 files after 203 seconds
2,500 files after 324 seconds
3,000 files after 489 seconds
3,500 files after 695 seconds
4,000 files after 960 seconds
Is there a way to do this much faster (100,000 files in a couple of hours)?
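For reference, one direction I have been wondering about (an untested sketch only): read everything into plain Python dicts/sets first and build the DataFrame once at the end, instead of growing it cell by cell. It mirrors the code above; the second directory would be processed the same way with dir = 1.
import os
import pandas

rows = {}   # filename -> set of words found in that file
dirs = {}   # filename -> directory number (0 or 1)

directory_0 = "direc0"
for filename in os.listdir(directory_0):
    with open(os.path.join(directory_0, filename), "r") as f:
        words = {line.rstrip("\n") for line in f if line.rstrip("\n")}
    rows[filename] = words
    dirs[filename] = 0
# (the second directory would be read the same way, with dirs[filename] = 1)

# build the whole 0/1 matrix in one step instead of cell by cell
all_words = sorted(set().union(*rows.values())) if rows else []
data = {word: [1 if word in rows[name] else 0 for name in rows] for word in all_words}
df = pandas.DataFrame(data, index=list(rows))
df.insert(0, "dir", [dirs[name] for name in rows])
df.to_csv("dataset.csv")
Would something like this be the right direction?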
Thanks in advance :)
PS: Please excuse my English; it is not my first language.