
I have a set of files and want to merge them into a single CSV file (one matrix). But there are some problems.

  • The files are not CSV files
  • The files do not have the same columns
  • I do not know the columns
  • There are a lot of files (over 100,000)

Inside every file are the names of the columns that should get the value 1 in the matrix. Example:

#File Alpha
dog
house
car

#File Beta
dog
cat

#resulting matrix
,dog,house,car,cat
Alpha,1,1,1,0
Beta,1,0,0,1

After some research I came up with the following idea:

import pandas
import os

df = pandas.DataFrame()
df['dir'] = 0  # marks which directory (0 or 1) a file came from

directory_0 = os.fsencode("direc0")

for file in os.listdir(directory_0):
    filename = os.fsdecode(file)
    # append is not in-place, so the result has to be assigned back
    df = df.append(pandas.Series(name=filename, dtype=float))
    with open(os.path.join("direc0", filename), "r") as f:
        for line in f:
            if len(line[:-1]) > 0:  # skip blank lines
                if line[:-1] in df.columns:
                    df.loc[filename, line[:-1]] = 1
                else:
                    # new word: create the column, then mark this file
                    df[line[:-1]] = 0
                    df.loc[filename, line[:-1]] = 1
df.fillna(0, inplace=True)
df.to_csv("dataset.csv")

Some comments:

  • There are two directories. The 'dir' column holds the number of the directory (0 or 1). I cut the code for the second directory because it is identical (just with a different directory number).
  • line[:-1] strips the trailing "\n" that I do not want.

Problem: This code does the work, but it takes far too long.

#Runtime for the above code
500 files after 17 seconds
1,000 files after 50 seconds
1,500 files after 111 seconds
2,000 files after 203 seconds
2,500 files after 324 seconds
3,000 files after 489 seconds
3,500 files after 695 seconds
4,000 files after 960 seconds

Is there a way to do this way faster (100,000 files in a couple of hours)?

Thanks in advance :)

PS: Please excuse my English, it is not my first language.

  • [Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait May 26 '19 at 01:35
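
For reference, a minimal sketch of the pattern that comment recommends - collect the per-file rows in a list and concatenate once after the loop, using the example data from the question:

import pandas as pd

rows = []
for name, words in [("Alpha", ["dog", "house", "car"]), ("Beta", ["dog", "cat"])]:
    # one Series of 1s per file, indexed by the words it contains
    rows.append(pd.Series(1, index=words, name=name))

# a single concat at the end instead of appending inside the loop
df = pd.concat(rows, axis=1).T.fillna(0).astype(int)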

1 Answer


100,000 rows is not the kind of size that is easily dealt with in CSV files - you will lose a lot of performance - keeping this data structure in an SQL database would be far more efficient.
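
For illustration only, a minimal sqlite3 sketch of that idea - one (filename, word) pair per row instead of a dense matrix (the table name and sample rows here are made up):

import sqlite3

conn = sqlite3.connect("dataset.db")
conn.execute("CREATE TABLE IF NOT EXISTS occurrences (filename TEXT, word TEXT)")
# one row per (file, word) pair - no need to know all columns up front
conn.executemany(
    "INSERT INTO occurrences (filename, word) VALUES (?, ?)",
    [("Alpha", "dog"), ("Alpha", "house"), ("Beta", "dog"), ("Beta", "cat")],
)
conn.commit()
# queries such as "which files contain 'dog'?" stay cheap as the data grows
print(conn.execute("SELECT filename FROM occurrences WHERE word = 'dog'").fetchall())
conn.close()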

Also, what makes this inefficient is that Pandas' search for column names is linear: once you get to a few thousand words, the if line[:-1] in df.columns check, run for every line of every file, starts to take its toll.

Also, for ~100,000 files even os.listdir might be slow - pathlib.Path.iterdir might be faster.

Anyway, building this as a Python dictionary, which has O(1) key lookup, will be a lot faster. You can turn it into a DataFrame later, or write the information to a CSV file directly if that is really needed.

import pathlib

dir1 = pathlib.Path("direc0")  # Pathlib will also take care of filesystem filename encodings

data = {}

for filepath in dir1.iterdir():
    for line in filepath.open():
        line = line.strip()
        if line:  # skip blank lines, as the code in the question does
            data.setdefault(line, set()).add(filepath.name)

At this point you have all the data in memory, with the words as keys and, as values, the set of filenames containing each word.

To write the CSV file directly, this code should suffice - since all the data is in memory and the code avoids linear searches, it should be fast enough. The problem would be if the data does not fit in memory in the first place.

import csv

# invert the mapping: word -> filenames  becomes  filename -> words
transposed = {}
for word, filenames in data.items():
    for filename in filenames:
        transposed.setdefault(filename, set()).add(word)


with open("dataset.csv", "wt", newline="") as file:
    writer = csv.writer(file)
    headers = list(data.keys())

    writer.writerow(["file"] + headers)
    for filename, words in transposed.items():
        writer.writerow([filename] + [int(word in words) for word in headers])
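
If you prefer the DataFrame mentioned above over writing the CSV by hand, a minimal sketch along these lines should work (it assumes the transposed dict built in the previous snippet):

import pandas as pd

# build the 0/1 matrix straight from the filename -> words mapping
df = pd.DataFrame(
    [{word: 1 for word in words} for words in transposed.values()],
    index=list(transposed.keys()),
).fillna(0).astype(int)
df.to_csv("dataset.csv")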