2

My endgame is to run a machine learning algorithm: research code for text mining that is thus far unproven and unpublished. The text is already obtained; it was scraped from WARC files obtained from the Common Crawl. I'm in the process of preparing the data for machine learning, and one of the desirable analysis tasks is IDF (Inverse Document Frequency) analysis of the corpus before launching into the ML application proper.
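
For reference, the IDF weight I have in mind is the standard formula, where N is the total number of documents (here, speaker snippets) and df(t) is the number of documents containing term t:

    \mathrm{idf}(t) = \log\left(\frac{N}{\mathrm{df}(t)}\right)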

It's my understanding that for IDF to work, each file should represent one speaker or one idea: generally a short paragraph of ASCII text not much longer than a tweet. The challenge is that I've scraped some 15 million files. I'm using Strawberry Perl on Windows 7 to read each file and split on the tag contained in the document, so that each comment from the social media site in question falls into an element of an array (and in a more strongly typed language would be of type string).

From here I'm experiencing performance issues. I've let my script run all day, and it has only made it through 400,000 input files in a 24-hour period. From those input files it has spawned about 2 million output files, one file per speaker of HTML-stripped text produced with Perl's HTML::Strip module. As I look at my system, I see that disk utilization on my local data drive is very high: there is a tremendous number of ASCII text writes, each much smaller than 1 KB and each crammed into a 1 KB cluster of my local NTFS-formatted HDD.
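
The hot loop looks roughly like this (a simplified sketch rather than my exact script; the delimiter tag, paths, and filename scheme are placeholders):

    use strict;
    use warnings;
    use HTML::Strip;

    my $hs = HTML::Strip->new();
    my $n  = 0;    # running counter for output filenames

    # For each scraped HTML file: split on the comment delimiter tag,
    # strip the markup, and write one small text file per comment.
    sub process_file {
        my ($in_path, $out_dir) = @_;

        open my $in, '<', $in_path or die "Can't read $in_path: $!";
        my $html = do { local $/; <$in> };    # slurp the whole file
        close $in;

        # Placeholder delimiter -- the real tag depends on the site's markup.
        my @comments = split /<div class="comment">/i, $html;

        for my $comment (@comments) {
            my $text = $hs->parse($comment);
            $hs->eof;    # reset parser state between chunks
            next unless $text =~ /\S/;

            # One tiny output file per comment -- this is what hammers the disk
            # with sub-1 KB writes.
            my $out_path = sprintf "%s/%08d.txt", $out_dir, $n++;
            open my $out, '>', $out_path or die "Can't write $out_path: $!";
            print {$out} $text;
            close $out;
        }
    }

    process_file($_, 'C:/corpus/out') for glob 'C:/corpus/raw/*.html';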

Is it a worthwhile endeavor to stop the run, set up a MySQL database on my home system with a text field of perhaps 500-1000 characters maximum length, and then rerun the Perl script so that it slurps an input HTML file, splits it, HTML-strips it, and then prepares and executes a string INSERT against a database table?

In general: will switching from an output format that is a tremendous number of individual text files to one that is a tremendous number of database inserts be easier on my hard drive and faster to write out in the long run, due to some caching or RAM/disk-space utilization magic in the DBMS?
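
Concretely, I'm picturing something along these lines (a sketch only, assuming DBD::mysql is installed; the table name, column, credentials, and delimiter tag are placeholders), with one prepared INSERT reused for every comment and commits batched per input file rather than per row:

    use strict;
    use warnings;
    use DBI;
    use HTML::Strip;

    # Placeholder connection details -- substitute your own database/user/password.
    my $dbh = DBI->connect('DBI:mysql:database=corpus;host=localhost',
                           'corpus_user', 'secret',
                           { RaiseError => 1, AutoCommit => 0 });

    my $sth = $dbh->prepare('INSERT INTO comments (speaker_text) VALUES (?)');
    my $hs  = HTML::Strip->new();

    for my $in_path (glob 'C:/corpus/raw/*.html') {
        open my $in, '<', $in_path or die "Can't read $in_path: $!";
        my $html = do { local $/; <$in> };
        close $in;

        for my $comment (split /<div class="comment">/i, $html) {
            my $text = $hs->parse($comment);
            $hs->eof;
            next unless $text =~ /\S/;
            $sth->execute(substr $text, 0, 1000);    # cap at the column length
        }

        $dbh->commit;    # one transaction per input file, not per row
    }

    $dbh->disconnect;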

Micah
  • Sounds like a good excuse for some fun experimentation! Give it a try and see what happens? You might also want to have a look at SQLite as an alternative to MySQL. – Chris Turner Jan 16 '19 at 10:59
    I'd definitely suggest SQLite as well over MySQL. But depending on how you plan to select those, a key/value store might be better suited. Redis comes to mind first. There are very good implementations in Perl to work with it, and it's super simple to use. It also implements arrays. – simbabque Jan 16 '19 at 11:09
  • When you say you have scraped 15 million files, are these all HTML files with multiple data points inside them? Are you splitting into individual _tweets_ (for a lack of a better word), or do you build up a sort of log of what each person said, concatenating to the end of each person's file whenever there's a new tweet by that person? – simbabque Jan 16 '19 at 11:12
  • Sounds like you need the "learning experience" of implementing it both ways. Then pit them against each other -- optimizing each until you get a 'clear winner'. – Rick James Jan 16 '19 at 23:04

2 Answers

5

A file system can be interpreted as a hierarchical key-value store, and it is frequently used as such by Unix-ish programs. However, creating files can be somewhat expensive, depending also on the OS and file system you are using. In particular, different file systems differ significantly in how access times scale with the number of files within one directory. E.g. see "NTFS performance and large volumes of files and directories" and "How do you deal with lots of small files?": “NTFS performance severely degrades after 10,000 files in a directory.”

You may therefore see significant benefits by moving from a pseudo-database using millions of small files to a “real” database such as SQLite that stores the data in a single file, thus making access to individual records cheaper.
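
As a rough illustration (using DBD::SQLite, with a hypothetical comments table holding one stripped comment per row in a speaker_text column), the entire corpus then lives in a single file, and a per-term document frequency for IDF becomes a single query:

    use strict;
    use warnings;
    use DBI;

    # Hypothetical single-file corpus database with a table:
    #   comments(id INTEGER PRIMARY KEY, speaker_text TEXT)
    my $dbh = DBI->connect('dbi:SQLite:dbname=corpus.db', '', '',
                           { RaiseError => 1 });

    my $n_docs = $dbh->selectrow_array('SELECT COUNT(*) FROM comments');

    # Document frequency of one term (a crude LIKE match, for illustration only).
    my $term = 'example';
    my $df   = $dbh->selectrow_array(
        'SELECT COUNT(*) FROM comments WHERE speaker_text LIKE ?',
        undef, "%$term%");

    my $idf = $n_docs ? log($n_docs / ($df || 1)) : 0;
    printf "idf(%s) = %.4f over %d documents\n", $term, $idf, $n_docs;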

On the other hand, 2 million records are not that many, suggesting that file system overhead might not be the limiting factor for you. Consider running your software with a test workload and using a profiler or other debugging tools to see where the time is spent. Is it really the open() that takes so much time? Or is there other expensive processing that could be optimized? If there is a pre-processing step that can be parallelized, that alone may slash the processing time quite noticeably.
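
For example, a crude breakdown of wall-clock time across the read/split/write phases can be had with Time::HiRes from core Perl (a full profiler such as Devel::NYTProf will give a much finer-grained picture); the helper names in the comments are placeholders for your own routines:

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my %spent;    # accumulated seconds per phase

    # Run a code block and add its elapsed wall-clock time to a named phase.
    sub timed {
        my ($phase, $code) = @_;
        my $t0     = [gettimeofday];
        my @result = $code->();
        $spent{$phase} += tv_interval($t0);
        return @result;
    }

    # Inside the main loop (sketch; slurp/split_and_strip/write_records are placeholders):
    #   my ($html)  = timed('read',  sub { slurp($in_path) });
    #   my (@parts) = timed('split', sub { split_and_strip($html) });
    #   timed('write', sub { write_records(@parts) });

    END {
        printf "%-6s %8.2f s\n", $_, $spent{$_} for sort keys %spent;
    }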

amon
  • So I've spent the last couple of days running chkdsk and defragging the volume in question. MySQL is up and running on the same disk now, and I'm back to parsing 5,000 raw HTML files in about 30 seconds consistently. The issue had to do with the Windows index file becoming fragmented by the process of spamming the disk with a multitude of tiny files. – Micah Jan 18 '19 at 18:19
1

Wow!

A few years ago, we had massive performance problems with a popular CMS. Plain pages mostly performed well, but performance went downhill once inlined side content came into play.

So I wrote some ugly code to find the fastest approach. Note that your available resources set the limits!

1st) I used the timestamp to establish a directly addressable key. Every record gets its own set of flat files.

2nd) I set up a RAM disk. Make sure you have enough memory for your project!

3rd) For backups I used rsync, and for redundancy I compressed/extracted the RAM disk contents to/from a tar.gz archive.

In practice this is the fastest approach. Converting the timecode and generating the recursive folder structures is very simple, and so are read, write, replace, and delete operations.
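
As a rough sketch of what I mean by direct addressing (the RAM-disk drive letter and path scheme are placeholders), a timestamp maps straight onto a nested folder structure:

    use strict;
    use warnings;
    use File::Path qw(make_path);

    # Map a timestamp key to a nested path on the RAM disk (R:/ is a placeholder).
    sub record_path {
        my ($epoch) = @_;
        my ($sec, $min, $hour, $mday, $mon, $year) = gmtime($epoch);
        return sprintf 'R:/corpus/%04d/%02d/%02d/%02d%02d%02d.txt',
                       $year + 1900, $mon + 1, $mday, $hour, $min, $sec;
    }

    my $path = record_path(time);
    (my $dir = $path) =~ s{/[^/]+$}{};    # folder part of the path
    make_path($dir);                      # create the folder chain
    open my $fh, '>', $path or die "Can't write $path: $!";
    print {$fh} "payload goes here\n";
    close $fh;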

The final release resulted in processing times of:

  • PHP/MySQL: > 5 sec
  • Perl/HDD: ~ 1.2 sec
  • Perl/RamDisk: ~ 0.001 sec

Seeing what you are doing there, this construct may be usable for you, though I don't know the internals of your project.

The hard disk will live much longer, and your workflow can be optimized through direct addressing. The data is also accessible from other stages; that is, you can work on the same base from other scripts too, say data processing in R, a notifier from the shell, or anything else...

Buffering problems like MySQL's are no longer an issue, and your CPU no longer spins on no-ops.