Is there a process for munging data from many different formats in RapidMiner?

Question

I'm trying to help my team streamline a data ingestion process that is taking up a substantial amount of time. We receive data in multiple formats and with attributes arranged differently. Is there a way using RapidMiner to create a process that:

Processes files on a schedule that are dropped into a folder (this one I think I know but I'd love tips on this as scheduled processes are new to me)
Automatically identifies input filetype and routes to the correct operator ("Read CSV" for example)
Recognizes a relatively small number of attributes and arranges them accordingly. In some cases, attributes are named the same way as our ingestion format and in others they are not (phone vs phone # vs Phone for example)

The attributes we process mostly consist of name, id, phone, email, address. Also, in some cases names are split first/last and in some they are full name.

I recognize that munging files for such simple attributes shouldn't be that hard but the number of files we receive and lack of order makes it very difficult to streamline a process without a bit of automation. I'm also going to move to a standardized receiving format but for a number of reasons that's on the horizon and not an immediate solution.

I appreciate any tips or guidance you can share.

score 2 · Answer 1 · edited Apr 08 '19 at 11:43

Your question is relatively broad, so unfortunately I can't give you a complete answer. But here are some ideas on how I would tackle the points you mentioned:

For full process scheduling, RapidMiner Server is what you are looking for. There you can either define a schedule (e.g., check regularly for new files) or define a web service to trigger the process.
For selecting the correct operator depending on file type, you could use a combination of "Loop Files" and macro extraction to get the file type, and then use either "Branch" or "Select Subprocess" to switch between the different input routes.
The "Select Attributes" operator has some very powerful options for selecting specific subsets of attributes. In your example I would go for a regular expression akin to [pP]hone.* to catch the different spelling variants. The "Reorder Attributes" and "Rename by Replacing" operators are also very helpful here for creating a common naming schema.
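
To make the routing and renaming ideas more concrete, here is a minimal sketch of the same logic in plain Python rather than as a RapidMiner process. It only illustrates what the "Branch" / "Select Subprocess" routing and the regex-based renaming would do. The extension-to-operator table, the email pattern, and the function names are my own assumptions for illustration; only the [pP]hone.* pattern comes from the point above.

```python
import re
from pathlib import Path

# Hypothetical mapping from file extension to the reader you would route to.
# Extend it with whatever formats you actually receive.
READERS = {
    ".csv": "Read CSV",
    ".xlsx": "Read Excel",
}

# Hypothetical canonical attribute names and the patterns that map to them.
# The phone pattern is the one suggested above; the email one is an assumption.
CANONICAL_PATTERNS = {
    "phone": re.compile(r"[pP]hone.*"),
    "email": re.compile(r"[eE]-?mail.*"),
}


def reader_for(path):
    """Pick a reader label from the file extension (case-insensitive)."""
    extension = Path(path).suffix.lower()
    return READERS.get(extension, "unsupported")


def canonical_name(column):
    """Map one incoming column name to its canonical name, if a pattern matches."""
    name = column.strip()
    for canonical, pattern in CANONICAL_PATTERNS.items():
        if pattern.fullmatch(name):
            return canonical
    # Unknown columns are passed through unchanged so nothing is silently lost.
    return name


def normalize_header(columns):
    """Apply canonical_name to every column of a header row."""
    return [canonical_name(column) for column in columns]


if __name__ == "__main__":
    print(reader_for("incoming/customers.CSV"))
    print(normalize_header(["Phone", "phone #", "E-mail", "id"]))
```

Inside RapidMiner, the extension lookup corresponds to the macro you extract in "Loop Files", and each entry of the pattern table corresponds to one "Rename by Replacing" step.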

A general tip when building more complex process pipelines is to organize your different tasks into sub-processes and call them with the "Execute Process" operator. This makes everything much more readable and maintainable. A good error handling strategy is also important, so that unforeseen data formats don't break the whole run.
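
As a rough illustration of the error handling idea, again in plain Python and not as a RapidMiner process: handle each file on its own, and collect failures instead of letting one unexpected file stop the batch. The names process_all and handler are hypothetical, and handler stands in for whatever sub-process ingests a single file.

```python
def process_all(paths, handler):
    """Run handler on every path; return (succeeded, failed) without aborting early."""
    succeeded = []
    failed = []
    for path in paths:
        try:
            handler(path)
            succeeded.append(path)
        except Exception as error:
            # Keep the reason so the file can be inspected or re-routed later.
            failed.append((path, str(error)))
    return succeeded, failed


def example_handler(path):
    """Stand-in for a real ingestion step: only accepts .csv files."""
    if not path.lower().endswith(".csv"):
        raise ValueError("unsupported format: " + path)


if __name__ == "__main__":
    done, problems = process_all(["a.csv", "b.json", "c.CSV"], example_handler)
    print(done)
    print(problems)
```

Files that end up in the failed list can then be moved to a separate folder for manual review, so the scheduled run still finishes for everything else.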

For more elaborate answers and tips from many advanced RapidMiner users, I also highly recommend the RapidMiner community.

I hope this gives a good starting point for your project.

Hi Robert1er, I would recommend asking in the RapidMiner community (www.community.rapidminer.com) for specific solutions. The Unicorns there are always eager to help. — David, May 14 '19 at 08:19
