0

I have a very large csv file (213,265 rows) with many columns.

In one of those columns I have some emails seperated by commas. A trimmed down version of the csv file looks like this:

enter image description here

I would like to use Miller to seperate out those emails into their own rows, but also copy down ALL the other columns in the spreadsheet (many of which aren't shown here in this simple example).

Following on with this example, I would like to end with something like this. But keep in mind the real spreadsheet has many other columns before and after the email column:

enter image description here

Is that possible to do with Miller (or another similar tool)? What would the command look like?

TinyTiger
  • 1,801
  • 7
  • 47
  • 92

1 Answers1

2

The verb is nest. Starting from

company,address,email
anna,123 fake,"anna@ciao.it,annac@gfail.com,a@box.net"

and running

mlr --csv nest --explode --values --across-records --nested-fs "," -f email input.csv

you will have

+---------+----------+-----------------+
| company | address  | email           |
+---------+----------+-----------------+
| anna    | 123 fake | anna@ciao.it    |
| anna    | 123 fake | annac@gfail.com |
| anna    | 123 fake | a@box.net       |
+---------+----------+-----------------+

If you have a "bad" CSV, you could have some problems and you should try to clean it. A generic clean command could be this one:

mlr --csv -N clean-whitespace then remove-empty-columns then skip-trivial-records then cat -n sample.csv | mlr --csv nest --explode --values --across-records --nested-fs "," -f Email >output.csv

It removes empty rows, empty columns and wrong white spaces.

aborruso
  • 4,938
  • 3
  • 23
  • 40
  • 1
    Thank you, this works for the most part. But some rows are corrupted, with a second row being added with mismatching columns. I suspect it has something to do with my data and/or missing Miller commands. So I have created a sample file showing the input and output of 3 effected rows. Would you mind taking a look help me figure out why Miller isn't processing them correctly? Here is the csv sample file, the link will expire in 6 days: https://filebin.net/jepwvcyvjd4euzut – TinyTiger Dec 16 '21 at 22:30
  • 1
    @TinyTiger I have edit my answer. Please try my cleaning suggestion. Please try it using the sample file you have put here. – aborruso Dec 17 '21 at 07:10