I think this is a simple task, but I'm a biologist who only knows a teeny bit of code and after several days of trying to figure this out, I'm out of ideas.

Using terminal on a Mac. I have a CSV file that I want to split into separate files by row (162 rows) and I want to name the file by the content of the first and second column (genus_species). Then I need all 162 genus_species to be saved as HTML files.

I have only attempted the "splitting" part with Ruby (recommendation from StackExchange/overflow). Below are some of my attempts. They are frankensteins of helpful-ish forums, and after each I made a little comment on why it did not work.

Example HTML

<!DOCTYPE html>
<html><head>
<meta charset="UTF-8">
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script></head>
<body>
<h1><em><!-- Species name --></em> - <!-- Common name --></h1>
<h2>Status</h2>
<p></p>
<h2>Info</h2>
<p></p>
<h2>Time of year this bee is seen</h2>
<p></p>
<h2>Identification</h2>
<p></p>
<h3>Similar Species</h3>
<p></p>
<h2>Flowers</h2>
<p></p>
<h2>Sociality</h2>
<p></p>
<h2>Nest</h2>
<p></p>
<div id="refs" class="references">

--<br>More information:<br> <!-- <a href="https://bugguide.net/node/view/70932">Bug Guide</a> --></div>
</body></html>

More Info Based on Comments

Here are some lines copied from the text file:

Genus,species,Common name,Status,Info,Time of year this bee is seen,Identification,Similar Species,Flowers,Sociality,Nest,Bug Guide,Discover Life,Other,
Agapostemon,melliventris,Honey-tailed Striped-Sweat bee,Secure G5,Excavates into deep burrows in ground nests,March-December,Agapostemon males have black and yellow stripes on the abdomen. Females have a yellow band on the lower margin of the clypeus.,All other Agapostemon species,Wide variety of plants,Solitary,"Deep, underground excavation",https://bugguide.net/node/view/70932,https://www.discoverlife.org/20/q?search=Agapostemon+melliventris,https://explorer.natureserve.org/Taxon/ELEMENT_GLOBAL.2.928401/Agapostemon_melliventris,
Agapostemon,sericeus,Silky Striped Sweat Bee,Secure G5,"Not choosy about lawn, as long as flowers are present",April-October,Agapostemon males have black and yellow stripes on the abdomen. A. sericeus males have a tooth on its hind femur. Female has metallic green abdomen.,All other Agapostemon species,Wide variety of plants,Solitary,Ground-nester in loamy soils,https://bugguide.net/node/view/83023,https://www.discoverlife.org/mp/20q?search=Agapostemon+sericeus,https://www.sharpeatmanguides.com/sweat-bees,
Agapostemon,splendens,Brown-winged Striped-Sweat Bee,Secure G5,This is the most common Agapostemon found in the southeast region,April-October,Agapostemon males have black and yellow stripes on the abdomen. A. splendens have brown wings. The female abdomen is often somewhat bluish.,All other Agapostemon species,"Jacquemontia reclinata, wide variety of plants",Solitary,Ground-nester in sandy soils,https://bugguide.net/node/view/74478,https://www.discoverlife.org/mp/20q?search=Agapostemon+splendens,,

Updated code I've tried based on comments. This worked and I think it's heading in the direction I want, but it's hard to tell in the terminal window:

f = File.new("bee_key_fact_sheet .csv")
f.each_line { |line| puts line }
      Currently playing with some kind of File.write line to add here and then close? 

Attempt #1

file = File.open("bee_key_fact_sheet.csv")
    awk   
        '(NR==1){header=$0;next}
         (NR%l==2) {
         close(file); 
         file=sprintf("%s.%0.5d.csv",FILENAME,++c)
         sub(/csv[.]/,"",file)
         print header > file
            }
            {f.write}' 
                File.close

#AWK not recognized, asks to "display all possibilities (y/n)" I tried returning "y" and "yes" and both times it says my answer is not recognized

Attempt #2

file_data = File.read("bee_key_fact_sheet.csv").split 

#This works but splits by each comma

Attempt #3

file_data = File.foreach("bee_key_fact_sheet.csv") { |line| puts line}.split  

#This returned something slightly less messy than splitting by each comma but got this error message "undefined method `split' for nil:NilClass"

Attempt #4

bee_key_fact_sheet.csv.foreach('so1.csv', :headers => true, :col_sep => ",", :skip_blanks => true) do |row|
  id, name = row[0], row[1]
  unless (id =~ /#/)
    names = name.split
  end

#This returned nothing

halfer
amanda
  • 1
    Why is this question tagged with `bash`? – Cyrus Apr 06 '21 at 18:50
  • 2
    Please add a few lines of your CSV file and what your "splitted" file names and contents should be – Fravadona Apr 06 '21 at 18:52
  • Further to @Fravadona's comment, readers would find it very helpful if you were to show the entire content of an example CSV file (including the header line, if there is one), the example made as small as possible (in terms of the numbers of fields and lines) while retaining the structure of the file. Then provide an example of a file that is to be created. – Cary Swoveland Apr 06 '21 at 19:05
  • Hi y'all! I added a screenshot of the table. @cyrus sorry, I added that because some of the forums that have been helpful to me so far had it tagged (face palm). – amanda Apr 07 '21 at 16:14
  • A CSV is a text file. Open it with a basic text editor (like Notepad.exe on Windows) and put its first 3 lines here. Also, you want to create HTML files out of it, do you have an example? – Fravadona Apr 07 '21 at 16:33
  • @Fravadona Okay, I have added some lines (including the header) from the text file. I tried out the code from AndrejKostov and I think it's a step in the right direction – amanda Apr 07 '21 at 18:03
  • You're almost there :-) Now you need to show an example of the HTML that is to be generated. – Fravadona Apr 07 '21 at 20:19
  • I have that! It has been added to my post. – amanda Apr 08 '21 at 02:05
  • Then you got your answer. Don't forget to read the whole post because there are various things that are important to know. – Fravadona Apr 08 '21 at 03:07
  • @Fravadona This is super helpful/encouraging! I see what is going on: defining the template first, then going through the file. I used my file path = '/Users/amanda/Downloads/bee_key_fact_sheet.csv' but got the error message "TypeError: no implicit conversion of Hash into Integer" followed by a Ruby path with 'block in result' at the end. I read that this issue is supposed to be resolved as of version 2.6, so I'm going to see if ruby-build or ruby-install will help. Thank you for your note about tr("\0/",'') and about Ruby being the right language for this! Really encouraging :) – amanda Apr 08 '21 at 18:57

2 Answers

0

Can you try this? It should read the file and print it line by line:

f = File.new("name_of_file")
f.each_line { |line| puts line }

You can later save the lines as new files; more on that here: How to create a file in Ruby

0

Your example of CSV input (bee_key_fact_sheet.csv):

Genus,species,Common name,Status,Info,Time of year this bee is seen,Identification,Similar Species,Flowers,Sociality,Nest,Bug Guide,Discover Life,Other,
Agapostemon,melliventris,Honey-tailed Striped-Sweat bee,Secure G5,Excavates into deep burrows in ground nests,March-December,Agapostemon males have black and yellow stripes on the abdomen. Females have a yellow band on the lower margin of the clypeus.,All other Agapostemon species,Wide variety of plants,Solitary,"Deep, underground excavation",https://bugguide.net/node/view/70932,https://www.discoverlife.org/20/q?search=Agapostemon+melliventris,https://explorer.natureserve.org/Taxon/ELEMENT_GLOBAL.2.928401/Agapostemon_melliventris,
Agapostemon,sericeus,Silky Striped Sweat Bee,Secure G5,"Not choosy about lawn, as long as flowers are present",April-October,Agapostemon males have black and yellow stripes on the abdomen. A. sericeus males have a tooth on its hind femur. Female has metallic green abdomen.,All other Agapostemon species,Wide variety of plants,Solitary,Ground-nester in loamy soils,https://bugguide.net/node/view/83023,https://www.discoverlife.org/mp/20q?search=Agapostemon+sericeus,https://www.sharpeatmanguides.com/sweat-bees,
Agapostemon,splendens,Brown-winged Striped-Sweat Bee,Secure G5,This is the most common Agapostemon found in the southeast region,April-October,Agapostemon males have black and yellow stripes on the abdomen. A. splendens have brown wings. The female abdomen is often somewhat bluish.,All other Agapostemon species,"Jacquemontia reclinata, wide variety of plants",Solitary,Ground-nester in sandy soils,https://bugguide.net/node/view/74478,https://www.discoverlife.org/mp/20q?search=Agapostemon+splendens,,

In this CSV, all the lines (including the header) end with a comma, so the last column probably doesn't mean anything and is to be discarded.
Also, you have commas inside the data (in the double-quoted fields), so you'll need a real CSV parser to read the content of the file. BTW, you're right in choosing Ruby for this task because it includes a CSV parser in its standard library.
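
To make the point about quoted fields concrete, here is a minimal sketch comparing a naive split on commas with the CSV parser. The line used is a shortened, made-up one in the style of your file, not an actual row:

```ruby
require 'csv'

# A shortened, made-up line in the same style as the bee CSV:
# one quoted field containing a comma, plus the trailing comma.
line = 'Agapostemon,melliventris,"Deep, underground excavation",'

naive  = line.split(',')       # also splits inside the quoted field
parsed = CSV.parse_line(line)  # honours the double quotes

puts naive.inspect
puts parsed.inspect
```

The naive split cuts "Deep, underground excavation" into two pieces and keeps the quote characters, while the parser returns it as one field. The trailing comma shows up in the parsed result as a final nil field, which is the column to discard.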

Here's one way of reading your CSV (Edit: fixed the CSV::Row conversion for older Rubies):

require 'csv'
    
filepath = 'bee_key_fact_sheet.csv'
    
CSV.foreach(filepath, headers: true) do |row|
  genus, species = row[0], row[1]
  #data = row[0...-1] # NOTE: not sure about the Ruby version compatibility
  data = row.to_hash.values[0...-1]
    
  filename = "#{genus}_#{species}.txt".tr("\0/",'')
  filecontent = "  * #{data.join("\n  * ")}"
    
  puts "\n#{filename}:\n#{filecontent}"
end

About tr("\0/",''): The characters that are allowed in a filename depend on the filesystem. All the filesystems (that I know of) ban at least the NULL-byte and the slash characters, so I strip them (but you may want to strip a few more).
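
If you do want to strip a few more characters, something along these lines could work. This is only a sketch: safe_filename is a hypothetical helper, and the extra characters it removes (the ones reserved on Windows) are my assumption about what might be worth dropping, not a rule:

```ruby
# Hypothetical helper (not part of the code above): builds a file name
# from genus and species, dropping characters that commonly cause trouble.
def safe_filename(genus, species, ext = 'html')
  base = "#{genus}_#{species}"
  # Remove NUL and "/" (as above), plus characters reserved on Windows.
  # The extra list is an assumption; adjust it to your needs.
  base = base.gsub(/[\x00\/\\:*?"<>|]/, '')
  # Trim surrounding whitespace and turn inner whitespace into underscores.
  base = base.strip.gsub(/\s+/, '_')
  "#{base}.#{ext}"
end

puts safe_filename('Agapostemon', 'melliventris')
```

Whether spaces should become underscores is a matter of taste; spaces are legal in file names on macOS but awkward in the terminal.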

Question: What exactly is the expected HTML output? A table row?


Update: HTML generation

When generating content programmatically, it's fundamental to escape your data for the right format/language/context. In Ruby you can escape HTML with CGI.escapeHTML.
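
As a quick illustration of what the escaping does, here is a small sketch with a made-up field value (not taken from your CSV) that contains characters with a special meaning in HTML:

```ruby
require 'cgi'

# Made-up field value containing characters that are special in HTML.
raw = 'Males <5 mm> & females "larger"'

escaped = CGI.escapeHTML(raw)
puts escaped
# prints: Males &lt;5 mm&gt; &amp; females &quot;larger&quot;
```

Without this step, a stray < or & in one of the cells could break the generated page.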

Your example of HTML output:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
  </head>
  <body>
    <h1><em><!-- Species name --></em> - <!-- Common name --></h1>
    <h2>Status</h2>
    <p></p>
    <h2>Info</h2>
    <p></p>
    <h2>Time of year this bee is seen</h2>
    <p></p>
    <h2>Identification</h2>
    <p></p>
    <h3>Similar Species</h3>
    <p></p>
    <h2>Flowers</h2>
    <p></p>
    <h2>Sociality</h2>
    <p></p>
    <h2>Nest</h2>
    <p></p>
    <div id="refs" class="references">
      --
      <br>More information:
      <br> <!-- <a href="https://bugguide.net/node/view/70932">Bug Guide</a> -->
    </div>
  </body>
</html>

I'll make a few changes to the HTML:

  • Add a title to the page.
  • Remove MathJax, which seems unnecessary.
  • Convert the <h3> tag to <h2> because you use it only for "Similar Species". Changing it also permits the use of a loop while generating the HTML.
  • You have 2 links in the CSV that you don't use in the HTML: "Discover Life" and "Other". Don't you want to display them? I added the code for that ;-)

OK, first, you create a function that, given a CSV row, generates the corresponding HTML. Here I use ERB templating but you can do it directly with string literals (Edit: fixed the ERB.new arguments for Ruby < 2.6):

require 'cgi'
require 'erb'
    
def renderHTML row
  htmlsafe = row.each_with_object({}) { |(k,v),h| h[k] = CGI.escapeHTML v if v }
  template = <<-'EOF'
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title><%= "#{htmlsafe['Genus']} #{htmlsafe['species']}" %></title>
  </head>
  <body>
    <h1><em><%= "#{htmlsafe['Genus']} #{htmlsafe['species']}" %></em> - <%= htmlsafe['Common name'] %></h1>
<% for key in ['Status','Info','Time of year this bee is seen','Identification','Similar Species','Flowers','Sociality','Nest'] %>
    <h2><%= key %></h2>
    <p><%= htmlsafe[key] %></p>
<% end %>
    <div id="refs" class="references">
      --
      <br>More information:
<% for key in ['Bug Guide', 'Discover Life', 'Other'].select{ |k| htmlsafe[k] } %>
      <br><a href="<%= htmlsafe[key] %>"><%= key %></a>
<% end %>
    </div>
  </body>
</html>
EOF
  # On Ruby >= 2.6, use the keyword form instead:
  #   ERB.new(template, trim_mode: "<>").result(binding)
  # The positional form below is for older Rubies (deprecated since 2.6).
  ERB.new(template, nil, "<>").result(binding)
end
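
For comparison, here is a rough sketch of the string-literal approach mentioned above, limited to the heading/paragraph pairs. render_section is a hypothetical helper and the row is a made-up hash using column names from your header; it is not a full replacement for the ERB version:

```ruby
require 'cgi'

# Hypothetical helper: renders one heading/paragraph pair with plain
# string interpolation instead of ERB. nil values become empty paragraphs.
def render_section(title, text)
  "    <h2>#{CGI.escapeHTML(title)}</h2>\n" \
  "    <p>#{CGI.escapeHTML(text.to_s)}</p>\n"
end

# Made-up row data, using column names from the CSV header.
row = { 'Status' => 'Secure G5', 'Sociality' => 'Solitary', 'Nest' => nil }

body = %w[Status Sociality Nest].map { |key| render_section(key, row[key]) }.join
puts body
```

This is fine for small fragments, but once the whole page is involved a template tends to be easier to read and edit.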

Then you can call the previous function while reading each row of your CSV file:

require 'csv'
    
filepath = 'bee_key_fact_sheet.csv'
    
CSV.foreach(filepath, headers: true) do |row|
  filename = "#{row['Genus']}_#{row['species']}.html".tr("\0/",'')
  html = renderHTML row
  puts "\n# #{filename}\n#{html}"
  #File.write(filename, html)
end

Note: I commented out the File.write line that will create the HTML files.
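
Once the printed output looks right, the writing step could look roughly like this. It is only a sketch: the file names and page contents are made-up stand-ins for the real rendered pages, and a temporary directory is used so that it is safe to try:

```ruby
require 'tmpdir'

# Made-up file names and contents standing in for the real rendered pages.
pages = {
  'Agapostemon_melliventris.html' => '<!DOCTYPE html><title>A. melliventris</title>',
  'Agapostemon_sericeus.html'     => '<!DOCTYPE html><title>A. sericeus</title>'
}

# A temporary directory is used here only so the sketch is safe to run;
# a real run would point at a folder of your choice.
out_dir = Dir.mktmpdir('bee_pages')

pages.each do |filename, html|
  # File.write creates the file (or overwrites an existing one).
  File.write(File.join(out_dir, filename), html)
end

written = Dir.children(out_dir).sort
puts written
```

Writing into a dedicated folder keeps the 162 generated files apart from the CSV and the script.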

Fravadona