How do I pull data from multiple CSV files without multiple loops?

Question

I have two CSV files: "users" and "enrollments":

001.csv:

user_id,user_name,state
12345,test_account,active

002.csv:

course_id,user_id,state
67890,12345,active

I need to create one file like active_enrollments.csv:

course_id,user_name
67890,test_account

Without looping through the files multiple times, how can I parse these to generate the file active_enrollments.csv?

Here is what I have so far, but I am getting a lot of duplicates:

require 'csv'

CSV.open("active-enrollments.csv", "wb") do |csv|
  csv << ["course_id", "user_name", "user_id","course_name", "status"]
end
Dir["csvs/*.csv"].each do |file|
  #puts file
CSV.foreach(file, :headers => true) do |row|
if row['user_id'] && row ['course_id'] #finds enrollment csvs
  if row['state'] == "active" #checks for active enrollments
    state = row['state']
    course_id = row['course_id']
    user_id = row['user_id']
    Dir["csvs/*.csv"].each do |files|
      CSV.foreach(files, :headers => true) do |user|
        if user['user_name']
          if user_id == user['user_id']
            user_name = user['user_name']
            Dir["csvs/*.csv"].each do |file|
              CSV.foreach(file, :headers => true) do |courses|
                if course_id == courses['course_id']
                  course_name = courses['course_name']
                  CSV.open("active-enrollments.csv", "a") do |csv|
                    csv << [course_id, user_name, user_id, course_name, state]
                  end
                end 
              end
            end
          end
        end
      end
    end
  end
end
end
end

I know this is simple, but I can't seem to get it without looping through the files multiple times and generating lots of duplicates.

found it! http://stackoverflow.com/questions/11127948/pulling-a-value-from-one-csv-based-on-a-value-in-another?rq=1 — user2390847, Jul 09 '13 at 05:23
This is a great use for a database. I'd recommend looking into [SQLite](http://www.sqlite.org) as a database, and [Sequel](http://sequel.rubyforge.org/documentation.html) as an interface/ORM to SQLite. Look through the first couple documents for Sequel to see how easily it lets you set up a trivial database for this purpose. — the Tin Man, Jul 09 '13 at 05:41

score 2 · Answer 1 · answered Jul 09 '13 at 05:15

In lieu of using a database or a bunch of full-fledged models, I would suggest using a simple Hash as a lookup.

The following has not been tested and I left out all filters.

Separate the user from the enrollment csvs by name and iterate once over the user csvs to create a lookup by user_id.

users_csvs = Dir['csvs/users-*.csv']
enrollment_csvs = Dir['csvs/enrollment-*.csv']

users = {} 
users_csvs.each do |user_file|
  CSV.foreach(user_file, :headers => true) do |row|
    # Put in whatever data you will need later
    users[row['user_id']] = {:user_name => row['user_name'], :state => row['state']}
  end
end

consolidated_csv = []
enrollment_csvs.each do |enrollment_file|
  CSV.foreach(enrollment_file, :headers => true) do |row|
    user_id = row['user_id']
    if user = users[user_id]
      # Put in whatever you want from the two objects
      consolidated_csv << {:course_id => row['course_id'], :user_name => row['user_name']}
    end
  end
end

CSV.open("active-enrollments.csv", "wb") do |csv|
   csv << ['course_id', 'user_name']
   consolidated_csv.each { |row| csv << [row[:course_id], row[:user_name]] }
end

While it's a start, this would break down quickly on large CSV files in the GB range. The OP doesn't give us a hint about the size of the files currently used, or how big they could grow, but it's not very scalable because it's holding everything in RAM. — the Tin Man, Jul 09 '13 at 05:51
@theTinMan you are absolutely correct. There are plenty of more complicated strategies like fetching n rows at a time, serializing the lookup hash as json perhaps, etc — Matt Dressel, Jul 09 '13 at 06:25

score 1 · Answer 2 · answered Jul 09 '13 at 03:37

1

It would probably be easier to use Sqlite, pull the data from the CSV files, stick it in a temporary database, then query the db to generate the final output.

answered Jul 09 '13 at 03:37

Sean Redmond

3,974
22
28

score 0 · Answer 3 · answered Jul 09 '13 at 06:51

Here's some sample code showing how to do this using a simple SQLite database and the Sequel ORM:

require 'csv'
require 'sequel'

DB = Sequel.sqlite(File.dirname(__FILE__) + '/temp.db')

# user_id,user_name,state
# 12345,test_account,active
DB.create_table :csv1 do
  primary_key :id
  Integer :user_id
  String :user_name
  String :state
end

TABLE_001 = DB[:csv1]
CSV.foreach('001.csv', :headers => :first_row) do |row|
  TABLE_001.insert(
    :user_id   => row['user_id'],
    :user_name => row['user_name'],
    :state     => row['state']
  )
end

# course_id,user_id,state
# 67890,12345,active
DB.create_table :csv2 do
  primary_key :id
  Integer :course_id
  Integer :user_id
  String :state
end

# I need to create one file like active_enrollments.csv:
#
#     course_id,user_name
#     67890,test_account
TABLE_002 = DB[:csv2]
CSV.foreach('002.csv', :headers => :first_row) do |row|
  TABLE_002.insert(
    :course_id => row['course_id'],
    :user_id   => row['user_id'],
    :state     => row['state']
  )
end

CSV.open('active_enrollments.csv', 'w') do |csv_out|
  TABLE_001.each do |row_001|
    row_002 = TABLE_002.where(:user_id => row_001[:user_id]).first
    csv_out << [row_002[:course_id], row_001[:user_name]]
  end
end

After running, "active_enrollments.csv" contains:

67890,test_account

This is a very scalable solution.

Running this twice will get an error because Sequel will try to generate new tables in the database. Erase the file, or add exception handlers for both create_table blocks.

How do I pull data from multiple CSV files without multiple loops?

3 Answers3