I'm creating a project that tracks crowd-funding campaigns, so that the end user can analyze the data. Obviously I'm using a spider to scrape all of the details of each campaign at regular intervals, which I will then store in a database.
I'm just not sure how I should design the database. The issue is that campaigns can have lifespans of over a month, and I may be scraping each campaign multiple times a day to check for changes.
Lumping every campaign into a single table would be impractical since there will be thousands of campaigns and, theoretically, a single campaign might have hundreds and hundreds of rows if its details are constantly being updated. There will likely be tens and tens of columns too. Hence I've considered creating separate tables for each campaign.
At the same time, having thousands of tables also seems impractical, especially if the user wanted to compare a handful of different campaigns. In order to compare many campaigns, I would then have to query an indefinite number of tables.
I've never dealt with such complexity before. Does anybody have any idea how to approach this problem?
Potential Fields
CREATE TABLE campaign (
id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
campaign_url VARCHAR(255) NOT NULL,
campaign_phase VARCHAR(8) NOT NULL,
project_website VARCHAR(255) NOT NULL,
project_facebook_url VARCHAR(255) NULL,
project_linkedin_url VARCHAR (255) NULL,
project_twitter_url VARCHAR(255) NULL,
project_youtube_url VARCHAR(255) NULL,
product_title TEXT NOT NULL,
product_tagline TEXT NOT NULL,
product_phase VARCHAR(10) NULL,
product_overview TEXT NULL, # may be more columns derived from overview...
owner_name VARCHAR(255) NOT NULL,
owner_title VARCHAR(255) NOT NULL,
owner_description TEXT NULL,
owner_avatar_url VARCHAR(255) NULL,
owner_location VARCHAR(255) NOT NULL,
owner_campaign_count TINYINT NOT NULL,
owner_total_raised INT NOT NULL,
owner_other_campaign_urls TEXT NOT NULL, # this may have multiple values...
owner_contribution_count TINYINT NOT NULL,
owner_verified BIT NULL,
# info about team members...
# info about perks...
# info about/meta-analysis of campaign text, images, and videos...
# info about updates...
# info about backers...
crawled_on DATETIME NOT NULL
)
Notably, I'm thinking about isolating the sections that are represented by comments since many of those sections may or may not contain information in various quantities. Also, the fields with VARCHAR(255)
may need to be a different datatype.