I'm creating a project that tracks crowd-funding campaigns, so that the end user can analyze the data. Obviously I'm using a spider to scrape all of the details of each campaign at regular intervals, which I will then store in a database.

I'm just not sure how I should design the database. The issue is that campaigns can have lifespans of over a month, and I may be scraping each campaign multiple times a day to check for changes.

Lumping every campaign into a single table seems impractical, since there will be thousands of campaigns and, in theory, a single campaign might accumulate hundreds of rows if its details are constantly being updated. There will likely be tens and tens of columns, too. Hence I've considered creating a separate table for each campaign.

At the same time, having thousands of tables also seems impractical, especially if the user wants to compare a handful of campaigns: I would then have to query an indefinite number of tables.
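
For example, to compare just three campaigns under a table-per-campaign design, I'd have to build something like this by hand (table names are made up; columns are taken from my draft schema below):

# Table-per-campaign: every comparison is a hand-built UNION over N tables
SELECT 'campaign_123' AS campaign, crawled_on, owner_total_raised FROM campaign_123
UNION ALL
SELECT 'campaign_456', crawled_on, owner_total_raised FROM campaign_456
UNION ALL
SELECT 'campaign_789', crawled_on, owner_total_raised FROM campaign_789;

# Single shared table: the same comparison is one query
SELECT campaign_url, crawled_on, owner_total_raised
FROM campaign
WHERE campaign_url IN ('https://example.com/a', 'https://example.com/b', 'https://example.com/c');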

I've never dealt with such complexity before. Does anybody have any idea how to approach this problem?

Potential Fields

CREATE TABLE campaign (
  id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  campaign_url VARCHAR(255) NOT NULL,
  campaign_phase VARCHAR(8) NOT NULL,
  project_website VARCHAR(255) NOT NULL,
  project_facebook_url VARCHAR(255) NULL,
  project_linkedin_url VARCHAR(255) NULL,
  project_twitter_url VARCHAR(255) NULL,
  project_youtube_url VARCHAR(255) NULL,
  product_title TEXT NOT NULL,
  product_tagline TEXT NOT NULL,
  product_phase VARCHAR(10) NULL,
  product_overview TEXT NULL, # may be more columns derived from overview...
  owner_name VARCHAR(255) NOT NULL,
  owner_title VARCHAR(255) NOT NULL,
  owner_description TEXT NULL,
  owner_avatar_url VARCHAR(255) NULL,
  owner_location VARCHAR(255) NOT NULL,
  owner_campaign_count TINYINT NOT NULL,
  owner_total_raised INT NOT NULL,
  owner_other_campaign_urls TEXT NOT NULL, # this may have multiple values...
  owner_contribution_count TINYINT NOT NULL,
  owner_verified BIT NULL,
  # info about team members...
  # info about perks...
  # info about/meta-analysis of campaign text, images, and videos...
  # info about updates...
  # info about backers...
  crawled_on DATETIME NOT NULL
);

Notably, I'm thinking about isolating the sections represented by comments into separate tables, since each of those sections may contain varying amounts of information, or none at all. Also, the fields currently set to VARCHAR(255) may need different datatypes.
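
For instance, owner_other_campaign_urls (which may have multiple values) could become a child table instead of a delimited TEXT column. A sketch, with provisional names:

CREATE TABLE owner_other_campaign (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  campaign_id INT(6) UNSIGNED NOT NULL, # references campaign.id
  other_campaign_url VARCHAR(255) NOT NULL,
  FOREIGN KEY (campaign_id) REFERENCES campaign(id)
);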

oldboy
  • Do _not_ have thousands of "identical" tables. That Question has been asked repeatedly and universally put down. – Rick James Jul 18 '18 at 00:41
  • @RickJames it seems the only alternative is to have one table with hundreds of thousands or maybe even millions of rows?! wouldn't that be incredibly cumbersome – oldboy Jul 18 '18 at 03:29
  • Millions is a minor problem. Billions is when it gets more serious. Imagine the code mess you will have in constructing the queries based on the tablename. (You can't "bind" a tablename like you can a value.) – Rick James Jul 18 '18 at 05:57
  • @RickJames What do you mean exactly by 'bind'? Can't you do something like [this](https://stackoverflow.com/questions/27542617/dynamic-table-name-at-sql-statement) or [this](https://stackoverflow.com/questions/8809943/how-to-select-from-mysql-where-table-name-is-variable)? – oldboy Jul 18 '18 at 15:03
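
For context on the "bind" point: in a MySQL prepared statement, a ? placeholder can stand in for a value but not for an identifier such as a table name, which instead has to be concatenated into the statement text (as the linked answers do). A minimal illustration:

# A value can be bound as a parameter:
PREPARE by_id FROM 'SELECT * FROM campaign WHERE id = ?';
SET @id = 42;
EXECUTE by_id USING @id;

# A table name cannot; it must be spliced into the SQL string itself:
SET @tbl = 'campaign_123'; # hypothetical per-campaign table
SET @sql = CONCAT('SELECT * FROM `', @tbl, '` LIMIT 10');
PREPARE by_tbl FROM @sql;
EXECUTE by_tbl;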

2 Answers

(Partial answer)

Hundreds of columns like the ones you show are likely to be problematic. I recommend you consider splitting the table in a few ways.

  • "Team members" sounds like a list of people, not a single person. So, that is necessarily a separate table, joined 1:many. Similarly "images" sounds like an open-ended list.
  • Relatively static data should (perhaps) be segregated from the frequently-updated data; see the sketch after this list.
  • Figure out what your SELECTs will look like. If some of them look at, say, "product" columns but not "owner" columns, then it will probably be beneficial to split out one clump of columns or the other.
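
A rough sketch of that split, borrowing a few fields from your draft (all names and column choices are only illustrative):

# Relatively static data: one row per campaign
CREATE TABLE campaign (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  campaign_url VARCHAR(255) NOT NULL,
  product_title TEXT NOT NULL,
  owner_name VARCHAR(100) NOT NULL
);

# Frequently-updated data: one row per campaign per crawl
CREATE TABLE campaign_snapshot (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  campaign_id INT UNSIGNED NOT NULL,
  campaign_phase VARCHAR(8) NOT NULL,
  owner_total_raised INT NOT NULL,
  crawled_on DATETIME NOT NULL,
  FOREIGN KEY (campaign_id) REFERENCES campaign(id)
);

# Open-ended lists such as team members: a separate table, joined 1:many
CREATE TABLE team_member (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  campaign_id INT UNSIGNED NOT NULL,
  member_name VARCHAR(100) NOT NULL,
  FOREIGN KEY (campaign_id) REFERENCES campaign(id)
);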

Hundreds, even millions, of rows in a single table are not a problem. Hundreds of columns in a single table is treading on thin ice.

Blindly using VARCHAR(255) everywhere is likely to bite you.

If you are crawling different sites, it is extremely likely that the format and composition of the data you get will vary from site to site. (I've done this with news sites -- it is a full-time job.)

Bottom line: there is no simple, obvious answer to your question. You will have challenges.

Rick James
  • thanks for sharing. as i said in my answer, i'm likely going to isolate certain things, probably the sections i've commented out. also, i've deleted my comments in buddy's answer so that my question comment is a bit more visible, can you do the same? – oldboy Jul 18 '18 at 23:22
  • why are hundreds of columns an issue, whereas millions of rows are not? – oldboy Jul 18 '18 at 23:22

Stick with a normalized schema. One table for the content you describe is good, unless we are talking about some EXTREME amounts of data. IMHO, in the latter case MySQL would not be a very good choice anyway.

Keep it simple: design one table, choose correct datatypes, avoid NULL-able columns (you mentioned "tens and tens of columns", what for?) and index the data properly. You can't miss.
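
As an illustration of "index the data properly" (this assumes a hypothetical numeric campaign_id column identifying which campaign each crawled row belongs to, which the draft schema doesn't have yet):

# Composite index serving "history of a campaign, ordered by time" lookups
CREATE INDEX idx_campaign_time ON campaign (campaign_id, crawled_on);

# Comparison queries like this then stay fast even at millions of rows:
SELECT campaign_id, crawled_on, owner_total_raised
FROM campaign
WHERE campaign_id IN (123, 456, 789)
ORDER BY campaign_id, crawled_on;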

Boris Schegolev
  • @Anthony - Please give us an approximate `CREATE TABLE` for one of your thousands of tables. Otherwise, we are merely "hand waving". The presence or absence of big text columns matters; 100 columns matters; datatypes matter. – Rick James Jul 18 '18 at 16:30
  • can you expound upon what constitutes 'EXTREME amounts of data'? one column may end up having 3000+ words, 20000+ characters per row? one solution that just came to mind is that i can isolate certain columns in separate tables – oldboy Jul 18 '18 at 23:21
  • Databases have internal storage optimization mechanisms. Extra-wide rows are handled differently (example here: https://wiki.postgresql.org/wiki/TOAST). I don't know much about the latest MySQL, and it also depends a lot on storage engine, IIRC. As for table length, I've seen tables with ~250 million rows running OK on affordable hardware. – Boris Schegolev Jul 19 '18 at 07:36
  • thanks, but what is a rough guideline or limit for column width tho? – oldboy Jul 19 '18 at 18:20
  • `Varchar` limit is 65,535 chars in MySQL, so your data should fit fine. The other thing is that row length affects index sizes, memory consumption and possibly disk IO operations. You may consider using `text` instead of `varchar` for long rows. Some reading here: https://nicj.net/mysql-text-vs-varchar-performance/ – Boris Schegolev Jul 19 '18 at 20:15