0

I am learning SQL, and am trying to work with some sample data to understand various concepts.

Here are what the first few lines of the Yahoo! Search Marketing Advertiser Bid-Impression-Click data on competing Keywords data look like:

1   08bade48-1081-488f-b459-6c75d75312ae    2   2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a    100.0   2.0 0.0
29  08bade48-1081-488f-b459-6c75d75312ae    3   769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0   1.0 0.0
29  08bade48-1081-488f-b459-6c75d75312ae    2   769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0   1.0 0.0
11  08bade48-1081-488f-b459-6c75d75312ae    1   769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0   2.0 0.0
76  08bade48-1081-488f-b459-6c75d75312ae    2   769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0   1.0 0.0
48  08bade48-1081-488f-b459-6c75d75312ae    3   2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a    100.0   2.0 0.0
97  08bade48-1081-488f-b459-6c75d75312ae    2   2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a    100.0   1.0 0.0
123 08bade48-1081-488f-b459-6c75d75312ae    5   769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a    100.0   1.0 0.0
119 08bade48-1081-488f-b459-6c75d75312ae    3   2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a    100.0   1.0 0.0
73  08bade48-1081-488f-b459-6c75d75312ae    1   2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a    100.0   1.0 0.0

Please note that this data is only available from Yahoo! on request.

Here is the description of the data:

(1) "ydata-ysm-keyphrase-bid-imp-click-v1_0.gz" contains the following fields:

    0 day
    1 anonymized account_id
    2 rank
    3 anonymized keyphrase (expressed as list of anonymized keywords)
    4 avg bid
    5 impressions
    6 clicks

    Primary key of the data is a combination of fields date, account_id, rank and keyphrase. Average bid, impressions and
    clicks information is aggregated over the primary key.

I am trying to load this data from a flat file into a MySQL table using the load data infile statement.

In general, I think that I understand how to specify the data types of the columns, but am not sure about 3 anonymized keyphrase (expressed as list of anonymized keywords):
Question 1: Should I specify them as separate varchar columns, or is there a datatype that allows them to be stored together as "list" type?


Here is the query that I currently have to create the table to hold this data.

# create a new database
create database webscopedata;
show databases;
use webscopedata;

# create the table
drop table bidders;
create table bidders (
    daynum int, 
    account_id varchar(40), 
    rank int, 
    keyphrase1 varchar(100),
    keyphrase2 varchar(100),
    keyphrase3 varchar(100),
    keyphrase4 varchar(100),
    keyphrase5 varchar(100), 
    avg_bid double, 
    impressions double, 
    clicks double);

Now when I try to run the query

load data infile "ydata-ysm-keyphrase-bid-imp-click-v1_0" into table bidders fields terminated by "\t";

That is, when I specify the tab separator, I get the error:

Error Code: 1261. Row 1 doesn't contain data for all columns

This leads me to believe that the fields are not properly delimited when specifying the tab delimiter. So I try to specify multiple delimiters using the query:

load data infile "ydata-ysm-keyphrase-bid-imp-click-v1_0" into table bidders fields terminated by "' '\t";

Here I am trying to use multiple field terminators, but that does not seem to work, and I get an error:
Error Code: 1265. Data truncated for column 'daynum' at row 1

Question 2: How can I specify parsing of this data with multiple delimiters?

tchakravarty
  • 10,736
  • 12
  • 72
  • 116
  • Since row 1 seems to have a valid number 1 for the first column (the value 1), it could probably be an error with the data parsing. You might want to check if you are using the right delimiters, because the error `1265` can also occur when value is `null`. – psiyumm Jan 17 '15 at 11:49
  • Regarding question 1, refer to [this](http://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad) discussion. And also considering the specification `Primary key of the data is a combination of fields date, account_id, rank and keyphrase`, I think keyphrase should be a single column. Which brings us to question 2, since you know the pattern of the data file, rather than specifying multiple delimiters, just merge the subset of columns in your parser. – psiyumm Jan 17 '15 at 11:56
  • @activatedGeek 1. How can I check if I am using the right delimiters? 2. I don't understand what you mean by "merge the subset of columns in your parser". – tchakravarty Jan 17 '15 at 12:20
  • 1. For checking the right delimiters, you should print some subset of data to the console (or some other output stream like file). In your case since the error is in row 1 itself, it should be easier to check that your parser has valid values for row 1 right away. 2. Since, I think `keyphrase` should be a single value, rather than the 5-split you have done, I would simply merge columns 4-8 to a single column (after the first pass of parsing the data file) to satisfy the original `Primary Key` constraint. – psiyumm Jan 17 '15 at 12:52
  • @activatedGeek 1. The lines in question were printed exactly as you mention using a call to `head`. How can we establish that MySQL is parsing it right and what the delimiters are? – tchakravarty Jan 17 '15 at 12:59

1 Answers1

0

You can reformat the data file using perl regular expression, and then try to import it using mysql load data feature.

Use following perl script to reformat the data file

#!/usr/bin/perl -w
use strict;
open(FH, "<", "ydata-ysm-keyphrase-bid-imp-click-v1_0");
my ($line, $data) = ();
while(<FH>){
$line = $_;
$line = qq($line);
$line=~s/(\s+)/"/g;
$line=~s/(\s*)"$//;
$data .= $line."\n";
}
close FH;

open(FH, ">", "DataFile");
print FH $data;
close FH;

And then execute following statement in mysql

load data infile "pathToYourDataFile/DataFile" into table bidders fields terminated BY '"';

Hope, this solves your problem of importing data into mysql.

  • Well, the dataset is 8.5 GB, and I am not sure that this is a scalable option. I would really like to understand how to do this using SQL only. – tchakravarty Jan 18 '15 at 04:29