5

Amazon S3 has a new feature called select from which allows one to run simple SQL queries against simple data files - like CSV or JSON. So I thought I'd try it.

I created and uploaded the following CSV to my S3 bucket in Oregon (I consider this file to be extremely simple):

aaa,bbb,ccc
111,111,111
222,222,222
333,333,333

I indicated this was CSV with a header row and issued the following SQL:

select * from s3object s

...which worked as expected, returning:

111,111,111
222,222,222
333,333,333

Then I tried one of the provided sample queries, which failed:

select s._1, s._2 from s3object s

...the error message was "Some headers in the query are missing from the file. Please check the file and try again.".

Also tried the following, each time receiving the same error:

select aaa from s3object s
select s.aaa from s3object s
select * from s3object s where aaa = 111
select * from s3object s where s.aaa = 111
select * from s3object s where s._1 = 111

So anytime my query references a column, either by name or number, either in the SELECT or WHERE clauses, I get the "headers in the query are missing". The AWS documentation provides no follow up information on this error.

So my question is, what's wrong? Is there an undocumented requirement about the column headers? Is there an undocumented way to reference columns? Does the "Select From" feature have a bug in it?

Josh Lee
  • 171,072
  • 38
  • 269
  • 275
Ken Krueger
  • 1,005
  • 14
  • 26

3 Answers3

4

I did the following:

  • Created a file with the contents you show above
  • Entered S3 Select on the file, and ticked File has header row
  • Changed no other settings

These queries did NOT work:

select s._1, s._2 from s3object s
select * from s3object s where s._1 = 111

The reason they didn't work is that the file contains headers, so the columns have actual names.

These queries DID work:

select aaa from s3object s
select s.aaa from s3object s
select * from s3object s where aaa = 111 (Gave empty result)
select * from s3object s where s.aaa = 111 (Gave empty result)

When I treated the last two queries as strings, they returned the row as expected:

select * from s3object s where aaa = '111'
select * from s3object s where s.aaa = '111'
John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
  • They are returning the row as expected, but the header row is missing. Is there any way that the query results contain the header rows as well as the filtered data – user3807691 May 22 '20 at 05:59
0

Getting back to this, on a whim I decided to replace this sample file with a new identical example file, and now I do not encounter the problem. In fact, I'm unable to replicate the problem that I originally posted.

I have a few theories: character encoding, end-of-line character, and the possible presence of an extra line in my original file, but I have been unable to re-create the original issue.

I've tried different editors to create the source file, I've tried unix vs windows end of line characters, I've tried extra line on the end, I've tried upper case vs lower case column headers, and I've tried different regions. Everything works now, so I'm completely mystified as to why it did not work in the first place.

Life goes on. Thanks to everyone for your efforts.

Ken Krueger
  • 1,005
  • 14
  • 26
  • FWIW: I had the same issue. I have a header row. But I unchecked the 'file has header row', and things worked better. It returned the header in the result, but at least it worked. I suspect (with no evidence) that the backend had some kind of failure parsing the header row and exited early. – cmonkey May 05 '20 at 21:11
0

s3 select treats everything as string. The query

select * from s3object s where cast(aaa as int) = 111
select * from s3object s where cast(s.aaa as int) = 111

should return the expected results if the header rows are checked/unchecked appropriately.

buddemat
  • 4,552
  • 14
  • 29
  • 49
arp5
  • 169
  • 10