
I have a bunch of files stored in S3 in CSV format (no header), but in many cases there is only one record per file. For example:

"6ad0638e-e7d3-4c33-8271-5b3972c6155f",1532653200000

When I run the crawler, it creates a separate table for each file.

Question(s):

  • How do I force the crawler to use a single (already created) table?
  • Do I need to create a custom classifier? If my field names are rId and ts, can somebody give me an example Grok pattern?

Thanks

Vladimir Ilic

3 Answers


I contacted AWS Support and here are the details:

The problem is caused by the files that have a single record. By default, the Glue crawler uses LazySimpleSerde to classify CSV files. LazySimpleSerde needs at least one newline character to identify a CSV file, which is its limitation.

The right way to solve this issue is to use a Grok pattern.

In order to confirm this, I tested some scenarios on my end with your data and a custom pattern. I created three files: file1.csv with one record, file2.csv with two records, and file3.csv with one record. Also, a proper Grok pattern should account for the newline as well, with $, i.e.

%{QUOTEDSTRING:rid:string},%{NUMBER:ts:long}$
  1. I ran the crawler without any custom pattern on all the files, and it created multiple tables.
  2. I edited the crawler, added the custom pattern, and re-ran the same crawler, but it still created multiple tables.
  3. I created a new crawler with the Grok pattern and ran it on file1 and file2; it created only one table with the proper columns.
  4. I added file3 and ran the crawler again; it only updated the same table, and no new tables were created.
  5. I tested scenarios 3 and 4 using partitions in S3 (as you might have partitioned data) and still got one table.

Based on my observations, it seems the problem may be caused by the crawler caching the older classification details, so I'd suggest creating a new crawler and pointing it to a new database in the catalog.
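For reference, here is a minimal boto3 sketch of the setup described above: a custom Grok classifier using the pattern from the support response, attached to a brand-new crawler that writes to a new catalog database. The classifier, crawler, and database names, the IAM role, and the S3 path are placeholders, not values from the original post.

    import boto3

    glue = boto3.client("glue")

    # Custom Grok classifier using the pattern from the support response.
    glue.create_classifier(
        GrokClassifier={
            "Name": "single-record-csv",  # placeholder name
            "Classification": "csv",
            "GrokPattern": "%{QUOTEDSTRING:rid:string},%{NUMBER:ts:long}$",
        }
    )

    # New catalog database, so the crawler does not reuse older classification details.
    glue.create_database(DatabaseInput={"Name": "csv_single_record_db"})

    # New crawler that uses the custom classifier and writes to the new database.
    glue.create_crawler(
        Name="single-record-csv-crawler",                        # placeholder name
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",   # placeholder role
        DatabaseName="csv_single_record_db",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/data/"}]},  # placeholder path
        Classifiers=["single-record-csv"],
    )

    glue.start_crawler(Name="single-record-csv-crawler")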

Vladimir Ilic
  • I know it's no compensation because you still have to deal with them, but if possible the files should not be generated that way. It's a POSIX convention that a text file has one or more lines and that a line is terminated by a newline character, so these files don't technically contain any lines; see https://stackoverflow.com/q/729692/1335793 – Davos Aug 16 '19 at 04:42

I have the same "issue". The documentation (Adding Classifiers to a Crawler) says:

Built-In CSV Classifier

To be classified as CSV, the table schema must have at least two columns and two rows of data.

It would be great if there were a way to force it to understand a single row.

  • I found that too. That is why I'm wondering whether building a custom classifier will help... But I cannot find a single complete example of a custom classifier :-(. Right now I'm focusing on converting the files to JSON (which works in the case of a single record) and/or adding a header to all the CSV files. – Vladimir Ilic Aug 02 '18 at 14:53
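For what it's worth, a minimal local pre-processing sketch of the two workarounds mentioned in that comment: converting a single-record CSV file to JSON, or prepending a header row. The function names are made up for illustration; only the field names rId and ts come from the question.

    import csv
    import json

    def csv_to_json(src_path, dst_path):
        """Rewrite a headerless single-record CSV file as one JSON object."""
        with open(src_path, newline="") as src:
            rid, ts = next(csv.reader(src))
        with open(dst_path, "w") as dst:
            json.dump({"rId": rid, "ts": int(ts)}, dst)

    def add_header(src_path, dst_path):
        """Prepend an 'rId,ts' header so the built-in CSV classifier sees two rows."""
        with open(src_path) as src:
            body = src.read().rstrip("\n")
        with open(dst_path, "w") as dst:
            dst.write("rId,ts\n" + body + "\n")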

Did you try setting "Create a single schema for each S3 path." to true in the crawler configuration? If this option is set, the crawler doesn't create a new schema but updates the existing one. Please refer to the link below for more details.

https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-grouping-policy
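For completeness, a small sketch of what that setting looks like when applied programmatically; the crawler name is a placeholder, and the Configuration JSON is the grouping policy described on the linked documentation page.

    import json

    import boto3

    glue = boto3.client("glue")

    # Equivalent of ticking "Create a single schema for each S3 path" in the console.
    glue.update_crawler(
        Name="my-existing-crawler",  # placeholder
        Configuration=json.dumps({
            "Version": 1.0,
            "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        }),
    )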

Sumit Saurabh