9

I'm trying to load data from a CSV file stored in GCS into BigQuery. The CSV file is UTF-8 encoded and contains 7 columns. I've specified these columns in the schema (all strings, nullable) and I've checked the contents of the CSV file, which seem fine.

When I try to load the data I get the following error:

Too many errors encountered. (error code: invalid) gs://gvk_test_bucket/sku_category.csv: CSV table references column position 1, but line starting at position:1750384 contains only 1 columns. (error code: invalid)

The weird thing is that the file only contains 680228 rows.

When I check the "allow jagged rows" option, the table is generated, but only the first column is filled, and it contains the entire comma-separated string.

Can someone help me?

Example row

119470,Fashion,Fashion Own,Menswear,Menswear Brands Other,Formal Shirts,Long Sleeve Shirts
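
For reference, an equivalent load with the Python client would look roughly like this (the dataset, table, and column names are placeholders, since the real schema isn't shown):

from google.cloud import bigquery

client = bigquery.Client()

# Seven nullable STRING columns, as described above; names are illustrative.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[bigquery.SchemaField("col%d" % i, "STRING", mode="NULLABLE")
            for i in range(1, 8)],
)

load_job = client.load_table_from_uri(
    "gs://gvk_test_bucket/sku_category.csv",
    "my_dataset.sku_category",  # placeholder dataset.table
    job_config=job_config,
)
load_job.result()  # raises, carrying the row-level errors, if the load fails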

gvkleef
    Have you checked line 1750384 in the file? If you set number of allowed errors to 1, does it load? Can you share a sample row? – Graham Polley Feb 22 '17 at 11:30
  • Yes, that row is empty; the file only contains 680228 rows. There's no option to tell it how many rows the file contains, right? I've set the number to 1 and then to 100, but it returns the same error for another row. – gvkleef Feb 22 '17 at 11:55
  • Here is a sample row: 119470,Fashion,Fashion Own,Menswear,Menswear Brands Other,Formal Shirts,Long Sleeve Shirts – gvkleef Feb 22 '17 at 11:56
  • Best to clarify the question by editing the question itself in this case, not adding comments. A cut/paste of a row or two into a *code sample* in the editor will work well. BTW, the last time something like this happened to me, I had a comma inside an actual field, which gummed up the works. – Paulb Feb 22 '17 at 12:59
  • Thanks for your comment. I've set the number of allowed errors to 1000000000 and after that the table is filled with data. Is there a way to avoid errors on empty rows after the last record? Furthermore, I also saw that, even with Allow Quoted Newlines set to true, quoted values containing commas are not included in the table. – gvkleef Feb 22 '17 at 14:51
  • Did you say the row is empty? Even if it is empty, you will need the correct number of delimiters to tell BQ that each field in the row is empty. BQ thinks your row only has one column if there is absolutely nothing in it. – REdim.Learning Feb 22 '17 at 15:11
  • position:1750384 means character number 1750384, not a line number. You can jump to it in e.g. VS Code using the vscode-position extension. – kondziorf Jan 19 '22 at 12:47

4 Answers

12

For me, it was an issue with embedded newline and carriage return characters; try replacing those special characters. I replaced them using the code below, which resolved the loading part.

# Replace embedded carriage returns / newlines in string cells only,
# so numeric or NaN cells don't raise an AttributeError.
df = df.applymap(lambda x: x.replace("\r", " ").replace("\n", " ") if isinstance(x, str) else x)

I used a lambda over the whole DataFrame because I didn't know which columns were strings in my case. If you know which columns contain strings, you can do the replacement column-wise instead, as in the sketch below.

Try replacing the characters and it should work for you as well.
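
A column-wise version might look like this (the DataFrame and column name here are made up for illustration; substitute your own string columns):

import pandas as pd

# Toy frame standing in for the one read from the CSV.
df = pd.DataFrame({"sku": [119470, 119471],
                   "category": ["Fashion\r\nOwn", "Menswear"]})

# Vectorised replacement per string column; str.replace leaves NaN untouched.
for col in ["category"]:
    df[col] = (df[col]
               .str.replace("\r", " ", regex=False)
               .str.replace("\n", " ", regex=False))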

user8336233
3

You cannot have empty rows in your file without delimiters, otherwise BigQuery (and pretty much every other ingestion engine) will think the row has just one column.

For example, this will fail on the empty row (row 2) with the error you describe:

119470,Fashion,Fashion Own,Menswear,Menswear Brands Other,Formal Shirts,Long Sleeve Shirts

119471,Fashion,Fashion Own,Womenswear,Womenswear Brands Other,Formal Shirts,Long Sleeve Shirts

This will succeed:

119470,Fashion,Fashion Own,Menswear,Menswear Brands Other,Formal Shirts,Long Sleeve Shirts
,,,,,,
119471,Fashion,Fashion Own,Womenswear,Womenswear Brands Other,Formal Shirts,Long Sleeve Shirts
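
If stray empty lines are the problem, one way to drop them before loading is a quick pre-processing pass (a minimal sketch; adjust the file names to yours):

# Copy the CSV, skipping lines that are empty or whitespace-only.
with open("sku_category.csv", encoding="utf-8") as src, \
        open("sku_category_clean.csv", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip():
            dst.write(line)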
Graham Polley
  • Thanks for your comment. The row numbers in the errors were bigger than the maximum row number in my Excel file, so there were no empty rows within the dataset. It seems BQ does not stop reading data after the last row in my CSV. – gvkleef Feb 28 '17 at 13:11
  • Sorry, I don't follow you. What do you mean? – Graham Polley Feb 28 '17 at 13:13
  • That, for example, my CSV contains 80000 rows and I get errors on rows 81000, 82500, etc. – gvkleef Mar 03 '17 at 05:50
  • So, fix those lines where you are missing the delimiters. – Graham Polley Mar 03 '17 at 06:17
  • @gvkleef I think there is a bug in the BigQuery response. The line number in the error is in fact the number of characters before the error. – Marl Jan 19 '18 at 11:06
1

You either have an empty line

119470,Fashion,Fashion Own,Menswear,Menswear Brands Other,Formal Shirts

119472,Fashion,Fashion Own,Menswear,Menswear Brands Other,Formal Shirts

Or a line wrapped in quotes (the whole row is then parsed as a single quoted field)

119470,Fashion,Fashion Own,Menswear,Menswear Brands Other,Formal Shirts
"119471,Fashion,Fashion Own,Menswear,Menswear Brands Other,Formal Shirts"
119472,Fashion,Fashion Own,Menswear,Menswear Brands Other,Formal Shirts

I think there is a bug in the BigQuery response: the "line" number in the error is in fact the number of characters before the error, i.e. a character offset into the file.
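
Given that, you can seek straight to the reported offset to inspect the offending spot (a sketch; 1750384 is the position from the error in the question):

# Open in binary mode and jump to the byte offset reported by BigQuery.
with open("sku_category.csv", "rb") as f:
    f.seek(1750384)
    print(f.read(120))  # show the bytes around the failing line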

Marl
  • how is this different to my answer? – Graham Polley Jan 20 '18 at 02:52
  • The question is also about the fact that the error indicates a row number higher than the total number of rows in the dataset. I tried to give a more complete answer with more detail. – Marl Jan 23 '18 at 10:52
0

In my case, I was facing this issue due to an extra blank line after the last row of data. Try removing that extra line and it should work; a quick way to do that is sketched below.
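
One way to trim trailing blank lines in place (a minimal sketch; back the file up first):

# Strip trailing newlines/carriage returns, then restore a single final newline.
with open("sku_category.csv", "r+", encoding="utf-8") as f:
    data = f.read().rstrip("\r\n") + "\n"
    f.seek(0)
    f.write(data)
    f.truncate()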

schan