1

I am loading data into bigquery, the data should be in order when I upload it from csv, but once loaded it gets shuffled in bigquery, as below,

actual table: This how it should be in bigquery

id name location
1 aaa bbbb
2 ccc dddd
3 eeee
4 fff gggg

uploaded table: but in the below format it got updated once I uploaded from csv to bigquery

id name location
3 null eeee
2 ccc dddd
1 aaa bbbb
4 fff gggg

Even if I use pandas.sort_values(by='id'), I am not getting the table in correct order, it gets reshuffled, no idea on what basis.

What changes/steps to be done to get the actual table uploaded into the bigquery as it is in the same format and ignoring null values while loading data into the bigquery?

Thanks in advance

Beginner
  • 143
  • 1
  • 12

2 Answers2

2

A very important aspect of BigQuery is that

If an ORDER BY clause is not present, the order of the results of a query is not defined (ref here)

If you want to display the rows as you wish, simply add an ORDER BY clause:

ORDER BY `id`

As for ignoring the NaN values, just filter you dataframe with .notna() before using to_gbq.

Cylldby
  • 1,783
  • 1
  • 4
  • 17
  • I need those rows with null values, so I filled those null values with 0 and loaded the data into bigquery, but still those rows with 0 values comes infront. If I do, order by 'id' with or without ASC, it sort the table in ascending order based on both id and name column which is not at all acceptable. I have .csv file with null values filled with 0, if I loaded the file into bigquery table, expectation is whole data should be in order, unfortunately data gets shuffled(rows with 0 values comes infront.), pls suggest me with alternative way. – Beginner Mar 09 '22 at 06:19
  • 1
    `BigQuery` by default will not order the data, please see the documentation link provided in the answer. So, its not optimal to do sorting before inserting records into BigQuery. If you want to apply `order` to your records apply it when you pull out those records from BigQuery. In other words, you have to see your BigQuery Tables as a big pool of disorganize data which you are able pick up fast and that in the end you will have to make sense out of it. This behaviour is due the nature of BigQuery [query processing](https://cloud.google.com/bigquery/docs/query-overview#query_processing). – Betjens Mar 09 '22 at 09:36
  • 1
    @Beginner the `ORDER BY` needs to be made in BigQuery, not Pandas. Try to run ` SELECT * FROM ORDER BY id ` in BigQuery. It will be the best you can get – Cylldby Mar 09 '22 at 13:35
1

To complement Cylldby answer. If you still need that raw order, you can just add a column named index and populated with your index data from your dataframe into your table ( or working temp table )

df['index_col'] = df.index

Note: if there are multiple files, you can also set the custom index of your dataframe chunks. For more info there is this good answer which explores index reordering.

So, on you BigQuery table you can just order by index_col and preserver the raw order that got uploaded from your file.

Betjens
  • 1,353
  • 2
  • 4
  • 13