
I want to make one big dataset from 17 different csv files. Each contains around 200k rows and the same columns. What I want to do is create one single dataframe so I can work with it later.

I tried looking at SQL joins, but it seems they require an ID column to join on, and the datasets don't have single IDs.

thebluephantom
  • you don't want to join on a column? so do you want to append data from all csv files? check this [link](https://stackoverflow.com/questions/37332434/concatenate-two-pyspark-dataframes) – pyofey Sep 07 '19 at 00:19
  • I saw that question earlier but didn't pay attention to one of the answers... it seems `df_concat = df.union(df2)` will do; it just takes one dataset at a time. tks @pyofey – Tacio Degrazia Sep 07 '19 at 00:31
  • `from functools import reduce; from pyspark.sql import DataFrame; dfs = [df1, df2, df3]; df = reduce(DataFrame.unionAll, dfs)` This did the trick... now I have one big dataset. – Tacio Degrazia Sep 07 '19 at 00:38
  • Possible duplicate of [How to import multiple csv files in a single load?](https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load). Instead of using join or union, simply read all the csv files at once to create a single dataframe. – Shaido Sep 07 '19 at 07:35
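
A minimal PySpark sketch of what the last comment suggests, i.e. reading all the csv files in one call instead of unioning them one by one (the folder path and the header/schema options are assumptions, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One read over a glob pattern; Spark stacks files with the same columns
# into a single dataframe ('csv_folder/' is an assumed location)
df = spark.read.csv('csv_folder/*.csv', header=True, inferSchema=True)

# Sanity check: should be roughly 17 * 200k rows
print(df.count())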

1 Answer


If you want to create one big dataframe out of the 17 csv files, which all have the same columns:

  1. Use glob() to list your files.
  2. Use a generator expression to read the files.
  3. Use the concat() method to combine them into a single dataframe.
  4. Write the new dataframe to a new csv file.

Try this:

import pandas as pd
from glob import glob

# Collect the paths of all csv files in the folder
all_csv_files = glob('csv_folder/*.csv')

# Read each file and stack them into one dataframe, renumbering the index
df = pd.concat((pd.read_csv(csv_file) for csv_file in all_csv_files), ignore_index=True)

# Save the combined data to a single csv file
df.to_csv('final_csv.csv', index=False)
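
For a quick sanity check on the combined dataframe (the row figure is just the one from the question):

print(len(df))    # expect roughly 17 * 200,000 rows
print(df.shape)   # (rows, columns) of the combined dataframe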
Arkistarvh Kltzuonstev