
I want to make one big dataset from 17 different csv files. Each contains around 200k rows and the same columns. What I want to do is create one single dataframe so I can work with it later.

I tried looking at SQL joins, but it seems they require an ID column to join on, and the datasets don't have single IDs.

thebluephantom
  • you don't want to join on a column? so do you want to append data from all csv files? check this [link](https://stackoverflow.com/questions/37332434/concatenate-two-pyspark-dataframes) – pyofey Sep 07 '19 at 00:19
  • I saw that question earlier but didn't pay attention to one of the answers... it seems `df_concat = df.union(df2)` will do; it just takes one dataset at a time. tks @pyofey – Tacio Degrazia Sep 07 '19 at 00:31
  • `from functools import reduce; from pyspark.sql import DataFrame; dfs = [df1, df2, df3]; df = reduce(DataFrame.unionAll, dfs)` This did the trick... now I have one big dataset. – Tacio Degrazia Sep 07 '19 at 00:38
  • Possible duplicate of [How to import multiple csv files in a single load?](https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load). Instead of using join or union, simply read all the csv files at once to create a single dataframe. – Shaido Sep 07 '19 at 07:35
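
A minimal PySpark sketch of what the last comment suggests, i.e. reading all the csv files in one call instead of unioning them one by one (the folder path and the header/schema options are assumptions, not from the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One read over a glob pattern; Spark stacks files with the same columns
# into a single dataframe ('csv_folder/' is an assumed location)
df = spark.read.csv('csv_folder/*.csv', header=True, inferSchema=True)

# Sanity check: should be roughly 17 * 200k rows
print(df.count())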

1 Answer


If you want to create one big dataframe out of the 17 csv files, which all have the same columns:

  1. Use glob() to list your files.
  2. Use a generator expression to read the files.
  3. Use the concat() method to combine them into a single dataframe.
  4. Write the new dataframe to a new csv file.

Try this:

import pandas as pd
from glob import glob

# Collect the paths of all csv files in the folder
all_csv_files = glob('csv_folder/*.csv')

# Read each file and stack them into one dataframe, renumbering the index
df = pd.concat((pd.read_csv(csv_file) for csv_file in all_csv_files), ignore_index=True)

# Save the combined data to a single csv file
df.to_csv('final_csv.csv', index=False)
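
For a quick sanity check on the combined dataframe (the row figure is just the one from the question):

print(len(df))    # expect roughly 17 * 200,000 rows
print(df.shape)   # (rows, columns) of the combined dataframe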
Arkistarvh Kltzuonstev