I have a list of 100 dataframes that I am trying to merge into a single dataframe, but am unable to do so. All the dataframes have different columns and are of different lengths. To give a bit of context and background, each dataframe consists of 4 sentiment scores (calculated using VaderSentiment). The dataframes look like this:

USER 1 DATAFRAME

| created_at      | positive score of user 1 tweets | negative score of user 1 tweets | neutral score of user 1 tweets | compound score of user 1 tweets |
|-----------------|---------------------------------|---------------------------------|--------------------------------|---------------------------------|
| 23/2/2011 10:00 | 1.12                            | 1.3                             | 1.0                            | 3.3                             |
| 24/2/2011 11:00 | 1.20                            | 1.1                             | 0.9                            | 2.5                             |

USER 2 DATAFRAME

| created_at      | positive score of user 2 tweets | negative score of user 2 tweets | neutral score of user 2 tweets | compound score of user 2 tweets |
|-----------------|---------------------------------|---------------------------------|--------------------------------|---------------------------------|
| 25/3/2011 23:00 | 0.12                            | 1.1                             | 0.1                            | 1.1                             |
| 26/3/2011 08:00 | 1.40                            | 1.5                             | 0.4                            | 1.5                             |
| 01/4/2011 19:00 | 1.80                            | 0.1                             | 1.9                            | 3.9                             |

All the dataframes have one column in common, namely created_at. What I am trying to achieve is to merge all the dataframes on the created_at column, so that I get a single created_at column plus all the other columns from all the other dataframes. The result should have **400** columns of sentiment scores along with one created_at column.
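To illustrate on two made-up frames, this is the kind of outer merge I am after (the column names here are placeholders, not my real ones):

```python
import pandas as pd
from functools import reduce

df1 = pd.DataFrame({'created_at': ['23/2/2011 10:00', '24/2/2011 11:00'],
                    'pos_user1': [1.12, 1.20]})
df2 = pd.DataFrame({'created_at': ['24/2/2011 11:00', '25/3/2011 23:00'],
                    'pos_user2': [0.12, 1.40]})

# an outer merge keeps every timestamp once and fills the gaps with NaN
merged = reduce(lambda x, y: pd.merge(x, y, on='created_at', how='outer'),
                [df1, df2])
print(merged)
#         created_at  pos_user1  pos_user2
# 0  23/2/2011 10:00       1.12        NaN
# 1  24/2/2011 11:00       1.20       0.12
# 2  25/3/2011 23:00        NaN       1.40
```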

My code is as follows:

```python
import pandas as pd
import glob
import numpy as np
import os
from functools import reduce


path = r'C:\Users\Desktop\Tweets'
allFiles = glob.glob(path + "/*.csv")
list = []
frame = pd.DataFrame()

count=0

for f in allFiles:
    file = open(f, 'r')
    count=count+1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)
    #print(df)
    list.append(df)

frame = pd.concat(list)
print(frame)
```

The problem is that when I run the code above, I get the desired arrangement of columns, but instead of the values I get NaN, essentially ending up with a dataframe of 401 columns in which only the created_at column contains values.
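A tiny example that reproduces the pattern (made-up column names):

```python
import pandas as pd

df1 = pd.DataFrame({'created_at': ['23/2/2011 10:00'], 'pos_user1': [1.12]})
df2 = pd.DataFrame({'created_at': ['25/3/2011 23:00'], 'pos_user2': [0.12]})

# concat stacks the frames row-wise, so every score column is NaN
# for the rows that came from the other dataframe
print(pd.concat([df1, df2]))
#         created_at  pos_user1  pos_user2
# 0  23/2/2011 10:00       1.12        NaN
# 0  25/3/2011 23:00        NaN       0.12
```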

Any and all help is appreciated.

Thank you

EDIT

I have tried various solutions to similar questions posted here, but none of them seem to work, and so as a last resort I have started this thread.

EDIT 2

I have perhaps come up with a partial solution to my problem. Using the code below, I can append all the columns into frame. However, this keeps a separate created_at column (of type object) for every file. If I could merge all the dates into one column, then my troubles would be much closer to being solved.

```python
all_frames = []

for f in allFiles:
    file = open(f, 'r')
    count = count + 1
    _, fname = os.path.split(f)
    df = pd.read_csv(f)

    dates = df.iloc[:, 0]
    neut = df.iloc[:, 1]
    pos = df.iloc[:, 2]
    neg = df.iloc[:, 3]
    comp = df.iloc[:, 4]

    all_frames.append(dates)
    all_frames.append(neut)
    all_frames.append(pos)
    all_frames.append(neg)
    all_frames.append(comp)

frame = pd.concat(all_frames, axis=1)
```
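What I think I need is to make the dates the index of each frame instead of appending them as regular columns, something like the following sketch (it assumes the first column of every file is created_at; collapsing repeated timestamps with `mean` is just one possible choice):

```python
all_frames = []
for f in allFiles:
    df = pd.read_csv(f)
    df = df.set_index(df.columns[0])   # dates become the index
    # collapse repeated timestamps within a file, otherwise the
    # axis=1 concat below fails with a shape-mismatch ValueError
    df = df.groupby(level=0).mean()
    all_frames.append(df)

frame = pd.concat(all_frames, axis=1)  # aligns on the shared date index
frame = frame.reset_index()            # back to a single created_at column
```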

Any help would be appreciated

Stevi G
    Possible duplicate of [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – G. Anderson Aug 14 '19 at 16:26
  • Bottom line, you should not be using `append()` for dataframes, this sounds like a straightforward `merge()`, or failing that, `concat()` – G. Anderson Aug 14 '19 at 16:27
  • Possible duplicate of [Efficient chain merge in pandas](https://stackoverflow.com/questions/40566541/efficient-chain-merge-in-pandas) – Ben.T Aug 14 '19 at 16:28
  • @G.Anderson I did use concat. The use of `append` in the code was a typing error – Stevi G Aug 14 '19 at 16:32
  • and in your case, it would be something like `pd.concat([pd.read_csv(f).set_index("created_at", drop=True) for f in allFiles], axis=1)` – Ben.T Aug 14 '19 at 16:33
  • @Ben.T the solution provided in [Efficient chain merge in pandas](https://stackoverflow.com/questions/40566541/efficient-chain-merge-in-pandas) does not work. I will try the above method in a sec. **Edit:** in both cases I get the error `ValueError: Shape of passed values is (32840, 400), indices imply (31234, 400)` – Stevi G Aug 14 '19 at 16:39
  • @SteviG is it possible that you have duplicated created_at within a same dataframe? – Ben.T Aug 14 '19 at 16:50
  • @Ben.T no it is not possible. I have checked all the dataframes and all of them have 5 columns as they should. A duplicated created_at would mean a 6th column being present – Stevi G Aug 14 '19 at 16:57
  • @SteviG then maybe with reduce: `reduce( (lambda x, y: pd.merge(x,y,on='created_at', how='outer')), list)`. – Ben.T Aug 14 '19 at 17:20
  • @Ben.T doesn't work. Gives memory error – Stevi G Aug 14 '19 at 17:27
  • I would suggest to convert the dataframes to parquet file format and then apply `mergeSchema` option on them. The schema is easy to merge if the files are in parquet format. However, it can't be done using pandas dataframe. You may have to use parquet-based library or spark for that. – ranka47 Aug 14 '19 at 21:13
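The ValueError quoted above (32840 rows passed vs 31234 implied) is consistent with the duplicate-timestamp hypothesis raised in these comments. A quick check, sketched under the assumption that every CSV has a created_at column:

```python
import pandas as pd

# flag any file that contains the same timestamp more than once
for f in allFiles:
    df = pd.read_csv(f)
    n_dupes = df['created_at'].duplicated().sum()
    if n_dupes:
        print(f, 'has', n_dupes, 'repeated created_at values')
```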

1 Answer


I strongly suggest you revisit your data model. Having that many columns usually signals that something is wrong. Having said that, here's one way to do it. Also, `list` is a built-in type; don't shadow it with a variable name.

I assume that other than created_at, the columns from each file are unique.

```python
all_frames = []
for f in allFiles:
    # parse created_at and make it the index so the frames align on it
    df = pd.read_csv(f, parse_dates=['created_at'], index_col='created_at')
    all_frames.append(df)

# This will create a dataframe of size n * 400
# n is the total number of rows between all files
frame = pd.concat(all_frames, join='outer', copy=False, sort=False)

# If you want to line up the hour across all users, collapse rows that
# share a timestamp, keeping the first non-NaN value in each column
frame = frame.groupby(level=0).first()
```
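For intuition, here's what that final `groupby` does on a toy frame (the user columns are made up):

```python
import pandas as pd

idx = pd.to_datetime(['2011-02-23 10:00', '2011-02-23 10:00'])
toy = pd.DataFrame({'pos_user1': [1.12, None],
                    'pos_user2': [None, 0.55]}, index=idx)

# first() takes the first non-NaN value per column within each timestamp
print(toy.groupby(level=0).first())
#                      pos_user1  pos_user2
# 2011-02-23 10:00:00       1.12       0.55
```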
Code Different
  • You are right in assuming that apart from the _created_at_ column, all the dataframes have unique columns. However, trying this, I get the same error as I originally got. I end up with **400** columns instead of **401**, and except for the created_at column all the others have NaN values in them. Also, `print(all_frames)` prints all 100 dataframes with **4** columns instead of 5. – Stevi G Aug 15 '19 at 06:22
  • I have edited my original post to update it with a **_partial_** solution. I need to be able to merge 100 created_at columns into one; if I can do that, then I may be able to solve the problem – Stevi G Aug 15 '19 at 07:10