
This question is a follow-up to the previous post.

The solutions proposed there worked very well for a smaller data set. Here I am working with 7 .txt files totalling about 750 MB, which shouldn't be too big, so I must be doing something wrong in the process.

import pandas as pd

df1 = pd.read_csv('Data1.txt', skiprows=0, delimiter=' ', usecols=[1, 2, 5, 7, 8, 10, 12, 13, 14])
df2 = pd.read_csv('Data2.txt', skiprows=0, delimiter=' ', usecols=[1, 2, 5, 7, 8, 10, 12, 13, 14])
df3 = ...
df4 = ...
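
For reference, the seven reads can also be collapsed into a loop. The filenames Data3.txt through Data7.txt below are assumed to follow the same pattern as Data1.txt and Data2.txt:

# Assumed filenames, following the Data1.txt / Data2.txt pattern above
files = ['Data{}.txt'.format(i) for i in range(1, 8)]
cols = [1, 2, 5, 7, 8, 10, 12, 13, 14]

# Read each space-delimited file with the same column selection
df1, df2, df3, df4, df5, df6, df7 = [
    pd.read_csv(f, skiprows=0, delimiter=' ', usecols=cols) for f in files
]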

This is what one of my dataframes (df1) looks like - head:

  name_profile depth           VAR1  ...  year  month  day
0  profile_1   0.6           0.2044  ...  2012     11  26
1  profile_1   0.6           0.2044  ...  2012     11  26
2  profile_1   1.1           0.2044  ...  2012     11  26
3  profile_1   1.2           0.2044  ...  2012     11  26
4  profile_1   1.4           0.2044  ...  2012     11  26
...

And tail:

       name_profile     depth              VAR1  ...  year  month  day
955281  profile_1300   194.600006          0.01460  ...  2015      3  20
955282  profile_1300   195.800003          0.01095  ...  2015      3  20
955283  profile_1300   196.899994          0.01095  ...  2015      3  20
955284  profile_1300   198.100006          0.00730  ...  2015      3  20
955285  profile_1300   199.199997          0.01825  ...  2015      3  20

I followed a suggestion and dropped duplicates:

df1 = df1.drop_duplicates()
...

etc.

Similarly, df2 contains VAR2, df3 contains VAR3, and so on.

The solution below is adapted from one of the answers to the previous post.

The aim is to create a new, merged DataFrame with each VARX (from its dfX) as an additional column alongside depth, name_profile, and the other three key columns (year, month, day), so I tried something like this:

dfs = [df.set_index(['depth','name_profile', 'year', 'month', 'day']) for df in [df1, df2, df3, df4, df5, df6, df7]]

df_merged = (pd.concat(dfs, axis=1).reset_index())

The current error is:

ValueError: cannot handle a non-unique multi-index!
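
For context, the error can be reproduced with a small made-up example (not the real data): whenever two rows share the same combination of key columns, the index built from those keys is not unique, and pd.concat along axis=1 cannot align the frames.

import pandas as pd

# Two toy frames whose key columns contain a duplicated combination
a = pd.DataFrame({'depth': [0.6, 0.6, 1.1], 'name_profile': ['p1', 'p1', 'p1'], 'VAR1': [1, 2, 3]})
b = pd.DataFrame({'depth': [0.6, 0.6, 1.4], 'name_profile': ['p1', 'p1', 'p1'], 'VAR2': [4, 5, 6]})

keys = ['depth', 'name_profile']
# Both indexes contain the tuple (0.6, 'p1') twice, so the alignment is ambiguous
# and concat raises an error (the exact message depends on the pandas version;
# on the version used here it is the non-unique multi-index ValueError)
pd.concat([a.set_index(keys), b.set_index(keys)], axis=1)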

What am I doing wrong?

PEBKAC
  • You don't need Dask for this; the file size is trivial for any modern system. – roganjosh Apr 12 '19 at 17:00
  • but what can the problem be then? – PEBKAC Apr 12 '19 at 17:00
  • Do you mind clarifying whether the files are 750 MB on disk or in RAM? – rpanai Apr 12 '19 at 17:01
  • `reduce` is a very intensive process as it nests with each iteration. Use `concat` instead. – Parfait Apr 12 '19 at 17:01
  • It's not possible to say without seeing the pandas code but something is not right at all – roganjosh Apr 12 '19 at 17:01
  • The problem is here: `dfs2 = [dfs1, df3]`. `dfs1` is, itself, a list of dataframes. You perhaps wanted to `extend` the list or `append` to it, not nest it. – roganjosh Apr 12 '19 at 17:06
  • Gotcha! OK, so say I try @Parfait's solution and use `concat` instead. I'll modify the question accordingly now, give me a second. I get an error of ValueError: cannot handle a non-unique multi-index! – PEBKAC Apr 12 '19 at 17:08
  • Once again, de-dupe your data on the keys with `drop_duplicates(...)` or run an aggregation to pick the first pairing, `groupby(...).first()` (a sketch of the aggregation appears after this comment thread). – Parfait Apr 12 '19 at 17:09
  • @Parfait, I used drop_duplicates, but it still gives me the same error. I'm sorry, I'm quite new at this. How should I regroup the dataframes then; what exactly do you mean? I didn't quite understand the second part of your suggestion. – PEBKAC Apr 12 '19 at 17:16
  • Please show your data, attempted code, and errors/undesired results. – Parfait Apr 12 '19 at 17:25
  • @Parfait, there you go. I just edited the post to show you what the data looks like – PEBKAC Apr 12 '19 at 17:50
  • Is there a reason you have duplicate *profile* and *depth* on the same year, month, and day? You omit columns. Also, please read the [docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html). I mentioned to subset on the keys (not all columns, the default). – Parfait Apr 12 '19 at 18:00
  • yes, those are measurements and sometimes sensors have multiple measurements on a particular depth quota – PEBKAC Apr 12 '19 at 18:02
  • Look at docs and see *how* you want to de-dupe. You do not want to drop important rows. Aggregate with `groupby` if you need `max`, `min`, `mean`, etc. – Parfait Apr 12 '19 at 18:02
  • I just want to merge these files; each has measurements of one variable. They are at different depth quotas, so instead of having 7 files, I want to merge them into one file with all depth quotas and correspondingly assign `NaN` where a certain variable wasn't measured. It's a very, very big file with lots of rows, so there are bound to be duplicates, thus that is not a suitable solution. I am not looking for mean, min, or max, but for a MERGED document based on 7 separate ones. What should I do? – PEBKAC Apr 12 '19 at 18:05
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/191757/discussion-between-pebkac-and-parfait). – PEBKAC Apr 12 '19 at 18:06
  • @roganjosh, I tried, but I don't know how to do it otherwise. If I use this: `dfs = [df1, df2, df3, df4, ...]; merge = partial(pd.merge, on=['depth','profile'], how='outer'); reduce(merge, dfs)`, that's when it gives me the MemoryError. Do you have any other suggestion? – PEBKAC Apr 12 '19 at 18:26
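
A sketch of the aggregation alternative suggested in the comments above, assuming `.first()` as the aggregation (`mean`, `min`, or `max` would work the same way); it collapses repeated key combinations to one row per key before concatenating:

keys = ['depth', 'name_profile', 'year', 'month', 'day']

# One row per key combination, keeping the first measurement of each variable
dfs = [df.groupby(keys).first() for df in [df1, df2, df3, df4, df5, df6, df7]]
df_merged = pd.concat(dfs, axis=1).reset_index()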

1 Answer


Consider again horizontal concatenation with pandas.concat. Because you have multiple rows sharing the same profile, depth, year, month, and day, add a running count to the multi-index, calculated with groupby().cumcount():

grp_cols = ['depth', 'name_profile', 'year', 'month', 'day']

# cumcount numbers the repeats within each key combination (0, 1, 2, ...),
# so grp_cols + grp_count form a unique MultiIndex that concat can align on
dfs = [(df.assign(grp_count = df.groupby(grp_cols).cumcount())
          .set_index(grp_cols + ['grp_count'])
       ) for df in [df1, df2, df3, df4, df5, df6, df7]]

df_merged = pd.concat(dfs, axis=1).reset_index()

print(df_merged)
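
A minimal illustration of why this works, using made-up data rather than your files: the running count makes each MultiIndex entry unique, so the concatenation aligns on the keys and leaves NaN wherever a variable has no measurement for a given combination:

import pandas as pd

d1 = pd.DataFrame({'depth': [0.6, 0.6, 1.1], 'name_profile': ['p1', 'p1', 'p1'], 'VAR1': [0.20, 0.21, 0.19]})
d2 = pd.DataFrame({'depth': [0.6, 1.4], 'name_profile': ['p1', 'p1'], 'VAR2': [7.1, 7.3]})

keys = ['depth', 'name_profile']
frames = [(d.assign(grp_count=d.groupby(keys).cumcount())
             .set_index(keys + ['grp_count']))
          for d in (d1, d2)]

# Rows present in only one frame get NaN in the other frame's variable column
print(pd.concat(frames, axis=1).reset_index())
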
Parfait
  • I think this might be it, yes!!! THANK YOU SO MUCH! I'll take a closer look tomorrow, now I'm KO, but that might be it! What's the point of this `cumcount` function by the way? – PEBKAC Apr 12 '19 at 19:47
  • Please see my opening text: `cumcount` is there to resolve your repeated profile/depth rows. – Parfait Apr 12 '19 at 20:16