
I want to read a TSV file, transform it into a specific pattern, and write the result back out as TSV. I wrote the code in Python using pandas, but I cannot run it on the full file because it takes too much memory.

I want to do the same thing in Spark with Scala, but there is no melt function in Scala.

My Python code:

    import pandas as pd

    dir = "related_path"
    file = 'file_name.tsv'

    file_in = dir + file
    file_out = dir + 'result.tsv'

    df = pd.read_csv(file_in, sep='\t')
    # unpivot every column except the first into (variable, value) pairs
    df1 = df.melt(id_vars='Unnamed: 0')
    df1.columns = ['col1', 'col2', 'col3']
    df1.index.name = 'index'
    print(df1)
    df1.to_csv(file_out, index=None, sep='\t', mode='a')

The TSV does not contain a header row.

DataFrame of the TSV file (df):

           Unnamed: 0    A-4    A-5  Unnamed: 3   A-12
    index
    0              AB    NaN  0.019         NaN   0.10
    1              AC  0.017  0.140       0.144   0.18
    2             NaN  0.050  0.400         NaN   0.17
    3              AE  0.890  0.240       0.450   0.13

`Unnamed: 0    A-4    A-5    Unnamed: 3    A-12` is itself a data row in the file (the file has no header); pandas read it as the header, which is where the `Unnamed:` names come from.

Output DataFrame (df1):

           col1        col2   col3
    index
    0        AB         A-4    NaN
    1        AC         A-4  0.017
    2       NaN         A-4  0.050
    3        AE         A-4  0.890
    4        AB         A-5  0.019
    5        AC         A-5  0.140
    6       NaN         A-5  0.400
    7        AE         A-5  0.240
    8        AB  Unnamed: 3    NaN
    9        AC  Unnamed: 3  0.144
    10      NaN  Unnamed: 3    NaN
    11       AE  Unnamed: 3  0.450
    12       AB        A-12  0.100
    13       AC        A-12  0.180
    14      NaN        A-12  0.170
    15       AE        A-12  0.130

`df.melt(id_vars='Unnamed: 0')` is the line that performs the conversion into the output DataFrame.

How can I do this in Scala, given that there is no built-in melt function?

The complexity should not be O(n^2).
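For reference, the usual way to reproduce melt in Spark with Scala is to build an array of structs, one per column to unpivot, and explode it. Below is a minimal sketch of such a helper; the helper name, the temporary `_vars_and_vals` column, and the default `variable`/`value` output names are choices made for this sketch (following the approach in the linked "How to melt Spark DataFrame?" post), not part of Spark's API:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

    // Unpivots `valueVars` into (varName, valueName) pairs while keeping `idVars`.
    // Assumes the value columns share one type (true for a CSV/TSV read without
    // schema inference, where every column is a string).
    def melt(df: DataFrame,
             idVars: Seq[String],
             valueVars: Seq[String],
             varName: String = "variable",
             valueName: String = "value"): DataFrame = {
      // One struct per column to unpivot: (literal column name, column value).
      val varsAndVals = explode(array(
        valueVars.map(c => struct(lit(c).alias(varName), col(c).alias(valueName))): _*
      ))
      val keep = idVars.map(col)
      df.select(keep :+ varsAndVals.alias("_vars_and_vals"): _*)
        .select(keep ++ Seq(col(s"_vars_and_vals.$varName"),
                            col(s"_vars_and_vals.$valueName")): _*)
    }

Because this is a single `explode` over the rows, the work grows with rows × melted columns (the size of the output), not O(n^2).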

  • check [How to melt Spark DataFrame?](https://stackoverflow.com/questions/41670103/how-to-melt-spark-dataframe) – anky Jan 30 '21 at 05:19
  • Yes, that post contains a user-defined Spark melt, but how do I do it for unnamed columns? – Devashish Jan 30 '21 at 05:23
  • Just give your columns a name using `val df2 = df.toDF("col1","col2",...)` – mck Jan 30 '21 at 08:12
  • My output is only obtained if the first row is treated as the column names. The expected output is not obtained by giving the columns names – Devashish Jan 30 '21 at 11:42
  • So @Dee, what you mean is that it can't be implemented in Scala but can be implemented in PySpark? I need a melt function like the one I specified in Python; it should take an id_vars parameter for a column that has no name – Devashish Feb 01 '21 at 16:36
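As a sketch of how the pieces discussed in the comments could fit together with the melt helper above (the path is the one from the question; everything else is an assumption, and the names Spark substitutes for the blank header cells may differ from pandas' `Unnamed: 0` / `Unnamed: 3`):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("melt-tsv").getOrCreate()

    // header = "true" mirrors what pandas did here: the first line of the file
    // is promoted to column names, and the remaining lines become data rows.
    // Alternatively, read with header = "false" and name the columns yourself
    // with toDF("col1", "col2", ...), as mck suggested.
    val df = spark.read
      .option("sep", "\t")
      .option("header", "true")
      .csv("related_path/file_name.tsv")

    // Melt every column except the first, then rename to match the pandas output.
    val result = melt(df, idVars = Seq(df.columns.head), valueVars = df.columns.tail)
      .toDF("col1", "col2", "col3")

    // Note: this writes a directory of part files, not a single result.tsv.
    result.write.option("sep", "\t").option("header", "false").csv("related_path/result")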
