2

I have a dataframe like the one below:

import numpy as np
import pandas as pd

data = {'Words':['actually','he','came','from','home','and','played'],
        'Col2':['2','0','0','0','1','0','3']}
data = pd.DataFrame(data)

The dataframe has two string columns, Words and Col2, with one row per word.

I write this dataframe to disk with the following command:

np.savetxt('/folder/file.txt', data.values,fmt='%s', delimiter='\t')

The next script then reads it with this line:

data = load_file('/folder/file.txt') 

Here is the load_file function that reads the text file:

def load_file(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        data = f.readlines()
    return data

The result is a list of tab-separated lines.

print(data)

gives me the following output:

['actually\t2\n', 'he\t0\n', 'came\t0\n', 'from\t0\n', 'home\t1\n', 'and\t0\n', 'played\t3\n']

I don't want to write the file to disk and then read it back for processing. Instead, I want to convert the dataframe to a list of tab-separated strings and process it directly. How can I achieve this?
I checked existing answers, but most of them convert a list to a dataframe rather than the other way around. Thanks in advance.

Varun kadekar

3 Answers

2

Try using .to_csv()

df_list = data.to_csv(header=None, index=False, sep='\t').splitlines()

df_list:

['actually\t2',
 'he\t0',
 'came\t0',
 'from\t0',
 'home\t1',
 'and\t0',
 'played\t3'
]

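To keep the newline at the end of each element:

df_list = data.to_csv(header=None, index=False, sep='\t').rstrip().replace('\n', '\n\\n').split('\\n')

df_list:

['actually\t2\n',
 'he\t0\n',
 'came\t0\n',
 'from\t0\n',
 'home\t1\n',
 'and\t0\n',
 'played\t3'
]

Note that the last element has no trailing \n, because rstrip() removes the final newline before the split.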
Pygirl
  • Thanks @pygirl... exactly what I needed :) – Varun kadekar Jan 29 '21 at 07:14
  • I get '' as the last element and had to remove it. Not sure whether you hit the same thing, but at least it's not visible in the output you printed above. – Varun kadekar Jan 29 '21 at 10:36
  • Actually, I removed it manually from the output. Yes, I ran into the same thing, because I added \\n after the last row as well. This should be avoided for the last value. You can use a loop to filter out the empty string. – Pygirl Jan 29 '21 at 10:37
  • Ah... now I see, the comma at the end was indeed my mistake. But even after removing it, I got an empty element at the end. Something like this: ['actually\t2\r', 'he\t0\r', 'came\t0\r', 'from\t0\r', 'home\t1\r', 'and\t0\r', 'played\t3\r', ''] – Varun kadekar Jan 29 '21 at 10:42
  • Sorry, I was in a hurry. Correction: it's because the last row contains `\n`. One way of solving this is to use a regex, or you can filter it out using a loop. I will update my answer :) – Pygirl Jan 29 '21 at 10:43
  • @Varunkadekar: I have updated my answer :) The string has `\n` at the end, which can be removed easily with `rstrip()`. – Pygirl Jan 29 '21 at 10:48
  • Thanks... I had excluded the final element in the next step, but this does the trick in one line. Kudos :) – Varun kadekar Jan 29 '21 at 14:55
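
The trailing '' and the stray '\r' discussed in the comments above both come from splitting the string on a literal '\n'. As a minimal sketch of an alternative (not the answerer's method), assuming the same `data` frame as in the question, `str.splitlines(keepends=True)` splits on any line ending, keeps the terminator on every element, and does not produce a trailing empty string:

```python
import pandas as pd

# Same frame as in the question.
data = pd.DataFrame({
    'Words': ['actually', 'he', 'came', 'from', 'home', 'and', 'played'],
    'Col2': ['2', '0', '0', '0', '1', '0', '3'],
})

# With no path argument, to_csv returns the CSV text as a string
# instead of writing a file.
text = data.to_csv(header=False, index=False, sep='\t')

# keepends=True keeps each line's terminator; no trailing '' is produced.
df_list = text.splitlines(keepends=True)
print(df_list)
```

Where the CSV text uses '\r\n' line endings, each element ends in '\r\n' rather than '\n'. `splitlines()` also splits on a few less common separators (form feed, for example), so this sketch assumes the cells contain none of them.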
1

I think this achieves the same result without writing to the drive:

df_list = list(data.apply(lambda row: row['Words'] + '\t' + row['Col2'] + '\n', axis=1))
  • @Varunkadekar: Just for info, apply should be avoided where possible because it is slow :) https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code – Pygirl Jan 29 '21 at 10:50
  • Thanks @Pygirl, I really appreciate the pointers you're giving here. I was under the impression that apply was better than a for loop... at least in R. – Varun kadekar Jan 29 '21 at 14:43
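
Following up on the comment about apply, here is a sketch of the same concatenation done on whole columns rather than row by row. It assumes, as in the question, that both columns already hold strings:

```python
import pandas as pd

# Same frame as in the question.
data = pd.DataFrame({
    'Words': ['actually', 'he', 'came', 'from', 'home', 'and', 'played'],
    'Col2': ['2', '0', '0', '0', '1', '0', '3'],
})

# Element-wise string concatenation over entire columns; no per-row
# Python function call is involved.
df_list = (data['Words'] + '\t' + data['Col2'] + '\n').tolist()
print(df_list)
```

If Col2 held numbers instead of strings, it would need `data['Col2'].astype(str)` before the concatenation.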
1

Try:

data.apply("\t".join, axis=1).tolist()
Lambda
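
The one-liner above relies on every cell being a string, since `"\t".join` raises a TypeError on non-string items. A minimal sketch for a frame with a numeric column follows; the `mixed` frame is a hypothetical variant of the question's data, not something from the original post. It also appends '\n' so the result matches the readlines() format from the question:

```python
import pandas as pd

# Hypothetical variant of the question's frame with an integer Col2.
mixed = pd.DataFrame({'Words': ['actually', 'he', 'came'],
                      'Col2': [2, 0, 0]})

# Cast every cell to str first so that "\t".join accepts each row.
rows = mixed.astype(str).apply('\t'.join, axis=1).tolist()

# Append a newline to each row to mimic what readlines() returns.
df_list = [row + '\n' for row in rows]
print(df_list)
```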