0

I have a file which I read in as a string. In sublime the file looks like this:

Filename
Dataset
Level
Duration
Accuracy
Speed Ratio
Completed
file_001.mp3
datasetname_here
value
00:09:29
0.00%
7.36x
2019-07-18
file_002.mp3
datasetname_here
value
00:22:01
...etc.

in Bash:

['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...etc.

I want to split this into a 7 column csv. As you can see, the values repeat every 8th line. I know I can use a for loop and modulus to read each line. I have done this successfully before.

How can I use pandas to read things into columns?

I don't know how to approach the Pandas library. I have looked at other examples and all seem to start with csv.

import sys 

parser = argparse.ArgumentParser()
parser.add_argument('file' , help = "this is the file you want to open")

args = parser.parse_args()
print("file name:" , args.file)

with open(args.file , 'r') as word:
    print(word.readlines())  ###here is where i was making sure it read in properly

###here is where I will start to manipulate the data

This is the Bash output:

['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...]
Arkistarvh Kltzuonstev
  • 6,824
  • 7
  • 26
  • 56
  • Possible duplicate of [Splitting rows into multiple columns in Pandas](https://stackoverflow.com/questions/46363314/splitting-rows-into-multiple-columns-in-pandas) – Renat Aug 19 '19 at 08:50

3 Answers3

2

First remove '\n':

raw_data = ['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', '0.01%\n', '7.39x\n', '2019-07-20\n']
raw_data = [string.replace('\n', '') for string in raw_data]

Then pack your data in 7-length arrays inside a big array:

data = [raw_data[x:x+7] for x in range(0, len(raw_data),7)]

Finally read your data as a DataFrame, the first row contains the name of the columns:

df = pd.DataFrame(data[1:], columns=data[0])
print(df.to_string())

       Filename           Dataset  Level  Duration Accuracy Speed Ratio   Completed
0  file_001.mp3  datasetname_here  value  00:09:29    0.00%       7.36x  2019-07-18
1  file_002.mp3  datasetname_here     L1  00:20:01    0.01%       7.39x  2019-07-20
SmileyProd
  • 788
  • 4
  • 13
1

Try This

import numpy as np
import pandas as pd
with open ("data.txt") as f:
    list_str = f.readlines()
list_str = map(lambda s: s.strip(), list_str) #Remove \n
n=7
list_str = [list_str[k:k+n] for k in range(0, len(list_str), n)]
df = pd.DataFrame(list_str[1:])
df.columns = list_str[0]
df.to_csv("Data_generated.csv",index=False)   

Pandas is not a library to read into columns. It supports many formats to read and write (One of them is comma separated values) and mainly used as python based data analysis tool. Best place to learn is see their documentation and practice.

Output of above code

enter image description here

Hima
  • 11,268
  • 3
  • 22
  • 31
0

I think you don't have to use pandas or any other library. My approach:

data = []
row = []
with open(args.file , 'r') as file:
    for line in file:
        row.append(line)
        if len(row) == 7:
            data.append(row)
            row = []

How does it work?

  • The for loop reads the file line by line.
  • Add the line to row
  • When row's length is 7, it's completed and you can add the row to data
  • Create a new list for row Repeat
bb1950328
  • 1,403
  • 11
  • 18