sort one giant string into 7 columns

Question

I have a file which I read in as a string. In sublime the file looks like this:

Filename
Dataset
Level
Duration
Accuracy
Speed Ratio
Completed
file_001.mp3
datasetname_here
value
00:09:29
0.00%
7.36x
2019-07-18
file_002.mp3
datasetname_here
value
00:22:01
...etc.

in Bash:

['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...etc.

I want to split this into a 7 column csv. As you can see, the values repeat every 8th line. I know I can use a for loop and modulus to read each line. I have done this successfully before.

How can I use pandas to read things into columns?

I don't know how to approach the Pandas library. I have looked at other examples and all seem to start with csv.

import sys 

parser = argparse.ArgumentParser()
parser.add_argument('file' , help = "this is the file you want to open")

args = parser.parse_args()
print("file name:" , args.file)

with open(args.file , 'r') as word:
    print(word.readlines())  ###here is where i was making sure it read in properly

###here is where I will start to manipulate the data

This is the Bash output:

['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...]

Possible duplicate of [Splitting rows into multiple columns in Pandas](https://stackoverflow.com/questions/46363314/splitting-rows-into-multiple-columns-in-pandas) — Renat, Aug 19 '19 at 08:50

score 2 · Accepted Answer · answered Aug 19 '19 at 08:55

First remove '\n':

raw_data = ['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', '0.01%\n', '7.39x\n', '2019-07-20\n']
raw_data = [string.replace('\n', '') for string in raw_data]

Then pack your data in 7-length arrays inside a big array:

data = [raw_data[x:x+7] for x in range(0, len(raw_data),7)]

Finally read your data as a DataFrame, the first row contains the name of the columns:

df = pd.DataFrame(data[1:], columns=data[0])
print(df.to_string())

       Filename           Dataset  Level  Duration Accuracy Speed Ratio   Completed
0  file_001.mp3  datasetname_here  value  00:09:29    0.00%       7.36x  2019-07-18
1  file_002.mp3  datasetname_here     L1  00:20:01    0.01%       7.39x  2019-07-20

Why have you put in the data in line one? ['Filename\n' , 'Dataset\n' ...] I will be loading in strings of arbitratry length ```raw_data = word``` Will this not work? — Morris Gevirtz, Aug 23 '19 at 08:54
The `raw_data` is your bash output but you can easily adapt this if you get the words one by one by just appending the words to a list. — SmileyProd, Aug 23 '19 at 09:31

score 1 · Answer 2 · answered Aug 19 '19 at 09:08

Try This

import numpy as np
import pandas as pd
with open ("data.txt") as f:
    list_str = f.readlines()
list_str = map(lambda s: s.strip(), list_str) #Remove \n
n=7
list_str = [list_str[k:k+n] for k in range(0, len(list_str), n)]
df = pd.DataFrame(list_str[1:])
df.columns = list_str[0]
df.to_csv("Data_generated.csv",index=False)

Pandas is not a library to read into columns. It supports many formats to read and write (One of them is comma separated values) and mainly used as python based data analysis tool. Best place to learn is see their documentation and practice.

Output of above code

score 0 · Answer 3 · answered Aug 19 '19 at 09:03

I think you don't have to use pandas or any other library. My approach:

data = []
row = []
with open(args.file , 'r') as file:
    for line in file:
        row.append(line)
        if len(row) == 7:
            data.append(row)
            row = []

How does it work?

The for loop reads the file line by line.
Add the line to row
When row's length is 7, it's completed and you can add the row to data
Create a new list for row Repeat

sort one giant string into 7 columns

3 Answers3

Output of above code