2

I am trying to split a txt file with multiple lines into separate variables. The text is an output of volume information with names, data sizes, etc., and I want to split each dataset into a specific variable but can't seem to get it working.

As an example, I am trying to split this data set into a variable for each item:

/vol0                                abcd4     Object RAID6+  228.33 GB         --  400.00 GB  Online
/vole1                               abcd1     Object RAID6+   44.19 TB   45.00 TB   45.00 TB  Online
/vole2                               abcd4     Object RAID6+   11.27 TB   11.00 TB   12.00 TB  Online
/vol3                                abcd4     Object RAID6+    9.50 TB         --   10.00 TB  Online
/vol4                                abcd1     Object RAID6+   18.39 TB         --   19.10 TB  Online

This is the code I've run, but I keep getting an error about "not enough values to unpack":

inputfile = "dataset_input.txt"
with open(inputfile, "r") as input:
    for row in input:
        vol, bs, obj, raid, used, uunit, quota, qunit, q2, q2unit, status = row.split()

I can split the file on whitespace just by doing the code below, and it works. I just can't seem to get it into separate variables so I can manipulate the datasets.

for row in input: #running through each row in the file
    output_text = row.split() #split the row based on the default white-space delimiter
    print(output_text)

I'm very new to Python, so I'm not sure if this is even possible, or how complicated it is.

  • Every iteration of your loop is going to reset the variables to whatever line you are iterating, leaving you with variables set to the last line in your file. The reason your code doesn't work, though, is that `split()` returns a single list with the strings as elements. [You would need a list comprehension or similar](https://stackoverflow.com/a/34654260/2221001) to do what you are trying to do. Reading this file into a dictionary or pandas dataframe would surely be a better route for whatever you are planning later on in this code. – JNevill Jan 27 '23 at 16:04
  • So the goal is to loop through the file line by line, separating by variables, and ultimately piping standard output to a text file once the loop is complete. And then going back to the next line, printing, etc. At least that's my goal. Edit: Thanks for the link, I'll check it out. – user21094545 Jan 27 '23 at 16:06
  • If you will be printing back out to `stdout` inside the loop then your `output_text = row.split()` is a reasonable way to handle this. `output_text` variable will contain a list that you can manipulate however you like. Once done manipulating you can `join()` it back together for printing like `print(' '.join(output_text))` which will stitch the list back into a string separated by spaces. – JNevill Jan 27 '23 at 16:09
  • At any rate, this sounds like an [XY problem](https://meta.stackexchange.com/a/66378/438222). It would help if you edit your question and explain fully what you are trying to accomplish (writing out to a new file a manipulated version of the file you are reading in). We can likely help get you to that end point. – JNevill Jan 27 '23 at 16:11
  • When you have empty values in columns - you will not have enough fields to assign values to all your variables. If possible change the format of your file into CSV, or pipe-delimited format (or similar). Alternatively put `--` into all columns that do not have values. – PM 77-1 Jan 27 '23 at 16:13
  • Yeah, ultimately my goal is to separate each data field into a variable and then be able to keep only the 4-5 that I want. So then I can discard the useless information and only keep vol, uunit, quota, and quota2, printed to my output. I'm trying to convert an old Perl script into Python. This is the part of the original I'm trying to imitate: # split out the data from the line. Split fields by space "my ($vol,$bs,$obj,$raid,$used,$uunit,$quota,$qunit,$q2,$q2unit,$status) = split(/\s+/,$_);" And after some variable manipulation: print "$vol,$gb_used,$gb_avail,$gb_size\n"; – user21094545 Jan 27 '23 at 16:15
  • Your rows look to be fixed length. Is that correct? – JonSG Jan 27 '23 at 16:35
  • The simplest approach is to accept the list from `mylist = row.split()` and obtain the elements you want by index, such as `var1 = mylist[0]`, using whatever index value is appropriate, 0 being the first element; see the sketch below. – user19077881 Jan 27 '23 at 16:49
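
Putting the last two comments together, a minimal sketch of the index-and-join approach (the variable names and the particular indexes chosen here are just examples):

inputfile = "dataset_input.txt"

with open(inputfile, "r") as f:
    for row in f:
        fields = row.split()        # list of whitespace-separated strings for this row
        vol = fields[0]             # pick out individual elements by index
        status = fields[-1]
        print(' '.join(fields))     # stitch the list back into a single string for output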

4 Answers

1

Firstly, what you have done is call the split method, which splits each row into a list on whitespace. That list can have a different length from the number of variables you have defined, so the unpacking fails; it only works when the number of fields matches the number of variables exactly.

Secondly, in every iteration of the for loop the same variables would be overwritten with new values, losing the previous iteration's values. You can solve this by appending the values to respective lists, one per variable.

Here is a simple solution in which you first read the entire text file's contents, preprocess them, and store the processed content into the required variable lists:

import re

with open("dataset_input.txt", "r") as fle:
    txt = fle.readlines()

n = len(txt)

# remove trailing newlines
for i in range(n):
    txt[i] = txt[i].rstrip('\n')

# collapse runs of two or more spaces to '#', then split on '#'
for i in range(n):
    txt[i] = re.sub(r'\s{2,}', '#', txt[i])
    txt[i] = txt[i].split('#')

# define the required variable lists
x1 = []
x2 = []
x3 = []
x4 = []
x5 = []
x6 = []
x7 = []

# append each field to its respective list
for i in txt:
    x1.append(i[0])
    x2.append(i[1])
    x3.append(i[2])
    x4.append(i[3])
    x5.append(i[4])
    x6.append(i[5])
    x7.append(i[6])

print(x1, x2, x3, x4, x5, x6, x7)

Also note that it is possible to improve the code by combining the list appending into the preprocessing stage itself, depending on how long you need to keep the original text file contents around; see the sketch below.
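
A minimal sketch of that combined version, assuming the same seven-column layout as the data above:

import re

x1, x2, x3, x4, x5, x6, x7 = [], [], [], [], [], [], []

with open("dataset_input.txt") as fle:
    for line in fle:
        # split each row on runs of two or more spaces in a single pass
        fields = re.split(r'\s{2,}', line.rstrip('\n'))
        x1.append(fields[0])
        x2.append(fields[1])
        x3.append(fields[2])
        x4.append(fields[3])
        x5.append(fields[4])
        x6.append(fields[5])
        x7.append(fields[6])

print(x1, x2, x3, x4, x5, x6, x7)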

Vishnu Balaji
0

The error "not enough values to unpack" is produced when executing this line of code: vol, bs, obj, raid, used, uunit, quota, qunit, q2, q2unit, status = row.split(). The reason is that you are unpacking 11 separate elements from each row, but, looking at the example you show, not every row contains 11 whitespace-separated words. Check this out:

with open(inputfile, "r") as input:
    for row in input:
        output = row.split()
        print("this row provides {} arguments".format(len(output)))
        print(output)

The output:

this row provides 10 arguments
['/vol0', 'abcd4', 'Object', 'RAID6+', '228.33', 'GB', '--', '400.00', 'GB', 'Online']
this row provides 11 arguments
['/vole1', 'abcd1', 'Object', 'RAID6+', '44.19', 'TB', '45.00', 'TB', '45.00', 'TB', 'Online']
this row provides 11 arguments
['/vole2', 'abcd4', 'Object', 'RAID6+', '11.27', 'TB', '11.00', 'TB', '12.00', 'TB', 'Online']
this row provides 10 arguments
['/vol3', 'abcd4', 'Object', 'RAID6+', '9.50', 'TB', '--', '10.00', 'TB', 'Online']
this row provides 10 arguments
['/vol4', 'abcd1', 'Object', 'RAID6+', '18.39', 'TB', '--', '19.10', 'TB', 'Online']

You then need to do some cleaning of your data set, or maybe an if statement on the length would be helpful. Looking at just the small portion of the data you provided, I see that the "--" mark means there is no volume, so you can replace the "--" mark with a pair of meaningful values (value + unit), for example 0 and any unit. This is how you might do it:

with open(inputfile, "r") as input:
    for row in input:
        output = row.replace("--", "0 0").split()
        print("this row provides {} arguments".format(len(output)))
        print(output)

and this would be the output:

this row provides 11 arguments
['/vol0', 'abcd4', 'Object', 'RAID6+', '228.33', 'GB', '0', '0', '400.00', 'GB', 'Online']
this row provides 11 arguments
['/vole1', 'abcd1', 'Object', 'RAID6+', '44.19', 'TB', '45.00', 'TB', '45.00', 'TB', 'Online']
this row provides 11 arguments
['/vole2', 'abcd4', 'Object', 'RAID6+', '11.27', 'TB', '11.00', 'TB', '12.00', 'TB', 'Online']
this row provides 11 arguments
['/vol3', 'abcd4', 'Object', 'RAID6+', '9.50', 'TB', '0', '0', '10.00', 'TB', 'Online']
this row provides 11 arguments
['/vol4', 'abcd1', 'Object', 'RAID6+', '18.39', 'TB', '0', '0', '19.10', 'TB', 'Online']
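
With every row now providing 11 arguments, the unpacking from the question works; a minimal sketch (the fields printed at the end are only an example of keeping the ones you need):

inputfile = "dataset_input.txt"

with open(inputfile, "r") as f:
    for row in f:
        vol, bs, obj, raid, used, uunit, quota, qunit, q2, q2unit, status = row.replace("--", "0 0").split()
        # keep only the fields you actually need, e.g.
        print(vol, used, uunit, quota, qunit)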
0

It looks to me like your data is a list of fixed-length records, and rather than using split() you might take slices based on your fixed-length fields. Ultimately, I would look at implementing this using Python's struct module (a sketch is at the end of this answer), but this might get you started processing a fixed-length record.

Let's start with some example data you read from your file and let's define a list of fixed length field specifications.

data = [
    "/vol0                                abcd4     Object RAID6+  228.33 GB         --  400.00 GB  Online",
    "/vole1                               abcd1     Object RAID6+   44.19 TB   45.00 TB   45.00 TB  Online",
    "/vole2                               abcd4     Object RAID6+   11.27 TB   11.00 TB   12.00 TB  Online",
    "/vol3                                abcd4     Object RAID6+    9.50 TB         --   10.00 TB  Online",
    "/vol4                                abcd1     Object RAID6+   18.39 TB         --   19.10 TB  Online"
]

##------------------------------
## Only you know for sure what the start and stop is of the fields in this fixed length record.
##------------------------------
fields = [
    {"name": "path", "starts_at": 0, "width": 37},
    {"name": "abc", "starts_at": 37, "width": 5},
    {"name": "type", "starts_at": 47, "width": 13},
    {"name": "size", "starts_at": 60, "width": 11},
    # ....
]
##------------------------------

Now, given your rows of data and the field definitions we can create a list of lists.

##------------------------------
## reshape as a list of lists
##------------------------------
data2 = [
    [
        row[field["starts_at"] : field["starts_at"] + field["width"]].strip()
        for field
        in fields
    ]
    for row
    in data
]
print(data2)
##------------------------------

This should give you (output wrapped here for readability):

[
    ['/vol0', 'abcd4', 'Object RAID6+', '228.33 GB'],
    ['/vole1', 'abcd1', 'Object RAID6+', '44.19 TB'],
    ['/vole2', 'abcd4', 'Object RAID6+', '11.27 TB'],
    ['/vol3', 'abcd4', 'Object RAID6+', '9.50 TB'],
    ['/vol4', 'abcd1', 'Object RAID6+', '18.39 TB']
]

I myself would rather work with a list of dict if possible, so given the data and field definitions above, I might use them like this...

##------------------------------
## reshape as a list of dict
##------------------------------
data2 = [
    {
        field["name"]: row[field["starts_at"] : field["starts_at"] + field["width"]].strip()
        for field
        in fields
    }
    for row
    in data

]

import json # only for printing a nice output
print(json.dumps(data2, indent=2))
##------------------------------

Giving you:

[
  {
    "path": "/vol0",
    "abc": "abcd4",
    "type": "Object RAID6+",
    "size": "228.33 GB"
  },
  {
    "path": "/vole1",
    "abc": "abcd1",
    "type": "Object RAID6+",
    "size": "44.19 TB"
  },
  {
    "path": "/vole2",
    "abc": "abcd4",
    "type": "Object RAID6+",
    "size": "11.27 TB"
  },
  {
    "path": "/vol3",
    "abc": "abcd4",
    "type": "Object RAID6+",
    "size": "9.50 TB"
  },
  {
    "path": "/vol4",
    "abc": "abcd1",
    "type": "Object RAID6+",
    "size": "18.39 TB"
  }
]
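
The struct module mentioned at the start of this answer can do the same slicing; here is a minimal sketch, assuming the same guessed field widths as the fields list above and that every record is at least that long:

import struct

# 37-char path, 5-char abc, 5 padding chars skipped, 13-char type, 11-char size
# (these widths are assumptions; adjust them to match your real layout)
record_format = "37s 5s 5x 13s 11s"

data2 = [
    [field.decode().strip() for field in struct.unpack_from(record_format, row.encode())]
    for row in data
]
print(data2)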
JonSG
0

If you wanted to keep your original approach, something like this will cater for the error of sometimes having only 10 'columns' instead of the expected 11:

with open('dataset_input.txt') as f:
    lines = f.readlines()

for line in lines:
    line = line.strip().split()  # Remove white space and split by space, returns a list
    if line[6] == '--':
        # This means there is no quota value present
        # so insert another -- to correct the length ('columns') of the line to 11
        line.insert(6, '--')
    vol, bs, obj, raid, used, uunit, quota, qunit, q2, q2unit, status = tuple(line)
    # Perform any calculations and prints you want here 
    # PER LINE (each iteration will overwrite the variables above)
    # Note that all variables will be strings. So convert if required.

You can of course change the "--" to anything you want, e.g.:

...
line.insert(6, '0')
...

and also change the original "--" (which ends up in qunit) as well, if you wish:

...
line[6] = '0'
line.insert(6, '0')
...

On an unrelated side note, you have input as your file handle in your original code. input is a Python built-in function; shadowing built-in names like this should be avoided when you choose any kind of identifier in your code.
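
As an illustration of the "calculations and prints" step in the loop above, one possible sketch keeps only a few fields and prints them comma-separated, loosely mimicking the Perl print mentioned in the question's comments (the chosen fields and formatting here are only an example, and the file handle is renamed to avoid shadowing input):

with open('dataset_input.txt') as f:
    for line in f:
        fields = line.strip().split()
        if fields[6] == '--':
            fields.insert(6, '--')   # pad the missing quota column to 11 fields
        vol, bs, obj, raid, used, uunit, quota, qunit, q2, q2unit, status = fields
        # print just the columns of interest, comma-separated
        print(f"{vol},{used} {uunit},{quota} {qunit},{q2} {q2unit}")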

bigkeefer