
I'm working on a text file that contains several kinds of information. I converted it into a list in Python, and now I'm trying to separate the different pieces of data into different lists. The data is laid out as follows:

CODE / DESCRIPTION / Unity / Value1 / Value2 / Value3 / Value4, and then the pattern repeats. An example would be:

P03133 Auxiliar helper un 203.02 417.54 437.22 675.80

My approach so far has been:

Creating lists to store each piece of information:

codes = []
description = []
unity = []
cost = []

Looping through the list to find a code, based on the code's structure, and using the code's index as the base for finding the remaining values.

Finding a code is easy, since it looks different from all the other data. For the remaining values I wrote a loop that finds the next numeric value after a code. That way I can work out the rest of the indexes:

  • The unity would be at the code's index + the offset to the first numeric value - 1, since it's the last item before the first numeric value in each line.

  • The cost would be at the code's index + the offset to the first numeric value + 2; the third value is the only one I need to store.

  • The description is a little harder, because the number of elements that make it up varies across the list. So I used slicing, starting at the code's index + 1 and ending at the offset to the first numeric value - 2.

for i, carc in enumerate(txtl):
    if carc[0] == "P" and carc[1].isnumeric():
        codes.append(carc)
        j = 0
        while not txtl[i+j].isnumeric():
            j = j + 1
        description.append(" ".join(txtl[i+1:i+j-2]))
        unity.append(txtl[i+j-1])
        cost.append(txtl[i+j])

I'm running into a problem with this approach. Although there will always be more elements in the list after a code, I'm getting this error:

  while not txtl[i+j].isnumeric():
    txtl[i+j] list index out of range.

I'd welcome any help debugging my code, or a different solution to the problem.

Note: I'm also going to have to do this for a very similar data source, where the code is just a sequence of 7 digits and is therefore harder to tell apart from the other data. Any solution that covers this case is also appreciated!

martineau
  • Can't you somehow convert your TXT file to CSV? CSV files are very easy to read from any language. Perhaps this might be your solution? – JeanLColombo May 24 '21 at 12:01
  • I don't have experience with CSV files... Do you think it would help delimit and manipulate each piece of data separately? – andrepadilha May 24 '21 at 12:06
  • That depends on your overall project. CSV means Comma-Separated Values. There are a lot of routines and code already available to read and manipulate this file format. – JeanLColombo May 24 '21 at 12:13
  • I see, thanks for the tip. The only thing I'm wondering is how I would identify the different values in there. It would still be a challenge imo – andrepadilha May 24 '21 at 12:25
  • That's a little bit tricky. Do you know what the TXT files are? Meaning, where do they come from, what info do they have, etc... If not, what you want to do becomes very hard. Otherwise, if you know what type of data each row within your file contains, it becomes much easier. There are lots of ways to extract that data into vectors, lists and matrices. – JeanLColombo May 25 '21 at 11:19

1 Answer

A slight addition to your code should resolve this:

        while i+j < len(txtl) and not txtl[i+j].isnumeric():
            j += 1

The first condition fails when out of bounds, so the second one doesn't get checked.
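A minimal sketch of the guarded loop, using a made-up `tokens` list in place of `txtl`. It also shows why indexing `txtl[i+j]` after the loop can still fail: if no token passes `isnumeric()`, the loop only stops because it ran out of list, and `j` then points one past the last element.

```python
# Hypothetical token list standing in for txtl; every value has a decimal point.
tokens = ["P03133", "Auxiliar", "helper", "un", "203.02", "417.54"]

i, j = 0, 0
# The bounds check comes first, so tokens[i + j] is never evaluated past the end.
while i + j < len(tokens) and not tokens[i + j].isnumeric():
    j += 1

# True when the loop stopped because the list ended, not because a number was found.
reached_end = (i + j == len(tokens))
print(j, reached_end)
```

Checking `reached_end` (or the equivalent condition) before using `i + j` as an index avoids the second out-of-range error.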

Also, please use a list of dict items instead of 4 different lists, for example:

thelist = []
thelist.append({'codes': 69, 'description': 'random text', 'unity': 'whatever', 'cost': 'your life'})

In this way you always have the correct values together in the list, and you don't need to keep track of where you are with indexes or other black magic...
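As a rough illustration of what that buys you, here is a small sketch with two records. The second record and the names `records`, `by_code` and `costs` are made up for the example.

```python
records = []
records.append({"code": "P03133", "description": "Auxiliar helper",
                "unity": "un", "cost": "437.22"})
# Invented second record, only to have more than one entry.
records.append({"code": "P03134", "description": "Sample item",
                "unity": "m", "cost": "12.50"})

# Look a record up by its code without tracking any indexes.
by_code = {record["code"]: record for record in records}
print(by_code["P03133"]["unity"])

# Pull a single field out of every record, converting it on the way.
costs = [float(record["cost"]) for record in records]
print(costs)
```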

EDIT after comment interactions: Ok, so in this case you split the line you are processing on the space character, and then process the words in the line.

from pprint import pprint  # just for pretty printing


textl = 'P03133 Auxiliar helper un 203.02 417.54 437.22 675.80'
the_list = []

def handle_line(textl: str):
    description = ''
    unity = None
    values = []
    for word in textl.split()[1:]:
        # it splits on space characters by default
        # you can ignore the first item in the list, as this will always be the code
        # str.isnumeric() doesn't work with floats, only integers. See https://stackoverflow.com/a/23639915/9267296
        if not word.replace(',', '').replace('.', '').isnumeric():
            if len(description) == 0:
                description = word
            else:
                description = f'{description} {word}' # I like f-strings
        elif not unity:
            # if unity is still None, that means it has not been set yet
            unity = word
        else:
            values.append(word)
    return {'code': textl.split()[0], 'description': description, 'unity': unity, 'values': values}

the_list.append(handle_line(textl))

pprint(the_list)    

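Note that str.isnumeric() returns False for any string containing a decimal point, such as '203.02', so it only recognizes whole numbers. With values like the ones in your example, a loop that waits for isnumeric() to return True will never find a match. See https://stackoverflow.com/a/23639915/9267296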
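If the unity is the string just before the first number, as described in the comments, a variation along these lines might be closer to what you need. This is only a sketch under a few assumptions: numbers look like `203` or `203.02`, no word in the description is itself a number, and the names `NUMBER`, `parse_line` and `code_pattern` are mine. The `code_pattern` argument is there so the same function can be tried on the 7-digit codes.

```python
import re

# Assumed number format: digits, optionally followed by '.' and more digits.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def parse_line(line: str, code_pattern: str = r"P\d+") -> dict:
    words = line.split()
    if not words or not re.fullmatch(code_pattern, words[0]):
        raise ValueError(f"line does not start with a code: {line!r}")
    # Position of the first numeric word after the code, or None if there is none.
    first_num = next(
        (k for k in range(1, len(words)) if NUMBER.fullmatch(words[k])), None
    )
    if first_num is None or first_num < 2:
        raise ValueError(f"no unity followed by values in: {line!r}")
    return {
        "code": words[0],
        # Everything between the code and the unity.
        "description": " ".join(words[1:first_num - 1]),
        # The word right before the first number.
        "unity": words[first_num - 1],
        "values": [float(word) for word in words[first_num:]],
    }

row = parse_line("P03133 Auxiliar helper un 203.02 417.54 437.22 675.80")
print(row["description"], row["unity"], row["values"][2])

# Made-up line with a 7-digit code, for the second data source.
other = parse_line("1234567 Sample item m 1.50 2.00 3.25 4.00", code_pattern=r"\d{7}")
print(other["code"], other["unity"])
```

The search for the first number starts at index 1, so a purely numeric code in the first position is not mistaken for a value.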

Edo Akse
  • This solves the while, and thanks for the dictionary tip, but now the error has just moved to my append. When I try thelist.append(txtl[i+j]), it comes out as out of range again... – andrepadilha May 24 '21 at 12:21
  • just for my understanding, a line is always `[CODE] [DESCRIPTION TEXT BLOCK] [UNITY] [LIST OF MORE VALUES]`, and UNITY is always the first numeric thing after the description text block? – Edo Akse May 24 '21 at 12:29
  • Unity is a string, usually un, m, month. It's the thing that precedes the first numeric value in each line. You got the rest right! – andrepadilha May 24 '21 at 12:33
  • one quick question, the end of the `[LIST OF MORE VALUES]` is the end of the actual line, right? – Edo Akse May 24 '21 at 12:38
  • Yes, it goes: [CODE] [DESCRIPTION TEXT BLOCK] [UNITY] [VALUE1] [VALUE2] [VALUE3] [VALUE4] and then starts again in a new line below. – andrepadilha May 24 '21 at 12:40
  • I won't be using this exact code, but there's a lot in there that was useful in my solution. Thank you! – andrepadilha May 24 '21 at 19:30