
Hi all. How can I parse a text file by extracting specific values at fixed index positions, append the values to a list, and then convert the list to a pandas DataFrame? So far I have been able to write the code below. Text sample:

header:0RCPF049100000084220210407
body:1927907801100032G 00sucess
1067697546140032G 00sucess
1053756666000032G 00sucess
1321723368900032G 00sucess
1037673956810032G 00sucess

For example, the first line is the header, and from it I only need the date, which is at the following index positions: `date_from_header = linhas[0][18:26]`. The rest of the values are in the body lines.

import csv
import pandas as pd

headers = ["data_mov", "chave_detalhe", "cpf_cliente", "cd_clube",
           "cd_operacao","filler","cd_retorno","tc_recusa"]

# This is the actual code
with open('RCPF0491.20210407.1609.txt', "r") as f:
  linhas = [linha.rstrip() for linha in f.readlines()]
  for i in range(0,len(linhas)):
     data_mov = linhas[0][18:26]
     chave_detalhe=linhas[1][0:1]
     cpf_cliente=linhas[1][1:12]
     cd_clube=linhas[1][12:16]
     cd_operacao=linhas[1][16:17]
     filler=linhas[1][17:40]
     cd_retorno=linhas[1][40:42]
     tx_recusa=linhas[1][42:100]
data = [data_mov, chave_detalhe, cpf_cliente, cd_clube, cd_operacao, filler, cd_retorno, tx_recusa]

The intended result looks like this:

data_mov chave_detalhe cpf_cliente cd_clube cd_operacao filler cd_retorno tx_recusa
'20210407' '1'         92790780110 '0032'   'G'        'blank space' '00'   'sucesso'
'20210407' '1'         92790780110 '0032'   'G'        'blank space' '00'   'sucesso'
'20210407' '1'         92790780110 '0032'   'G'        'blank space' '00'   'sucesso'
Jayron Soares
  • This question is a bit hard to follow. Could you: post an example of filename.txt ? – SamBob Apr 09 '21 at 13:08
  • But already looking at your code : your `for loops` repeat the same thing (reading lines 0 and 1 from the filename.txt) over and over again (as you don't use the iterator variable, `i` inside the loops) – SamBob Apr 09 '21 at 13:10
  • But I expect your data is likely a csv or similar, and pandas has a function for reading that: `read_csv`. See: https://www.datacamp.com/community/tutorials/pandas-read-csv – SamBob Apr 09 '21 at 13:12
  • @SamBob thanks, I'm trying to figure out how to loop over the file and extract all values according to their index positions – Jayron Soares Apr 09 '21 at 13:25
  • Ah, so you are trying to extract data_mov from the first line, and then "chave_detalhe", "cpf_cliente", "cd_clube", "cd_operacao","filler","cd_retorno","tc_recusa" from each of the other lines? Ignoring the first line for now, does https://stackoverflow.com/a/10851479/1581658 help for splitting up the lines? – SamBob Apr 09 '21 at 13:34
  • @SamBob that is right, the date is unique for each file, it is in the header, the rest of the information is in the body. Thanks for the link – Jayron Soares Apr 09 '21 at 13:44

2 Answers


Using the approach from https://stackoverflow.com/a/10851479/1581658:

def parse_file(filename):
    indices = [0,1,12,16,17,18,20] # list the indices to split on
    parsed_data = [] # returned array by line
    with open(filename) as f:
        header = next(f) #skip the header
        data_mov = header[18:26] # and get data_mov from header
        for line in f: #loop through lines
            #split each line by the indices
            parts = [data_mov] + [line.rstrip()[i:j] for i,j in zip(indices, indices[1:]+[None])]
            parsed_data.append(parts)
    return parsed_data

print(parse_file("filename.txt"))
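To make the slicing step easier to follow, here is a minimal sketch of the same `zip(indices, indices[1:] + [None])` idea in isolation. The helper name `split_at` is made up for illustration, and the sample line is copied from the question as it is displayed there, so the index list only fits that displayed spacing.

```python
def split_at(line, indices):
    """Cut `line` into consecutive slices that start at each index in `indices`."""
    # Pair each start with the next start; the last slice runs to the end of the line.
    bounds = zip(indices, indices[1:] + [None])
    return [line[start:end] for start, end in bounds]


# Sample body line copied from the question (whitespace as displayed there).
sample = "1927907801100032G 00sucess"
fields = split_at(sample, [0, 1, 12, 16, 17, 18, 20])
print(fields)
# ['1', '92790780110', '0032', 'G', ' ', '00', 'sucess']
```

If the real file pads the filler out to a wider column, only the index list should need to change.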
SamBob

Thanks to SamBob for the help. Here is the final solution, in case anyone needs it:

import itertools
import pandas as pd

pd.options.display.width = 0

def parse_file(filename):
    indices = [0, 1, 12, 16, 17, 18, 42]  # start index of each field
    parsed_data = []  # one list of fields per body line
    with open(filename) as f:
        header = next(f)
        data_mov = header[18:26]  # date taken from the header line
        # islice(f, 1, 100) skips the first line after the header
        # and reads at most the next 99 lines
        for line in itertools.islice(f, 1, 100):
            # split the line according to the indices
            parts = [data_mov] + [line.rstrip()[i:j] for i, j in zip(indices, indices[1:] + [None])]
            parsed_data.append(parts)

    # convert to a DataFrame once, after all lines have been parsed
    cols = ['data_mov', 'chave_detalhe', 'cpf_cliente', 'cd_clube', 'cd_operacao', 'filler', 'cd_retorno', 'tx_recusa']
    df = pd.DataFrame(parsed_data, columns=cols)
    return df


df = parse_file("filename.txt")
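As a variation, the field positions can be kept next to the column names so the two cannot drift apart. The sketch below is not the code above: it uses the slice positions given in the question (0:1, 1:12, 12:16, 16:17, 17:40, 40:42, 42:100), and the names `LAYOUT` and `parse_lines` as well as the hand-padded body line are assumptions made for illustration.

```python
# Field layout taken from the slices in the question: name -> (start, end).
LAYOUT = {
    "chave_detalhe": (0, 1),
    "cpf_cliente": (1, 12),
    "cd_clube": (12, 16),
    "cd_operacao": (16, 17),
    "filler": (17, 40),
    "cd_retorno": (40, 42),
    "tx_recusa": (42, 100),
}


def parse_lines(lines):
    """Return one dict per body line, each carrying the date from the header."""
    rows = iter(lines)
    header = next(rows)
    data_mov = header[18:26]
    records = []
    for line in rows:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        record = {"data_mov": data_mov}
        for name, (start, end) in LAYOUT.items():
            record[name] = line[start:end]
        records.append(record)
    return records


# Hypothetical input: the header from the question plus one body line padded
# by hand so that the filler occupies positions 17 to 39.
example = [
    "0RCPF049100000084220210407\n",
    "1" + "92790780110" + "0032" + "G" + " " * 23 + "00" + "sucesso\n",
]
records = parse_lines(example)
print(records[0])
```

Passing `records` to `pd.DataFrame(records)` should then give one column per key, in the order the keys were inserted. An open file object can be passed in place of the `example` list, since the function only iterates over lines.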
Jayron Soares