
I have several .txt files that are formatted like this:

label1: value1 label2: string1 date: 2018-06-26 label3: value2 label4: string

I would like to read these files and create a database with the labels as headers and the values/strings beneath them, which I then write to a file. Any help? Regards

bruvio
  • Use pandas read_csv to read the text files and then merge all these into one dataframe – min2bro Jun 27 '18 at 11:41
  • Can you provide an example of what you have tried so far? – vielkind Jun 27 '18 at 11:49
  • dataset_cormat = pd.read_csv('cormat_out.txt', delimiter=" ", header=None, names=["shot", "user", "date",'seq','written by']), but it's not what I want as it cannot divide the data according to headers i set – bruvio Jun 27 '18 at 11:54
  • What is the separator between columns? Is it just space like between column name and value, or that is tab? If it is different from just space, you might find here the answer https://stackoverflow.com/questions/38366494/how-to-read-text-files-key-value-pair-using-pandas (just change | to tab and = to : ) – Leonid Mednikov Jun 27 '18 at 12:07

2 Answers


Looks like you have a mapping between identifier labels and values. You can convert this into a dictionary via standard Python:

from io import StringIO

mystr = StringIO("""label1: value1 label2: string1 date: 2018-06-26 label3: value2 label4: string""")

# replace mystr with open('file.csv', 'r')
with mystr as fin:
    data = next(fin).strip().split()
    data_dict = {i[:-1]: j for i, j in zip(data[::2], data[1::2])}

print(data_dict)

{'date': '2018-06-26',
 'label1': 'value1',
 'label2': 'string1',
 'label3': 'value2',
 'label4': 'string'}

From here there are many options, depending on the exact format in which you want to output your data, e.g. pandas, csv, etc. You need to provide more details for help with this step, but it is worth investigating those options first.
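As one possible way to take the csv route, here is a minimal sketch using only the standard library. The helper name `parse_line` and the sample lines are hypothetical, and it assumes every line carries the same labels and that no label or value contains a space:

```python
import csv
import io

def parse_line(line):
    """Turn 'label1: value1 label2: value2' into a dict (no spaces inside labels or values)."""
    tokens = line.split()
    return {label.rstrip(":"): value for label, value in zip(tokens[::2], tokens[1::2])}

# Stand-in for lines read from several text files (hypothetical sample data).
lines = [
    "label1: a label2: b date: 2018-06-26",
    "label1: c label2: d date: 2018-06-27",
]
records = [parse_line(line) for line in lines]

# Write to an in-memory buffer; swap in open('out.csv', 'w', newline='') for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```

The same list of dicts can also be passed straight to `pd.DataFrame(records)` if you prefer pandas.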

jpp

If the data looks exactly like this:

Age: 39 Name: Jack date: 2018-06-26 Region: NY Open: Yes
Age: 21 Name: Rose date: 2018-09-16 Region: TX Open: NO

You need to split each line on the spaces.

import pandas as pd

datalist = []
dlabels = []
with open('D:\\1.txt', 'r') as f:
    for line in f:
        # split() with no argument drops the trailing newline and copes with
        # repeated spaces, so the last value is never truncated
        words = line.split()
        if not words:
            continue  # skip blank lines
        if not dlabels:
            # labels sit at even positions; strip the trailing colon
            dlabels = [w[:-1] for w in words[::2]]
        # values sit at odd positions
        datalist.append(words[1::2])

data = pd.DataFrame(datalist, columns=dlabels)
print(data)

Output:
   Age  Name        date  Region  Open
0   39  Jack  2018-06-26      NY   Yes
1   21  Rose  2018-09-16      TX    NO
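Splitting on spaces misaligns the columns when a label itself contains a space, such as the "written by" header mentioned in the question's comments. A rough sketch of one way around that, using a regular expression instead of `split`, is shown here. The name `parse_pairs` and the sample values are made up for illustration, and it still assumes values contain no spaces:

```python
import re

# Lazy label up to the first colon, then a single whitespace-free value.
PAIR = re.compile(r"\s*([^:]+?):\s*(\S+)")

def parse_pairs(line):
    """Map each label (spaces allowed) to the value that follows its colon."""
    return {m.group(1): m.group(2) for m in PAIR.finditer(line)}

# Hypothetical line using the headers from the question's comment.
row = parse_pairs("shot: 92121 user: bruvio date: 2018-06-26 seq: 3 written by: bruvio")
```

A list of such dicts, one per line, can be handed to `pd.DataFrame` in the same way as `datalist` above.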

Hamid Mir
  • Thanks @DataScienceStep, that worked. I just had to edit the name of the label as it had a space. I am able to create the DataFrame! – bruvio Jun 27 '18 at 13:16