0

This is a looklike example of I data I have, but with much less lines.

So imagine I have a txt file like this:

'''
Useless information 1
Useless information 2
Useless information 3
Measurement:
Len. (cm)   :length of the object
Hei. (cm)   :height of the object
Tp.         :type of the object
~A DATA
10  5   2
8   7   2
5   6   1
9   9   1
'''

and I would like to put the values below '~A DATA' as a DataFrame. I already managed to get the DataFrame without column names (although it got a little messy as there are lines nonsense in my code) as you can see:

with open(r'C:\Users\Lucas\Desktop\...\text.txt') as file:
    for line in file:
        if line.startswith('~A'):
           measures = line.split()[len(line):]
           break

    df = pd.read_csv(file, names=measures, sep='~A', engine='python')

newdf = df[0].str.split(expand = True)

newdf()
    0  1  2
0  10  5  2
1   8  7  2
2   5  6  1
3   9  9  1

Now, I would like to put 'Len', 'Hei' and 'Tp' from the text as column names on the DataFrame. Just these measurement codes (without the consequent strings). How can I do that to have a df like this?

    Len  Hei  Tp
  0  10   5   2
  1   8   7   2
  2   5   6   1
  3   9   9   1

One of the solutions would be to split every line below the string 'Measurement' (or beginning with the line 'Len...') till every line above the string '~A' (or ending with line 'Tp'). And then split every line we got. But I don't know how to do that.

user026
  • 638
  • 4
  • 14
  • `df.columns = ['Len','Hei','Tp']` – Tserenjamts Nov 06 '19 at 04:03
  • Does this answer your question? [Renaming columns in pandas](https://stackoverflow.com/questions/11346283/renaming-columns-in-pandas) – Rick Nov 06 '19 at 04:06
  • Sorry, guys. I need to get the column names from the strings of the text as the original file has thousands lines and I can't do this writing one by one. – user026 Nov 06 '19 at 14:41

1 Answers1

2

Solution 1: If you want to scrape the the column names from the text file itself, than for that, you need to know, from which line the column name information is starting, and then read the file line-by-line and do the processing for the particular lines which you know have column names as text.

To answer you the specific question that you asked, let's assume variable line contains one of the strings, say line = Len. (cm) :length of the object, you could do regex based splitting, wherein, you split across any special symbol except digits and alphabets.

import re
splited_line = re.split(r"[^a-zA-Z0-9]", line) #add other characters which you don't want
print(splited_line)

This results in

['Len', ' ', 'cm', '   ', 'length of the object']

Further, to get the column name, you pick the first element from the list as splited_line[0]

Solution 2: If you already know the column names, you could just do

df.columns = ['Len','Hei','Tp']

Here is the complete solution to what you are looking for:

In [34]: f = open('text.txt', "rb") 
    ...: flag = False 
    ...: column_names = [] 
    ...: for line in f: 
    ...:     splited_line = re.split(r"[^a-zA-Z0-9~]", line.decode('utf-8')) 
    ...:     if splited_line[0] == "Measurement": 
    ...:         flag = True 
    ...:         continue 
    ...:     elif splited_line[0] == "~A": 
    ...:         flag = False 
    ...:     if flag == True: 
    ...:         column_names.append(splited_line[0]) 
Anant Mittal
  • 1,923
  • 9
  • 15
  • I appreciate for your time, @Anant Mittal. But I still need a different solution. The answer should be that I need to split every line below the string 'Measurement' (or beginning with the line 'Len...') till every line above the string '~A' (or ending with line 'Tp'). And then do your split for every line we got. Could you help me? – user026 Nov 06 '19 at 14:38
  • Yep. It would be a tremendous help ;) – user026 Nov 06 '19 at 17:21
  • Yeah. These lines aren't important – user026 Nov 06 '19 at 18:08
  • Give me sometime, I am at work. Just confirm, this is the exactly file structure right? ```Measurement: Len. (cm) :length of the object Hei. (cm) :height of the object Tp. :type of the object ~A DATA``` This is exactly what file has? – Anant Mittal Nov 07 '19 at 04:30
  • Yeah. Imagine the lines from "Useless Information 1" to "~A DATA" as the header of the file and the numbers below them as the values I want. And then I need to select and split the lines beginning with "Len...", "Hei..." and "Tp..." to convert the first strings in column names for a DataFrame. – user026 Nov 07 '19 at 12:41
  • Hi @LucasA, please check the solution. You may not require utf-8 conversion depending on how data is saved, remove in case required. Let me know if this works for you. – Anant Mittal Nov 08 '19 at 06:54
  • Also, if it works for you, kindly accept the solution. – Anant Mittal Nov 08 '19 at 09:07
  • That is it, sir! I'm very grateful for your help and the time you spent. Gratitude. – user026 Nov 08 '19 at 20:49