0

I have a file with lines that look like this:

"[36.147315849999998, -86.7978174] 6 2011-08-28 19:45:11 @maryreynolds85 That is my life, lol."

"[37.715399429999998, -89.21166221] 6 2011-08-28 19:45:41 Ate more veggie and fruit than meat for the first time in my life"

I tried stripping and splitting these lines, then stripping punctuation from every substring in the resulting list.

with open('aabb.txt') as t:
    for Line in t:
        splitline = Line.strip()
        splitline2 = splitline.split()
        for words in splitline2:
            words = words.strip("!#$%&'()*+,-./:;?@[\]^_`{|}~")
            words = words.lower()

What should I do to turn these lines into two lists that look like this:

'["36.147315849999998","-86.7978174","6","2011-08-28","19:45:11","maryreynolds85","that","is","my","life","lol"]'

'["37.715399429999998","-89.21166221","6","2011-08-28","19:45:41","ate","more","veggie","and","fruit","than","meat","for","the","first","time","in","my","life"]'
Saleem Ali
jane998
  • I don't know enough about Python, but could you use something from [Read a file line-by-line with python](https://stackabuse.com/read-a-file-line-by-line-in-python/) and combine it with `list = line.split(" ")`? – pensum Nov 05 '19 at 05:11
  • You're trying to read a TSV (Tab-Separated Value) file, which generically refers to whitespace-separated input (not just tabs). It also contains `[...]` brackets. – smci Nov 05 '19 at 05:11
  • Variable names should generally follow the `lowercase_with_underscores` style. – AMC Nov 05 '19 at 05:13
  • Related: [parsing a tab-separated file in Python](https://stackoverflow.com/questions/11059390/parsing-a-tab-separated-file-in-python) – smci Nov 05 '19 at 05:13
  • Where do these strings come from? What’s the general format, context, etc? – AMC Nov 05 '19 at 05:21

2 Answers

2

Is all your data in the same format? If so, use a regex from the `re` library.

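import re

your_str = "[36.147315849999998, -86.7978174] 6 2011-08-28 19:45:11 @maryreynolds85 That is my life, lol."
# Capture the two bracketed numbers and, as a third group, the rest of the line.
# The \s* after the comma keeps the leading space out of the second group.
reg_data = re.compile(r"\[(.*),\s*(.*)\] (.*)")
your_reg_grp = reg_data.match(your_str)
if your_reg_grp:
    print(your_reg_grp.groups())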

# This puts the two values inside the square brackets into the first two groups and everything after the brackets into the last group. You can split the last group with split(" ") and then build a new list.

grp1=your_reg_grp.groups()
grp2=grp1[-1].split(" ")

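Then combine grp1[:-1] and grp2, for example with list(grp1[:-1]) + grp2 (grp1 is a tuple, so it has to be converted before it can be concatenated with a list).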
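
Putting the pieces together, a minimal sketch might look like the following. The helper name `parse_line` and the use of `string.punctuation` as the strip set are assumptions for illustration, not part of the original answer, and the pattern assumes every line starts with two bracketed numbers.

```python
import re
import string

# Two bracketed numbers, then the rest of the line (assumed format).
LINE_RE = re.compile(r"\[(.*?),\s*(.*?)\]\s+(.*)")

def parse_line(line):
    """Hypothetical helper: return the tokens of one line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    lat, lon, rest = match.groups()
    # Strip punctuation from both ends of each remaining token and lowercase it.
    words = [w.strip(string.punctuation).lower() for w in rest.split()]
    # Drop tokens that consisted of nothing but punctuation.
    return [lat, lon] + [w for w in words if w]

line = "[36.147315849999998, -86.7978174] 6 2011-08-28 19:45:11 @maryreynolds85 That is my life, lol."
print(parse_line(line))
```

Because the coordinates are captured by the regex rather than stripped, the minus sign of the longitude survives.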

Atreyagaurav
  • Adding to @Atreyagaurav, the following RegEx is more explicit: https://regex101.com/r/QRux5E/1 – jayg_code Nov 05 '19 at 05:18
  • Nice one, that seems useful. I didn't want to spend too much time figuring out the exact regex, so I made a general one. – Atreyagaurav Nov 05 '19 at 05:20
  • @Atreyagaurav thank you for your help. I tried your answer, but it seems some punctuation is missed, e.g. ['6', '2011-08-28', '19:11:58', 'wahhhhhh', 'i', 'need', 'to', 'figure', 'out', 'what', 'to', 'do', 'wifff', 'my', 'life', '#lost']. The "#" in front of the word "lost" is supposed to be removed. Could you show me how to solve the problem in that case? I'm kind of new to Python. Thank you again for your help. – jane998 Nov 09 '19 at 03:05
  • If such punctuation is at the start or end, your code ``words.strip("!#$%&'()*+,-./:;?@[\]^_`{|}~")`` should work fine; use it for each item in your group, or write a function for that. If it can also be in the middle, you can write a function to remove those characters; it shouldn't be hard. – Atreyagaurav Nov 10 '19 at 04:53
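
The two cases mentioned in the last comment can be sketched roughly as follows. The helper names `strip_edges` and `remove_everywhere` are hypothetical; `string.punctuation` is the standard-library constant holding the ASCII punctuation characters, used here in place of the hand-written character set.

```python
import string

def strip_edges(word):
    """Remove punctuation only from the start and end of a word."""
    return word.strip(string.punctuation)

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def remove_everywhere(word):
    """Remove punctuation wherever it occurs, including the middle."""
    return word.translate(_DELETE_PUNCT)

print(strip_edges("#lost"))             # lost
print(strip_edges("2011-08-28"))        # 2011-08-28
print(remove_everywhere("2011-08-28"))  # 20110828
```

Stripping only the edges keeps dates and times such as `2011-08-28` intact, whereas removing punctuation everywhere collapses them.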
-1

You are already producing the words you need. You just have to create a list and append each word to it.

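with open('aabb.txt') as t:
    for line in t:
        word_list = []  # avoid naming this `list`, which would shadow the built-in
        for word in line.split():  # split() with no arguments also drops surrounding whitespace
            # Strip punctuation from both ends of the token, then lowercase it.
            word = word.strip(r"!#$%&'()*+,-./:;?@[\]^_`{|}~").lower()
            word_list.append(word)
        print(word_list)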

You can also collect the per-line lists into a list of lists and use it as needed.

with open('aabb.txt') as t:
    root_list = []  # one inner list per line of the file
    for line in t:
        temp_list = []
        for word in line.split():
            # Strip punctuation from both ends of the token, then lowercase it.
            word = word.strip(r"!#$%&'()*+,-./:;?@[\]^_`{|}~").lower()
            temp_list.append(word)
        root_list.append(temp_list)
    print(root_list)
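
One caveat with this approach: `-` is among the stripped characters, so a negative coordinate token such as `-86.7978174]` comes out as `86.7978174`. A possible workaround, sketched below under the assumption that the first two tokens of every line are the bracketed coordinates, is to clean those two tokens separately. The function name `tokenize` is made up for this example.

```python
import string

def tokenize(line):
    """Hypothetical helper: split a line, preserving the sign of the two coordinates."""
    tokens = line.split()
    # Assumption: the first two tokens are "[lat," and "lon]".
    coords = [t.strip("[],") for t in tokens[:2]]
    # Remaining tokens: strip punctuation from both ends and lowercase.
    words = [t.strip(string.punctuation).lower() for t in tokens[2:]]
    # Drop tokens that consisted of nothing but punctuation.
    return coords + [w for w in words if w]

line = "[37.715399429999998, -89.21166221] 6 2011-08-28 19:45:41 Ate more veggie and fruit than meat for the first time in my life"
print(tokenize(line))
```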