I have a list like this:

list1=['weather:monday="severe" weather:friday, xxx:sunday="calm" xxx:sunday="high severe", yyy:friday="rainy" yyy:saturday=']

What I want is a DataFrame like this:

column1   column2    column3
weather   Monday     severe
weather   Friday     
xxx       Sunday     calm
xxx       Sunday     high severe
yyy       Friday     rainy
yyy       Saturday   

First, I tried splitting the list on ':' like this:

newlist2 = [word for line in list1 for word in line.split(':')]
newlist2

['weather',
 'monday="severe" weather',
 'friday, xxx',
 'sunday="calm" xxx',
 'sunday="high severe", yyy',
 'friday="rainy" yyy',
 'saturday=']

and

newlist3 = [word for line in newlist2 for word in line.split('=')]
newlist3

['weather',
 'monday',
 '"severe" weather',
 'friday, xxx',
 'sunday',
 '"calm" xxx',
 'sunday',
 '"high severe", yyy',
 'friday',
 '"rainy" yyy',
 'saturday',
 '']

After that, I convert the list into a DataFrame:

df = pd.DataFrame(newlist3)

However, the outcome is not the desired one.

Any ideas on how to reach my desired outcome?

xavi

2 Answers

First, I would clean the data, something like:

cleaned = [x.replace('"', '').replace('high severe', 'high_severe') for x in list1]

Then do those steps you already did:

newlist2 = [word for line in cleaned for word in line.split(':')]
newlist3 = [word for line in newlist2 for word in line.split('=')]

And add a fourth step:

newlist4 = [word for line in newlist3 for word in line.split(' ')]

To make these steps more concise, you could look into re.split, which can split on several delimiters in one call.
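As a rough sketch of that idea, the character class below treats ':', '=', ',' and spaces as delimiters in a single pass. The sample string is an assumption, not the asker's data: a cleaned line in which every entry has a value. The names `cleaned_line`, `tokens` and `rows` are only illustrative.

```python
import re

# Hypothetical cleaned line (quotes removed, 'high severe' joined with an underscore);
# every entry is assumed to have a value after '='.
cleaned_line = 'weather:monday=severe weather:friday=harmful, xxx:sunday=calm xxx:sunday=high_severe'

# One or more of ':', '=', ',' or space counts as a single delimiter.
tokens = re.split(r'[:=, ]+', cleaned_line)

# Group the flat token list into rows of three.
rows = [tokens[i:i + 3] for i in range(0, len(tokens), 3)]
print(rows)
```

Entries without a value would leave fewer than three tokens for that row, so this only works while the every-entry-has-a-value assumption holds.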

After the fourth step, newlist4 is the flat list of tokens. You want this list in chunks of three, one chunk per row, so that each chunk supplies one value per column. You could use a function like this:

def divide_chunks(l, n):
    # yield successive slices of l, each n items long (the last one may be shorter)
    for i in range(0, len(l), n):
        yield l[i:i + n]

pd.DataFrame(list(divide_chunks(newlist4,3)))
>>>
 0         1             2
0   weather    monday        severe
1   weather    friday      harmful,
2  weather1    sunday          calm
3  weather1    sunday  high_severe,
4  weather2    friday         rainy
5  weather2  saturday        cloudy
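
A small run may help show what divide_chunks does and why it needs every entry to have a value. This is only a sketch: the token list is hypothetical, with the second entry ('weather', 'friday') deliberately missing its third item.

```python
def divide_chunks(l, n):
    # Step through the start indices 0, n, 2n, ... and yield one slice per step.
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Hypothetical token list where the second entry has no value.
tokens = ['weather', 'monday', 'severe', 'weather', 'friday', 'xxx', 'sunday', 'calm']

chunks = list(divide_chunks(tokens, 3))
for chunk in chunks:
    print(chunk)
# ['weather', 'monday', 'severe']
# ['weather', 'friday', 'xxx']
# ['sunday', 'calm']
```

The function only counts items, so one missing value shifts every later row. Inserting a placeholder token for empty cells keeps the rows aligned.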
DSteman
  • Hi, thank you for your answer. However, I made a mistake in my original list: for weather and weather2 I do not have values in column3. With your answer, weather1 from index 2 now ends up in column2 at index 1. Can you please adjust your code? – xavi Aug 23 '22 at 08:58
  • Also, can you please explain the code in the divide_chunks function? – xavi Aug 23 '22 at 08:59
  • It's explained in the link. Maybe try to adjust the code yourself? You'll have to insert some placeholder value for the cells that you want to keep empty. Otherwise there is no way to tell which cells should be empty. – DSteman Aug 23 '22 at 09:06
Here is a way you can do it. I still don't know whether you only have these few possible names in the first column or whether it can be anything. In case it is only these few, here's an updated version. You need to change xxx and yyy to the actual names in your data.

import re
import pandas as pd

# split the string right before each occurrence of 'weather', 'xxx' or 'yyy'; filter(None, ...) drops empty elements
lst = list(filter(None, [word.strip(' ,').replace('"','') for line in list1 for word in re.split(r"(?=weather|xxx|yyy)", line)]))

# build 'data' as a list of lists, where each inner list represents a row
data = []
for elem in lst:
    weather, other = elem.split(':')
    if '=' in other:
        day, forecast = other.split('=')
    else:
        day = other
        forecast = ''
    data.append([weather, day, forecast])

df = pd.DataFrame(data, columns= ['weather', 'day', 'forecast'])
print(df)

Output:

   weather       day     forecast
0  weather    monday       severe
1  weather    friday             
2      xxx    sunday         calm
3      xxx    sunday  high severe
4      yyy    friday        rainy
5      yyy  saturday               
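
If the names in the first column are not known in advance, a different technique (not the one used above) is to match each entry with a single pattern. The sketch below assumes that names and days contain only word characters and that values, when present, are wrapped in double quotes. `ENTRY` and `rows` are illustrative names.

```python
import re

list1 = ['weather:monday="severe" weather:friday, xxx:sunday="calm" xxx:sunday="high severe", yyy:friday="rainy" yyy:saturday=']

# name:day, optionally followed by ="value"; an unmatched value group comes back as ''.
ENTRY = re.compile(r'(\w+):(\w+)(?:="([^"]*)")?')

# One [name, day, value] row per match, across all strings in the list.
rows = [list(match) for line in list1 for match in ENTRY.findall(line)]
for row in rows:
    print(row)
```

The resulting rows can be passed to pd.DataFrame(rows, columns=['weather', 'day', 'forecast']) in the same way as data above.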
Rabinzel
  • Thank you for your answer. The list I provided does not reflect all the information in my actual list. Some entries do not start with weather but with other names like xxx or yyy. I updated my list in the question. Can you adjust your answer, please? – xavi Aug 23 '22 at 09:39
  • Do you know how the entries start? If it is a set of 3 or 4 possible words, we can still use my approach; if there are unlimited random words for the weather column, the approach of the other answer (like the one you tried) will be better. – Rabinzel Aug 23 '22 at 09:55
  • The problem here is with this line: weather, other = elem.split(':'). I get an error back: too many items to unpack, expected two – xavi Aug 23 '22 at 10:47
  • Probably the first split doesn't work properly on your data, so in the next step, splitting on the delimiter `:` gives a list of at least 3 elements, but only 2 of them are unpacked. – Rabinzel Aug 23 '22 at 11:03
  • @xavi You change your input data every time someone gives you an answer.. – DSteman Aug 23 '22 at 14:27
  • @DSteman Yes, I know. I did not realize the "hidden" aspects of my dataset. However, this is not going to happen again – xavi Aug 23 '22 at 14:53
  • Updated my answer. Hope it now fits your data. – Rabinzel Aug 23 '22 at 18:11
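
The unpacking error discussed in the comments can be reproduced with a short sketch. The element below is hypothetical: it stands for a string in which the first split missed a boundary because the name zzz was not in the pattern.

```python
# Hypothetical element in which two entries were not separated by the first split.
elem = 'weather:monday=severe zzz:friday'

parts = elem.split(':')
print(len(parts))  # 3

try:
    weather, other = elem.split(':')
except ValueError as exc:
    print(exc)  # too many values to unpack (expected 2)

# str.partition splits at the first ':' only and always returns three items.
weather, _, other = elem.partition(':')
print(weather)
print(other)
```

Using partition avoids the exception, but the second entry would still be buried in `other`, so the underlying fix is to make the first split recognise every name.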