-1

I get this data from a crawler and want to transfer it into a beautiful dataframe.

Now what I have is like:

+-----------+------------------------------------------+------------+---------------------+
| HotelName | RoomType                                 | RoomFloor  | RoomPrice           |
+-----------+------------------------------------------+------------+---------------------+
| Hotel1    | Standard,Standard,Standard,Deluxe,Deluxe | 10F,20F    | 100,105,108,200,205 |
| Hotel2    | Standard,Standard,Deluxe,Deluxe,Grande   | 30F,40F,50F| 90,95,250,240,300   |
+-----------+------------------------------------------+------------+---------------------+

What I want finally is like:

+-----------+----------+-----------+-----------+
| HotelName | RoomType | RoomFloor | RoomPrice |
+-----------+----------+-----------+-----------+
| Hotel1    | Standard | 10F       | 100       |
| Hotel1    | Standard | 10F       | 105       |
| Hotel1    | Standard | 10F       | 108       |
| Hotel1    | Deluxe   | 20F       | 200       |
| Hotel1    | Deluxe   | 20F       | 205       |
| Hotel2    | Standard | 30F       | 90        |
| Hotel2    | Standard | 30F       | 95        |
| Hotel2    | Deluxe   | 40F       | 250       |
| Hotel2    | Deluxe   | 40F       | 240       |
| Hotel2    | Grande   | 50F       | 300       |
+-----------+----------+-----------+-----------+

I am new to Python and I could not handle this. Can anyone help? Thanks a lot!

AlexHon
  • 15
  • 6
  • 1
    I think this is main problem mapping between `RoomFloor` and `RoomPrice`. How is possible know if in second row first 2 room price are in 30F and second 2 in 40F ? And no first 3 in 30F and last in 40F? – jezrael Jun 28 '17 at 07:12
  • Can you post your code? – Daniel Jun 28 '17 at 07:13
  • 4
    It may be easier to crawl the data differently than trying to transform the dataframe. – DeepSpace Jun 28 '17 at 07:15
  • if each type of room is on a seperate floor it might be done, but else this releation between RoomFloor and RoomType is unclear indeed. – Maarten Fabré Jun 28 '17 at 07:18
  • I use import.io to crawl those data because I am not a professional programmer. I tried to solve this when I setting up the rules in import.io, but failed. Now what I have is like this, which drives me crazy... – AlexHon Jun 28 '17 at 07:18
  • The total number of `RoomFloor` is related to the `RoomType`. Each `RoomType` has several quotes from various agents, therefore there are several `RoomPrice`. – AlexHon Jun 28 '17 at 07:22
  • @MaartenFabré - Yes, is necessary define mapping between `RoomType` - `RoomFloor` or `RoomPrice` - `RoomFloor`. And create general solution - I am afraid but it seems it is impossible. – jezrael Jun 28 '17 at 07:24
  • @AlexHon - I think you need define `RoomFloor` for each room separately like `30F,30F,40F,40F,50F`. But if data are parsed it can be only nice wish... – jezrael Jun 28 '17 at 07:30
  • I refer to [this post](https://stackoverflow.com/questions/38651008/splitting-multiple-columns-into-rows-in-pandas-dataframe) can solve the problem by mapping `RoomType` and `RoomPrice`, but I dont know what's next... I tried `pd.groupby()` methods but no solution comes so far... – AlexHon Jun 28 '17 at 07:34
  • @AlexHon - Solution for dataframe for `RoomType`, `RoomPrice` and `HotelName` is possible, because map for each row only one value of hotel. But maping `RoomFloor` is impossible, unfortunately. – jezrael Jun 28 '17 at 07:39
  • One question - is possible says in each floor is only one type of rooms? Is not possible mixed like in `10F` are `Standard` with `Deluxe` rooms? – jezrael Jun 28 '17 at 07:46
  • @jezrael unfortunately, sometimes they mix together... – AlexHon Jun 28 '17 at 07:48
  • @AlexHon I was afraid about it. So need change parser for define floor for each room separately. Is it possible? – jezrael Jun 28 '17 at 07:50
  • @jezrael I'm afraid not since the original website is ugly... but I will go back and check again. Thanks! – AlexHon Jun 28 '17 at 07:59
  • @AlexHon Are RoomTypes ordered? Your example is `Standard,Standard,Deluxe,Deluxe,Grande `, can it be `Standard,Deluxe,Standard, Grande,Deluxe`? – ayhan Jun 28 '17 at 08:01
  • @ayhan they are ordered, and their RoomFloor follow the order. – AlexHon Jun 28 '17 at 08:04
  • @AlexHon In that case my answer should work. – ayhan Jun 28 '17 at 08:06

3 Answers3

0

Solution if are define RoomFloors separately:

print (df)
  HotelName                                  RoomType            RoomFloor  \
0    Hotel1  Standard,Standard,Standard,Deluxe,Deluxe  10F,10F,10F,20F,20F   
1    Hotel2    Standard,Standard,Deluxe,Deluxe,Grande  30F,30F,40F,40F,50F   

             RoomPrice  
0  100,105,108,200,205  
1    90,95,250,240,300  

cols = ['RoomType','RoomFloor','RoomPrice']
a = df[cols].apply(lambda x: x.str.split(',', expand=True).stack()).reset_index(1, drop=True)
df = df.drop(cols, axis=1).join(a).reset_index(drop=True)
print (df)
  HotelName  RoomType RoomFloor RoomPrice
0    Hotel1  Standard       10F       100
1    Hotel1  Standard       10F       105
2    Hotel1  Standard       10F       108
3    Hotel1    Deluxe       20F       200
4    Hotel1    Deluxe       20F       205
5    Hotel2  Standard       30F        90
6    Hotel2  Standard       30F        95
7    Hotel2    Deluxe       40F       250
8    Hotel2    Deluxe       40F       240
9    Hotel2    Grande       50F       300
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

I tried to reproduce a DataFrame, which I suppose should be the same as posted:

import pandas as pd

raw_data = {'HotelName': ['Hotel1', 'Hotel2'],
            'RoomType': ['Standard,Standard,Standard,Deluxe,Deluxe', 'Standard,Standard,Deluxe,Deluxe,Grande'],
            'RoomFloor': ['10F,20F', '30F,40F,50F'],
            'RoomPrice': ['100,105,108,200,205', '90,95,250,240,300']}

data = pd.DataFrame(raw_data)

I guess the module 'orderedset' may help, and hopefully, the following code might solve your problem:

from ordered_set import OrderedSet # revise 'orderedset' to 'ordered_set'

cols_ordered = ['HotelName', 'RoomType', 'RoomFloor', 'RoomPrice']
data = data[cols_ordered]

data = data[['HotelName', 'RoomType', 'RoomFloor', 'RoomPrice']].applymap(lambda x: x.split(','))
dummies = data.applymap(lambda x: len(x)).apply(max, 1)

for i in range(len(data)):
    room_type, room_floor = data[['RoomType', 'RoomFloor']].iloc[i]
    type_floor_dict = dict(zip(OrderedSet(room_type), room_floor))
    data['RoomFloor'].iloc[i] = [type_floor_dict[t] for t in room_type]
    data['HotelName'].iloc[i] *= dummies[i]

new_data = [pd.DataFrame(data.loc[i].tolist(), index=cols_ordered).T for i in data.index]
new_data = pd.concat(new_data, ignore_index=True)

print(new_data)
AlexHon
  • 15
  • 6
mikeqfu
  • 329
  • 2
  • 10
-1

I came up with this solution please have a look

def func(row):
    dic = []
    RoomType = row['RoomType'].split(",")
    RoomPrice = row['RoomPrice'].split(",")
    RoomFloor = row['RoomFloor'].split(",")
    current_room_type = RoomType[0]
    j = 0
    for index, x in enumerate(RoomType):
        if current_room_type != x:
            j+=1
            current_room_type = x
        dic.append({"HotelName": row["HotelName"],"RoomType": x, "RoomPrice": RoomPrice[index], "RoomFloor": RoomFloor[j]})
    return dic

print pd.DataFrame(df.apply(func, axis=1).sum())
Akshay Kandul
  • 592
  • 4
  • 10