
I have a pandas DataFrame with a column whose cells each contain a lot of information, separated by custom delimiters. I want to split this information into separate columns. A sample cell looks like this:

price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600<br>rent[currency]<=>PLN<br>deposit<=>price<br>deposit<=><br>deposit[currency]<=><br>m<=>100<br>rooms_num<=>3<br>building_type<=>tenement<br>floor_no<=>floor_2<br>building_floors_num<=>4<br>building_material<=>brick<br>windows_type<=>plastic<br>heating<=>gas<br>build_year<=>1915<br>construction_status<=>ready_to_use<br>free_from<=><br>rent_to_students<=><br>equipment_types<=><br>security_types<=><br>media_types<=>cable-television<->internet<->phone<br>extras_types<=>balcony<->basement<->separate_kitchen

Note that near the end of this example there are also '<->' separators, which separate several features within a single field. I am OK with keeping them inside one column for now.

So my DataFrame looks something like this:

   A  B
0  1  price<=>price<br>price<=>3100<br>(...)
1  2  price<=>price<br>price<=>54000<br>(...)
2  3  price<=>price<br>price<=>135600<br>(...)

So the pattern I can see is that:

  • column names are between <br> and <=>

  • values are between <=> and <br>

Is there a clean way to do this in Python? Ideally, I would like a solution that splits each cell and puts all the values into columns. I could then set the column names manually.

The desired output would be like this:

   A  price   price[currency]  rent (...)
0  1  3100    PLN              600  (...)
1  2  54000   CZK              1000 (...)
2  3  135600  EUR              8000 (...)
Maciej
  • You also have `<->` in the string. Is this a typo, or are these something different? – kkawabat Aug 06 '19 at 18:37
  • Also, can you give us the desired output for your test string? It's not really clear how the data is formatted from just your description. – kkawabat Aug 06 '19 at 18:40
  • Please take a look at [How to create good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide a [mcve] with sample input, sample output, and what you've tried based on your own research – G. Anderson Aug 06 '19 at 18:59
  • Thank you for your replies. I have edited the question to add more detail; I hope this helps! – Maciej Aug 06 '19 at 19:22

1 Answer


Use the str.split() method to split the data on <br>, then split each resulting chunk on <=>.

str_ = 'price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600<br>rent[currency]<=>PLN<br>deposit<=>price<br>deposit<=><br>deposit[currency]<=><br>m<=>100<br>rooms_num<=>3<br>building_type<=>tenement<br>floor_no<=>floor_2<br>building_floors_num<=>4<br>building_material<=>brick<br>windows_type<=>plastic<br>heating<=>gas<br>build_year<=>1915<br>construction_status<=>ready_to_use<br>free_from<=><br>rent_to_students<=><br>equipment_types<=><br>security_types<=><br>media_types<=>cable-television<->internet<->phone<br>extras_types<=>balcony<->basement<->separate_kitchen'

# List of strings that each look like "<column><=><value>"
row_list = str_.split('<br>')

# Split each string on "<=>" to get a [column, value] pair
row_cleaned = [row.split('<=>') for row in row_list]

# Transpose the list of pairs into one tuple of columns and one tuple of values
column_list, vals_list = zip(*row_cleaned)
print(column_list)
print(vals_list)

column_list:

('price', 'price', 'price[currency]', 'rent', 'rent', 'rent[currency]', 'deposit', 'deposit', 'deposit[currency]', 'm', 'rooms_num', 'building_type', 'floor_no', 'building_floors_num', 'building_material', 'windows_type', 'heating', 'build_year', 'construction_status', 'free_from', 'rent_to_students', 'equipment_types', 'security_types', 'media_types', 'extras_types')

vals_list:

('price', '3100', 'PLN', 'price', '600', 'PLN', 'price', '', '', '100', '3', 'tenement', 'floor_2', '4', 'brick', 'plastic', 'gas', '1915', 'ready_to_use', '', '', '', '', 'cable-television<->internet<->phone', 'balcony<->basement<->separate_kitchen')
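
To get from these two tuples to the table shown in the question, the same two splits can be applied to every cell of column B. The following is only a minimal sketch under a few assumptions: `parse_cell` is a hypothetical helper name, the two sample rows are shortened versions of the question's data, and when a key repeats (as in price<=>price followed by price<=>3100) the later value is assumed to be the one that is wanted.

```python
import pandas as pd


def parse_cell(cell):
    """Hypothetical helper: turn 'key<=>value<br>key<=>value...' into a dict."""
    pairs = (
        chunk.split('<=>', 1)          # split only on the first '<=>'
        for chunk in cell.split('<br>')
        if '<=>' in chunk              # skip chunks that have no key/value separator
    )
    # Assumption: a later duplicate key overwrites an earlier one,
    # so 'price<=>price' is replaced by 'price<=>3100'.
    return {key: value for key, value in pairs}


# Shortened sample data modelled on the question
df = pd.DataFrame({
    'A': [1, 2],
    'B': [
        'price<=>price<br>price<=>3100<br>price[currency]<=>PLN<br>rent<=>price<br>rent<=>600',
        'price<=>price<br>price<=>54000<br>price[currency]<=>CZK<br>rent<=>price<br>rent<=>1000',
    ],
})

# One dict per row becomes one row of the new frame; keys become column names
parsed = pd.DataFrame(df['B'].map(parse_cell).tolist(), index=df.index)
result = df[['A']].join(parsed)
print(result)
```

All parsed values are still strings at this point, and empty fields such as deposit<=> come through as empty strings. Numeric columns can be converted afterwards, for example with pd.to_numeric.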
kkawabat