0

This is a column with data and non-ASCII characters:

Summary 1

United Kingdom - ��Global Consumer Technology - ��American Express 
United Kingdom - ��VP Technology - Founder - ��Hogarth Worldwide
Aberdeen - ��SeniorCore Analysis Specialist - ��COREX Group
London, - ��ED, Equit Technology, London - ��Morgan Stanley
United Kingdom - ��Chief Officer, Group Technology - ��BP

How can I split them and save the parts in different columns?

The code I used is:

import io
import pandas as pd

df = pd.read_csv("/home/vipul/Desktop/dataminer.csv", sep='\s*\+.*?-\s*')
df = df.reset_index()
df.columns = ["First Name", "Last Name", "Email", "Profile URL", "Summary 1", "Summary 2"]

df.to_csv("/home/vipul/Desktop/new.csv")
cs95
Vipul Rao

2 Answers

2

Say you have a Series like this:

s

0    United Kingdom - ��Global Consumer Technolog...
1    United Kingdom - ��VP Technology - Founder -...
2    Aberdeen - ��SeniorCore Analysis Specialist ...
3    London, - ��ED, Equit Technology, London - �...
4    United Kingdom - ��Chief Officer, Group Tech...
Name: Summary 1, dtype: object

Option 1
Expanding on this answer, you can split on non-ascii characters using str.split:

s.str.split(r'-\s*[^\x00-\x7f]+', expand=True)

                 0                                 1                  2
0  United Kingdom        Global Consumer Technology    American Express
1  United Kingdom           VP Technology - Founder   Hogarth Worldwide
2        Aberdeen    SeniorCore Analysis Specialist         COREX Group
3         London,      ED, Equit Technology, London      Morgan Stanley
4  United Kingdom   Chief Officer, Group Technology                  BP
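A minimal, self-contained sketch of Option 1, for anyone who wants to reproduce it. The `\ufffd` characters are only stand-ins for the unreadable non-ASCII characters in the question (the real characters are unknown), and the column names `Location`, `Title` and `Company` are hypothetical. The pattern also swallows whitespace before the hyphen, and a final `strip` removes whatever padding is left.

```python
import pandas as pd

# Stand-in data: "\ufffd\ufffd" plays the role of the unreadable non-ASCII
# characters from the question (an assumption; the real characters are unknown).
s = pd.Series(
    [
        "United Kingdom - \ufffd\ufffdGlobal Consumer Technology - \ufffd\ufffdAmerican Express",
        "United Kingdom - \ufffd\ufffdVP Technology - Founder - \ufffd\ufffdHogarth Worldwide",
        "London, - \ufffd\ufffdED, Equit Technology, London - \ufffd\ufffdMorgan Stanley",
    ],
    name="Summary 1",
)

# Split only where a hyphen is followed by one or more non-ASCII characters,
# so the plain " - " inside "VP Technology - Founder" is left alone.
parts = s.str.split(r"\s*-\s*[^\x00-\x7f]+", expand=True)

# Remove any leftover padding and give the columns readable (hypothetical) names.
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["Location", "Title", "Company"]

print(parts)
```

If this works on sample data but not on the real file, comparing `repr()` of one raw value with the strings above may show whether the separator characters are really what the pattern expects.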

Option 2
str.extractall + unstack:

s.str.extractall(r'([\x00-\x7f]+)')[0].str.rstrip('- ').unstack()

match               0                                1                  2
0      United Kingdom       Global Consumer Technology   American Express
1      United Kingdom          VP Technology - Founder  Hogarth Worldwide
2            Aberdeen   SeniorCore Analysis Specialist        COREX Group
3             London,     ED, Equit Technology, London     Morgan Stanley
4      United Kingdom  Chief Officer, Group Technology                 BP
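To tie this back to the goal of saving the pieces as separate columns, here is a rough end-to-end sketch using the `extractall` approach. It reads from an in-memory CSV instead of the path in the question, and the `Name` column, the `\ufffd` stand-in characters and the new column names are all assumptions made for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file; "\ufffd\ufffd" replaces the unknown
# non-ASCII characters and "Name" is a hypothetical extra column.
csv_text = (
    "Name,Summary 1\n"
    'a,"Aberdeen - \ufffd\ufffdSeniorCore Analysis Specialist - \ufffd\ufffdCOREX Group"\n'
    'b,"United Kingdom - \ufffd\ufffdChief Officer, Group Technology - \ufffd\ufffdBP"\n'
)
df = pd.read_csv(io.StringIO(csv_text))

# Pull out every run of ASCII characters, trim the trailing " - ",
# and pivot the matches into one column per piece.
pieces = (
    df["Summary 1"]
    .str.extractall(r"([\x00-\x7f]+)")[0]
    .str.strip("- ")
    .unstack()
)
pieces.columns = ["Location", "Title", "Company"]  # hypothetical names

# Replace the original column with the new ones and write the result out.
out = df.drop(columns="Summary 1").join(pieces)
buf = io.StringIO()
out.to_csv(buf, index=False)
print(buf.getvalue())
```

Swapping `buf` for a file path in `to_csv` would write the result to disk instead.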
cs95
  • @COLDSPEED split does not work on my system is there any other way. – Vipul Rao Feb 20 '18 at 10:29
  • @VipulRao Can you see my edit? Why is it that my answers don't work on only your machine? :/ – cs95 Feb 20 '18 at 10:33
  • @coldspeed yeah i saw the edit but it does not work on my pc pandas is up to date and python3 is also installed can someone help me with this. – Vipul Rao Feb 20 '18 at 10:36
  • 1
    @VipulRao How am I supposed to help you? Can I sudo into your machine and write your code for you? Cmon, please try and at least figure out _why_ it doesn't work. You did the same thing with your last question. – cs95 Feb 20 '18 at 10:39
  • @coldspeed yeah ill figure it out and thank you so much. – Vipul Rao Feb 20 '18 at 10:41
0

Another approach:

a
0   United Kingdom - ��Global Consumer Technolog...
1   United Kingdom - ��VP Technology - Founder -...
2   Aberdeen - ��SeniorCore Analysis Specialist ...
3   London, - ��ED, Equit Technology, London - �...
4   United Kingdom - ��Chief Officer, Group Tech...

Use this function to keep only the ASCII characters (those whose Unicode code point is below 128), using the built-in ord function:

def extract_ascii(x):
    string_list = filter(lambda y : ord(y) < 128, x)
    return ''.join(string_list)

Then apply it to the column:

df1.a.apply(extract_ascii).str.split('-', expand=True)

Here are the results:

                0                                1                 2                  3
0  United Kingdom       Global Consumer Technology  American Express               None
1  United Kingdom                    VP Technology           Founder  Hogarth Worldwide
2        Aberdeen   SeniorCore Analysis Specialist       COREX Group               None
3         London,     ED, Equit Technology, London    Morgan Stanley               None
4  United Kingdom  Chief Officer, Group Technology                BP               None
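Note that removing the non-ASCII characters before splitting makes a hyphen inside a field indistinguishable from a separator, which is why row 1 ends up with four parts. A small standard-library sketch of the difference, with `\ufffd` again standing in for the unknown non-ASCII characters (an assumption):

```python
import re

row = "United Kingdom - \ufffd\ufffdVP Technology - Founder - \ufffd\ufffdHogarth Worldwide"

# Approach A: drop non-ASCII characters first, then split on every hyphen.
ascii_only = "".join(ch for ch in row if ord(ch) < 128)
strip_then_split = [part.strip() for part in ascii_only.split("-")]

# Approach B: split only where a hyphen is followed by non-ASCII characters.
split_on_marker = [part.strip() for part in re.split(r"-\s*[^\x00-\x7f]+", row)]

print(len(strip_then_split), strip_then_split)
print(len(split_on_marker), split_on_marker)
```

Approach A yields four pieces for this row and approach B yields three, so which one is appropriate depends on whether "VP Technology - Founder" should stay together.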
Espoir Murhabazi