0

I am working on a data set which is like in the attached image below.

I have imported the data set which is in CSV in Python using pandas. I am looking to separate the entire data with all columns which has values like "a;b;c","lp;kl;jj" in the column PATR ( that is, values that has data with semi colon in it) into a CSV and other values like ";" and "250" into a separate csv. I have tried splitting the values based on the semi-colon and separating it based on the length but i am not getting the exact match.

Actual data set:

enter image description here

Expected Output 1 (All data which has PATR column with "ANY_DATA and a semicolon")

enter image description here

Expected Output 2 (All data which has PATR column with "only semi colon or only data")

enter image description here

Thanks in advance.

gbppa
  • 35
  • 1
  • 2
  • 10

2 Answers2

1

try this for your requirement :

 def pattern_matcher(y) :
    if y.count(';')<1 or y ==';':
        #case of the string doesn't contain any ';'
        return True
    else :
        """
        this will return True if it contain only ';' without any empty word preceding it , using 
        strip to check if it is only ';'
        """
        return False #all([not x.strip() for x in y.split(";")])

and then apply it to your data frame column

out2 = df2.loc[(df2.part.apply(pattern_matcher))]


part
2   ;
3   250

and

out1 = df2.loc[~(df2.part.apply(pattern_matcher))]

    part
0   A;B;C
1   ip;KL;JH
Espoir Murhabazi
  • 5,973
  • 5
  • 42
  • 73
  • Thank you so much for the help! But in either of the query i am still getting the row which has ';' along with the columns that has '; with the data'. – gbppa Feb 07 '18 at 04:16
  • @gbppa I've corrected it and improved my answer, let me know if I"m right – Espoir Murhabazi Feb 07 '18 at 20:49
0
mask = df['PATR'].str.contains(';^\w+(;\w+)*$', na=False)
df1 = df[mask]
df2 = df[~mask]

This will work for your test data. I pulled the regex from here.

noslenkwah
  • 1,702
  • 1
  • 17
  • 26
  • It depends on how the 250 is formatted. Since OP never posted any data, I assumed it as a number. Regardless, I fixed it to match 250 as a number or string. – noslenkwah Feb 06 '18 at 21:14
  • Thank you so much for the help! But in either of the query i am still getting the row which has ';' along with the columns that has '; with the data'. – gbppa Feb 07 '18 at 04:16
  • Post some actual data. As in copy-pastable code that will create a test dataframe with everything formatted as you have it. Screenshots don't work. – noslenkwah Feb 07 '18 at 14:22