Pandas DataFrame match word in URL

Question

I have a DataFrame created with pandas. One of its columns contains URLs, and I would like to match a particular word in them and count the number of occurrences.

My logic is that if the match does not return 'None', then at this stage print('Match'); however, that does not appear to work. Here is a sample of my data and current code. I would appreciate any tips on how to match a value using pandas, as I have just come back from using a lot of R and don't have much experience with pandas and DataFrames in Python.

Title,URL,Date,Unique Pageviews
Preparing and Starting DS career,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:242750,20-Jan-15,163
The Rogue Data Scientist,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:273425,4-May-15,1108
Is it safe to code after one bottle of wine?,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:349416,9-Nov-15,1736
Short-Term Forecasting of Electricity Demand,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:350421,12-Nov-15,1117
Visual directory of 339 tools. Wow!,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:373786,14-Jan-16,4228
8 Types of Data,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:377008,23-Jan-16,2829
Very funny video for people who write code,http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:379578,30-Jan-16,2444

Code Block (Pep8 Requires two line spaces between functions)

def count_set_words(as_pandas):
    reg_exp = re.match('\b/forum', as_pandas['URL']).any()
        if as_pandas['URL'].str.match(reg_exp, case=False, flags=0, na=np.NAN).any():
            print("Match")


def set_new_columns(as_pandas):
   titles_list = ['Year > 2014', 'Forum', 'Blog', 'Python', 'R',
               'Machine_Learning', 'Data_Science', 'Data', 'Analytics']
   for number, word in enumerate(titles_list):
       as_pandas.insert(len(as_pandas.columns), titles_list[number], 0)


def open_as_dataframe(file_name_in):
    reader = pd.read_csv(file_name_in, encoding='windows-1251')
    return reader


def main():
    multi_sets = open_as_dataframe('HDT_data5.txt')
    set_new_columns
    count_set_words(multi_sets)


main()

Can you add some data sample to question, [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) ? — jezrael, Jul 07 '19 at 13:07
Hey buddy, this seems quite simple but we need a sample data set and your expected output. This website was created to answer questions, but the onus is on you to give the good people here enough information to answer your question. — Umar.H, Jul 07 '19 at 13:22
Sorry, I don't seem to know how to separate the data from the code. :( — tcratius, Jul 07 '19 at 13:51
and you are right, it is simple, I don't know why I struggle with asking the question on stakes. — tcratius, Jul 07 '19 at 13:58
Only way I know how to separate two adjacent code blocks is to add a header or text in between. — Researcher, Jul 07 '19 at 14:18
This may be what you are after: https://stackoverflow.com/questions/15411158/pandas-countdistinct-equivalent — Researcher, Jul 07 '19 at 14:23
sigh, I did it again, I created new columns for the output and will check the link tomorrow @Researcher — tcratius, Jul 07 '19 at 14:24

score 1 · Accepted Answer · answered Jul 07 '19 at 14:26

`reg_exp` in the first line of `count_set_words` is not a regexp; it is the result of checking whether the elements in the URL column match '\b/forum'. I think something like:

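import re
import pandas as pd

df = pd.read_csv(file_name_in, encoding='windows-1251')
for ix, row in df.iterrows():
    # re.match anchors at the start of the string, hence the leading .*
    if re.match(r'.*(/forum).*', row['URL']) is not None:
        print('this is a match')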

Would solve your problem

or even simpler

df['is_a_match'] = df['URL'].apply(lambda url: re.match(r'.*(/forum).*', url) is not None)

Robin Nicole


I don't know how many times I've tried to read about Python regex, and I even bought a book, and it still doesn't make sense, though I think I get what you mean: re.compile is just a check of sorts and pretty much redundant unless you're writing a book, which I can't always afford :) And nice, I had never seen iterrows being used before. Sorry I was an ass earlier, I spend so much time trying to be good at this stuff to get a job, and it still hasn't happened; it plays on ya nerves after a while. – tcratius Jul 07 '19 at 21:54
A regexp can be seen as a directive. For example, '\b/forum' describes the text to look for; note that up to here I have not specified any string to search. Every time Python reads a regexp from a string, it translates it into a Python object, which takes time. If you use your regexp only a few times there is no problem, but if you use the same regexp 100 times you should precompile it so the Python object is not recreated every time you use the regexp. On the other hand, `re.match` uses an existing regexp (precompiled or not) to test a string of characters. – Robin Nicole Jul 07 '19 at 22:08
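A minimal sketch of the precompilation idea from the comment above. The pattern `/forum` and the second URL are assumptions chosen for illustration; the first URL is taken from the question's sample data. Note that the `re` module also keeps its own cache of recently used patterns, so the gain from precompiling is mostly noticeable in tight loops.

```python
import re

# Compile the pattern once and reuse the resulting object.
FORUM_PATTERN = re.compile(r'/forum')

urls = [
    # Taken from the question's sample data
    'http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:242750',
    # Hypothetical non-forum URL, added only for contrast
    'http://www.datasciencecentral.com/profiles/blogs/example',
]

# search() looks for the pattern anywhere in each string
matches = [FORUM_PATTERN.search(url) is not None for url in urls]
print(matches)
```

The compiled object exposes the same `match` and `search` methods as the module-level functions, so it can be dropped into existing code without other changes.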
I could figure out the for loop; however, I had to change None to 'None'. For the pandas answer that looks like a list comprehension, I can't seem to figure out how to implement it with a count (if it matches, add one to the count), so I will play with it later. Again, thank you. – tcratius Jul 07 '19 at 22:30
If you replace None by 'None', the condition in the if will always be true. To count the number of elements matched you can do `df['is_a_match'].sum()`. Otherwise it means your regexp is not correct. You should check whether your regexp behaves as expected on a few examples. – Robin Nicole Jul 07 '19 at 22:35
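The counting step mentioned above could be sketched as follows. This version swaps in `Series.str.contains` with `regex=False` instead of `apply` with `re.match`, which is an alternative and not the approach from the answer. The first two rows come from the question's sample data; the third row is hypothetical and added only so that not every row matches.

```python
import pandas as pd

df = pd.DataFrame({
    'Title': [
        'Preparing and Starting DS career',
        'The Rogue Data Scientist',
        'Some blog post',  # hypothetical row
    ],
    'URL': [
        'http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:242750',
        'http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:273425',
        'http://www.datasciencecentral.com/profiles/blogs/some-blog-post',  # hypothetical URL
    ],
})

# Boolean column: True where the URL contains the literal text '/forum'
df['is_a_match'] = df['URL'].str.contains('/forum', regex=False)

# True counts as 1 when summed, so the sum is the number of matching rows
forum_count = int(df['is_a_match'].sum())
print(forum_count)
```

Because the check is vectorized over the whole column, no explicit loop or counter variable is needed.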
I checked and you should replace `\b/forum` by `.*(/forum).*`. This will match every string that contains '/forum', while the other regexp could only match at the very start of the string, because `re.match` anchors there. I do not know what \b is for but it doesn't work here ;/ – Robin Nicole Jul 07 '19 at 22:58
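A small sketch of why the original pattern found nothing, using one URL from the question's sample data. It relies on two standard behaviours of Python: `re.match` only tries the pattern at the beginning of the string, and `'\b'` in a string literal without the `r` prefix is the backspace character, not a regex word boundary. The variable names are illustrative only.

```python
import re

url = 'http://www.datasciencecentral.com/forum/topic/show?id=6448529:Topic:242750'

# re.match only succeeds if the pattern matches at position 0
anchored = re.match(r'/forum', url)

# re.search scans the whole string for the pattern
anywhere = re.search(r'/forum', url)

# Padding the pattern with .* lets re.match reach '/forum'
padded = re.match(r'.*(/forum).*', url)

# Without the r prefix, '\b' is the backspace character (0x08)
backspace_pattern = '\b/forum'

print(anchored is None, anywhere is not None, padded is not None)
print(backspace_pattern[0] == '\x08')
```

Using `re.search` avoids the need for the surrounding `.*`, which is a design choice and not a requirement; both forms find '/forum' in this URL.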
Yeah, you are right, I just looked at my count and it matches the rows. You must have read my mind with the .*(/forum).* because the other regular expression was giving zero. So, in my hacking away, I had probably already stumbled upon the answer, just with the wrong reg_ex. PS: it worked. I really have to get around to thoroughly learning regex, MySQL, and a bazillion other things. Originally I tried /*(forum)*/ and it didn't like that. Oh well, thanks again. – tcratius Jul 08 '19 at 03:06
