Compare HTML from libcurl with text from file

Question

I'm using libcurl to connect to a website and fetch its HTML, and LibTidy to extract the text from it. My goal is to check whether a sentence from a text file appears in the HTML.

Thanks to LibTidy I have all the text as a single char*. I'm using char *strstr(const char *one, const char *two) to compare the two strings. The first is the string produced by libcurl and the LibTidy parsing, and the second is a string read from a text file.

When I call strstr(..) I get NULL as the result. The debugger shows me that the two strings aren't encoded in the same way.

I tried to find where the problem was in the string coming from the Internet connection, and I tried several code samples to fix it.

The code given on the libcurl website gives me the same problem: the char *memory isn't encoded correctly, and I can't compare it properly. https://curl.haxx.se/libcurl/c/getinmemory.html

I also tried the code here: https://stackoverflow.com/a/2329792/10160890, and the char *ptr has the same problem.

I expect to be able to compare the string from libcurl with the string from the text file.

Have you tried dumping the text you get back as hex so you can see the values of the characters? Are you sure that `strlen(in_str)` is returning the right value? Seems like a good task for a debugger so you can examine what's going on. – Retired Ninja May 17 '19 at 20:07
Yes, the debugger helped me a lot. I think the problem comes from the function that does the tidy parsing. – axel7083 May 17 '19 at 21:51
You should revert the edit. The new text is not an answerable question. – R.. GitHub STOP HELPING ICE May 18 '19 at 14:00
What is the character encoding of the text file? (It appears not to be compatible with ASCII so why have you referenced ASCII?) – Tom Blodget May 18 '19 at 16:08
Note: Despite debuggers being incredibly powerful, you should not expect one to know the character encoding of data in `char` data types. – Tom Blodget May 18 '19 at 16:11
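
The hex dump suggested in the comments above could look something like the following. This is only a sketch: `hex_bytes` is a hypothetical helper name, not part of libcurl or LibTidy, and it formats into a caller-supplied buffer so the result can be printed or compared.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s into out as space-separated
   two-digit hex values. Returns how many bytes of s were formatted
   (fewer than strlen(s) if out is too small). */
size_t hex_bytes(const char *s, char *out, size_t out_size)
{
    size_t len, used = 0, i;

    assert(s != NULL && out != NULL);
    if (out_size == 0)
        return 0;
    out[0] = '\0';
    len = strlen(s);
    for (i = 0; i < len; i++) {
        /* Reserve room for a separator, two hex digits and the terminator. */
        if (used + 4 > out_size)
            break;
        /* Cast to unsigned char so bytes >= 0x80 are not sign-extended. */
        used += (size_t)snprintf(out + used, out_size - used,
                                 i ? " %02x" : "%02x", (unsigned char)s[i]);
    }
    return i;
}
```

Dumping both the string from the page and the string from the file this way shows the actual bytes regardless of how the debugger chooses to display them. For instance, "é" is the two bytes `c3 a9` in UTF-8 but the single byte `e9` in Latin-1 or Windows-1252.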

score 0 · Accepted Answer · answered May 17 '19 at 20:13


There is no need to convert. Any ASCII text is UTF-8 text, so you just search for it as-is using strstr. This is pretty much the whole point of UTF-8.
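
As a small illustration of that point, here is a sketch in which `contains_bytes` is a hypothetical wrapper around `strstr` and the sample strings are made up. `strstr` compares raw bytes, so an ASCII needle is found inside UTF-8 text unchanged, while a non-ASCII needle only matches when both strings use the same encoding.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper: returns 1 if needle occurs byte-for-byte
   inside haystack, 0 otherwise. No encoding conversion is done. */
int contains_bytes(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    return strstr(haystack, needle) != NULL;
}
```

With a UTF-8 haystack such as `"Un caf\xc3\xa9 noir"`, searching for the ASCII word `"noir"` succeeds, and so does searching for `"caf\xc3\xa9"` (UTF-8 "café"). Searching for `"caf\xe9"` (the Latin-1 bytes for the same word) fails, which is the kind of mismatch a hex dump of both strings would reveal.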

R.. GitHub STOP HELPING ICE

The text from the file is encoded properly (as seen in the debugger), but the string from the tidy parsing isn't right; that's why strstr returns NULL. I would like to find a way to encode both the same way so I can compare them. – axel7083 May 17 '19 at 21:52
It looks to me like your debugger is just configured to think strings are latin1 or windows1252 or some other backwards encoding rather than UTF-8. – R.. GitHub STOP HELPING ICE May 17 '19 at 22:39
It's not only the debugger, because strstr returns NULL. I guess libcurl doesn't return the value in UTF-8. I tried the code from https://stackoverflow.com/questions/2329571/c-libcurl-get-output-into-a-string and the result is the same: in the debugger, the string isn't encoded correctly. – axel7083 May 18 '19 at 10:46
I think you're misdiagnosing your problem. – R.. GitHub STOP HELPING ICE May 18 '19 at 14:00