PHP Match text and remove entire line from text file

Question

I have a text file and I want to completely remove every line that contains certain characters. For example, in a text file like this one I want the lines that contain Chinese characters removed:

A.我不要这些汉字
Ok I see
有人会懂我写的吗？
Why not then?
我看够呛。
This is just an example

$myfile = "somtext.txt";
$handle = fopen($myfile, "r");
$book = fread($handle, filesize($myfile));
fclose($handle);

$book = preg_replace("/\p{Han}+/u","", $book);

echo nl2br($book);

With this code the Chinese characters are deleted, but the punctuation and any alphanumeric characters on those lines are left behind. Moreover, the lines themselves are still there. It ends up like this:

A.
Ok I see
？
Why not then?
。
This is just an example

But I need it to look like this:

Ok I see
Why not then?
This is just an example

EDIT: I want to do this before converting it to an array.

Or even [`/^.*?\p{Han}.*\n/mu`](https://regex101.com/r/ZtJdAu/2) — bobble bubble, Jul 24 '17 at 17:29
AbraCadaver yours works but it leaves a blank line there. I need it to look like I described. The \n needs to be removed too. bobble bubble that doesn't seem to match the lines. — Hasen, Jul 24 '17 at 17:31
At worst, you could make two passes: One to remove the characters, and another to remove blank lines. — Andy Lester, Jul 24 '17 at 18:02
@Hasen see updated [demo at eval.in](https://eval.in/837026) — bobble bubble, Jul 24 '17 at 18:15
Your code is the best now bobble bubble! Does everything I wanted, including removing the whole line. — Hasen, Jul 25 '17 at 05:35
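A minimal sketch of the line-removal idea from the comments above, applied to the sample text. The pattern is a variation on the one linked in the first comment, not necessarily the final version from the eval.in demo: it assumes a line may end with any newline sequence (`\R`) or with the end of the input, and the helper name `removeHanLines` is made up for illustration.

```php
<?php
// Hypothetical helper: drop every line that contains at least one Han character.
function removeHanLines(string $text): string
{
    // m: ^ matches at the start of each line; u: pattern and subject are UTF-8.
    // (?:\R|\z) consumes the line break as well, so no blank line is left behind.
    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace('/^.*\p{Han}.*(?:\R|\z)/mu', '', $text) ?? $text;
}

$book = "A.我不要这些汉字\nOk I see\n有人会懂我写的吗？\nWhy not then?\n我看够呛。\nThis is just an example\n";

echo removeHanLines($book);
```

Because the whole line, including its terminator, is part of the match, this runs on the file contents as one string, before any conversion to an array.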

score 2 · Answer 1 · answered Jul 24 '17 at 19:16

You wrote that you want to create an array of lines once the unwanted parts of your file have been removed. But you can build the array directly from the acceptable lines as they are loaded. This way you don't have to store the lines you don't want in memory.

To do that, write a generator that yields only the acceptable lines:

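function getLine($handle, $buffer = 2048, $sep = "\n") {
    // Read one line at a time; stream_get_line() returns false at end of file.
    while ( false !== $line = stream_get_line($handle, $buffer, $sep) ) {
        // Keep only non-empty lines made up entirely of non-Han characters.
        if ( preg_match('~^\P{Han}+$~u', $line) )
            yield $line;
    }
}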

$myfile = "somtext.txt";

if ( false === $handle = fopen($myfile, "r") )
    throw new Exception("Unable to open file '$myfile'\n");

$result = iterator_to_array(getLine($handle));

fclose($handle);

print_r($result);
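To see how the generator behaves on the sample from the question without touching the filesystem, here is a small sketch. Using a `php://memory` stream in place of the text file is an assumption made purely for illustration; the generator itself is repeated unchanged so the example is self-contained.

```php
<?php
// Same generator as in the answer, repeated so the example runs on its own.
function getLine($handle, $buffer = 2048, $sep = "\n") {
    while ( false !== $line = stream_get_line($handle, $buffer, $sep) ) {
        if ( preg_match('~^\P{Han}+$~u', $line) )
            yield $line;
    }
}

// Assumption for illustration: an in-memory stream stands in for the text file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "A.我不要这些汉字\nOk I see\n有人会懂我写的吗？\nWhy not then?\n我看够呛。\nThis is just an example\n");
rewind($handle);

$result = iterator_to_array(getLine($handle));
fclose($handle);

print_r($result);
```

Two details worth keeping in mind: because of the `+` quantifier the pattern also skips empty lines, and, as I understand the documentation, `stream_get_line()` stops after `$buffer` bytes even if no separator was found, so lines longer than 2048 bytes would be read in pieces unless the buffer is enlarged.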

This appears to be the most elegant approach on the page. — mickmackusa, Jan 30 '22 at 08:24

score 1 · Answer 2 · 2017-07-25T00:41:42.807

(Note: requires the Unicode flag.)

You could list the properties for all the CJK_xxx blocks.
Together, these blocks should cover most CJK ideographs plus the related radicals, strokes, and punctuation.
Han ideographs are not split into separate scripts for Chinese, Japanese, Korean, and Vietnamese;
the same characters are shared between these languages. I think Han alone might not be enough.

Also, the thing about blocks is that some unused code points within them
are reserved for future expansion, but the block range itself doesn't change.
Matching on blocks therefore future-proofs the pattern against Unicode updates when engine writers
do not update their script tables (which contain no unused code points) in a timely fashion.

Each of the expressions below is just a character class that matches a single character.

See http://www.unicode.org/faq/han_cjk.html for info on what CJK stands for.

To find/remove the line, use (?m)^.*?[class].*(?:\r?\n|\z)

[\p{Block=Kangxi_Radicals}\p{Block=CJK_Compatibility}\p{Block=CJK_Compatibility_Forms}\p{Block=CJK_Compatibility_Ideographs}\p{Block=CJK_Compatibility_Ideographs_Supplement}\p{Block=CJK_Radicals_Supplement}\p{Block=CJK_Strokes}\p{Block=CJK_Symbols_And_Punctuation}\p{Block=CJK_Unified_Ideographs}\p{Block=CJK_Unified_Ideographs_Extension_A}\p{Block=CJK_Unified_Ideographs_Extension_B}\p{Block=CJK_Unified_Ideographs_Extension_C}\p{Block=CJK_Unified_Ideographs_Extension_D}\p{Block=CJK_Unified_Ideographs_Extension_E}\p{Block=Enclosed_CJK_Letters_And_Months}]

or, just use the codepoint ranges.
This is for Unicode 9.

[\x{2E80}-\x{2FDF}\x{3000}-\x{303F}\x{31C0}-\x{31EF}\x{3200}-\x{4DBF}\x{4E00}-\x{9FFF}\x{F900}-\x{FAFF}\x{FE30}-\x{FE4F}\x{20000}-\x{2A6DF}\x{2A700}-\x{2CEAF}\x{2F800}-\x{2FA1F}]
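A minimal PHP sketch of how the template and the code point ranges above might be combined. As far as I know, the PCRE engine behind PHP's `preg_*` functions does not accept the `\p{Block=...}` syntax, so the sketch uses the ranges. The constant `CJK_CLASS` and the helper `removeCjkLines` are names invented for this example.

```php
<?php
// Character class built from the code point ranges listed in the answer.
const CJK_CLASS = '[\x{2E80}-\x{2FDF}\x{3000}-\x{303F}\x{31C0}-\x{31EF}\x{3200}-\x{4DBF}'
    . '\x{4E00}-\x{9FFF}\x{F900}-\x{FAFF}\x{FE30}-\x{FE4F}'
    . '\x{20000}-\x{2A6DF}\x{2A700}-\x{2CEAF}\x{2F800}-\x{2FA1F}]';

// Hypothetical helper applying the template (?m)^.*?[class].*(?:\r?\n|\z).
function removeCjkLines(string $text): string
{
    // Single quotes keep \x{...}, \r, \n and \z intact for PCRE; u enables UTF-8.
    $pattern = '/^.*?' . CJK_CLASS . '.*(?:\r?\n|\z)/mu';
    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '', $text) ?? $text;
}

$book = "A.我不要这些汉字\nOk I see\n。\nWhy not then?\n";
echo removeCjkLines($book);
```

Since the class includes the CJK Symbols and Punctuation range (U+3000 to U+303F), a line that holds nothing but `。` is removed as well.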

score 0 · Answer 3 · andrew · 2017-07-24T17:27:29.967

Here is a regex that will do what you're looking for:

.+?(?=[^\x00-\x7F]).*(?=\n)\n

If you want an example: https://regex101.com/r/U6ngPi/2

EDIT (Explanation):

Starting on a line, lazily match characters until a non-ASCII character lies directly ahead (.+?(?=[^\x00-\x7F])).

Match the rest of the line (.*).

Use a lookahead to check that a newline follows, then consume that newline so it is removed as well ((?=\n)\n).

edited Jul 24 '17 at 17:27

answered Jul 24 '17 at 17:22


This? $book = preg_replace("/.+?(?=[^\x00-\x7F]).*(?=\n)\n/","", $book); – Hasen Jul 24 '17 at 17:27
Yes, that is the PHP portion. – andrew Jul 24 '17 at 17:30
Ok then that removes absolutely everything, not just the Chinese. – Hasen Jul 24 '17 at 17:31
Yep! Did you check out the example link? It shows a live output with a PHP regex parser – andrew Jul 24 '17 at 17:32
Yeah I saw that but it didn't work like that for me. It removed all the text completely. – Hasen Jul 24 '17 at 17:34
The problem I'm guessing is that the regex is enclosed in double quotes. What does single quotes do for you? The problem there is PHP is evaluating \x00 instead of the PCRE parser – andrew Jul 24 '17 at 17:40
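The quoting guess in the last comment can be checked with a short sketch. It only inspects the double-quoted pattern rather than running it, because what PHP does with a pattern containing a raw NUL byte depends on the version: as far as I recall, versions before 8.2 reject such a pattern, in which case `preg_replace()` returns null and nothing is echoed, which would be consistent with what Hasen saw. Treat that explanation as plausible, not confirmed.

```php
<?php
// Double quotes: PHP itself expands \x00, \x7F and \n before PCRE sees the pattern.
$double = "/.+?(?=[^\x00-\x7F]).*(?=\n)\n/";
// Single quotes: the backslash sequences reach PCRE unchanged.
$single = '/.+?(?=[^\x00-\x7F]).*(?=\n)\n/';

var_dump(strpos($double, "\0") !== false); // does the double-quoted pattern hold a raw NUL byte?
var_dump(strpos($single, "\0") !== false); // does the single-quoted pattern?

$book = "A.我不要这些汉字\nOk I see\n有人会懂我写的吗？\nWhy not then?\n我看够呛。\nThis is just an example\n";

// Run only the single-quoted pattern, as suggested in the comment.
$clean = preg_replace($single, '', $book);
echo $clean;
```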

score 0 · Answer 4 · answered Jul 24 '17 at 17:56

Try this pattern: "/(.)\p{Han}+(.)\n/uD"

It matches one other (non-Chinese) character on each side of the run of Chinese characters, as well as the newline at the end.

Rob Anthony
