
I have a bunch of names from the web (first and last names of people in different countries). Some of the countries have statistics on how many people have each last name, as shown in some places like here.

Well, that Japanese surname list only covers the top 100. I have other lists, like one for Vietnamese listing the top 20, and others listing the top 50 or even 1000 in some places. But my real name lists contain 1000+ entries, so I might have 2000 Japanese surnames, with only 100 of them having an actual count of people with that surname.

What I would like to do is build a "faker" sort of library that generates realistic names based on these statistics. I know how to pick a random element from a weighted array in JavaScript, so once the "weights" (the number of people with each name) are included for each name, it is just a matter of plugging them into that algorithm (a rough sketch of what I mean follows).
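
For reference, the weighted pick I have in mind is the usual cumulative-sum approach, roughly like the sketch below (the entries are just illustrative):

/**
 * Picks a name at random, proportionally to its weight (cumulative-sum approach).
 * @param {[name: string, weight: number][]} entries
 */
function pickWeighted(entries) {
    const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
    let r = Math.random() * total;
    for (const [name, weight] of entries) {
        r -= weight;
        if (r <= 0) return name;
    }
    return entries[entries.length - 1][0]; // guard against floating-point drift
}

// e.g. pickWeighted([['Nguyen', 38], ['Tran', 11], ['Le', 9.5]])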

My question is: how can I "complete the curve" for the names that don't have a weight? That is, say we have a roughly exponential-looking curve formed from the 20 or 100 names that do have weights. I would then like to randomly pick names from the remaining unweighted list and give each one a value that places it somewhat realistically in the remaining tail of the curve. How can that be done?

For example, here is a list of Vietnamese names with weights:

Nguyen,38
Tran,11
Le,9.5
Pham,7.1
Huynh,5.1
Phan,4.5
Vu,3.9
Đang,2.1
Bui,2
Do,1.4
Ho,1.3
Ngo,1.3
Duong,1
Ly,0.5

And here is a list without weights:

An
Ân
Bạch
Bành
Bao
Biên
Biện
Cam
Cảnh
Cảnh
Cao
Cái
Cát
Chân
Châu
Chiêm
Chu
Chung
Chử
Cổ
Cù
Cung
Cung
Củng
Cừu
Dịch
Diệp
Doãn
Dũ
Dung
Dư
Dữu
Đái
Đàm
Đào
Đậu
Điền
Đinh
Đoàn
Đồ
Đồng
Đổng
Đường
Giả
Giải
Gia
Giản
Giang
Giáp
Hà
Hạ
Hậ
Hác
Hàn
Hầu
Hình
Hoa
Hoắc
Hoạn
Hồng
Hứa
Hướng
Hy
Kha
Khâu
Khổng
Khuất
Kiều
Kim
Kỳ
Kỷ
La
Lạc
Lai
Lam
Lăng
Lãnh
Lâm
Lận
Lệ
Liên
Liêu
Liễu
Long
Lôi
Lục
Lư
Lữ
Lương
Lưu
Mã
Mạc
Mạch
Mai
Mạnh
Mao
Mẫn
Miêu
Minh
Mông
Ngân
Nghê
Nghiêm
Ngư
Ngưu
Nhạc
Nhan
Nhâm
Nhiếp
Nhiều
Nhung
Ninh
Nông
Ôn
Ổn
Ông
Phí
Phó
Phong
Phòng
Phù
Phùng
Phương
Quách
Quan
Quản
Quang
Quảng
Quế
Quyền
Sài
Sầm
Sử
Tạ
Tào
Tăng
Tân
Tần
Tất
Tề
Thạch
Thai
Thái
Thang
Thành
Thảo
Thân
Thi
Thích
Thiện
Thiệu
Thôi
Thủy
Thư
Thường
Tiền
Tiết
Tiêu
Tiêu
Tô
Tôn
Tôn
Tông
Tống
Trác
Trạch
Trại
Trang
Trầm
Trâu
Trì
Triệu
Trịnh
Trương
Từ
Tư
Tưởng
Úc
Ứng
Vạn
Văn
Vân
Vi
Vĩnh
Vũ
Vũ
Vương
Vưu
Xà
Xầm
Xế
Yên

I would like to randomize the list without weights (easy to do), and then assign each one a weight so it fills out the tail of the curve to some degree and feels somewhat realistic. How can this be done? Basically, it seems we need to get the "curvature" of the initial weighted curve and then somehow extend it with new items. It doesn't need to be perfect; whatever approximates it would be cool. I am not a statistics/math person, so I don't really know where to begin on this one.

I don't have an exact outcome I want, I just want something that will generate the tail of the curve to some degree. For example, the start of the list might look like this:

An,0.5
Ân,0.45
Bạch,0.42
Bành,0.40
Bao,0.39
...

To visually demonstrate what I'm going after: the black boxes below are the provided data, and the dotted boxes are what we would fill in so that the tail continues the shape of the start of the curve. The dotted boxes would stretch on for a long while; here I only show the start of them.

▐
▐
▐▐
▐▐
▐▐
▐▐▐
▐▐▐ 
▐▐▐▐ 
▐▐▐▐▐▐ 
▐▐▐▐▐▐▐▐▐▐
▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐
▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐░░░░
▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐░░░░░░░░░░░░░░░░░░░░░
▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐▐░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

So basically, the left side of the curve holds the few highest values. Moving to the right, the values get smaller according to "some" pattern. We just need to roughly continue that pattern to the right, so that it extends the curve.

Lance

2 Answers


I'm not fluent in JavaScript, so treat any code below as a rough sketch rather than polished JS, but I can point you to sources for more information and provide the outline of a solution.

Basically, deriving the information you want has a complexity of O(C²N) according to this answer, or O(n³) according to this one, but once you've found the derivation, the rest is relatively easy. If you'd like to derive it yourself in JavaScript, you can use TensorFlow.js. Otherwise, use Kelvin Schoofs' solution based on the Wolfram MathWorld equations.

After that, it's just a matter of taking the derived equation and plugging in the new "x" values, as Kelvin has done.

Note that distributions of names can differ depending on a number of factors, so you might want to pay attention to that. The factors can be summarized by an integer K, according to this paper. If I've understood correctly, the best fit for your data would follow a Yule-Simon distribution, which you can probably use to get a more accurate distribution of the "created" data.

That is:

f(k;ρ) can be used to model, for example, the relative frequency of the kth most frequent word in a large collection of text, which according to Zipf's law is inversely proportional to a (typically small) power of k.
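
Very roughly (keeping in mind I'm not fluent in JS, so treat this as a sketch rather than working code), that pmf can be computed without any gamma or beta functions, since f(1;ρ) = ρ/(ρ+1) and f(k+1;ρ)/f(k;ρ) = k/(k+ρ+1):

// Yule-Simon pmf f(k; rho) = rho * B(k, rho + 1), built up iteratively from
// f(1; rho) = rho / (rho + 1) via the ratio f(k+1) / f(k) = k / (k + rho + 1).
function yuleSimonPmf(k, rho) {
    let f = rho / (rho + 1);
    for (let j = 1; j < k; j++) {
        f *= j / (j + rho + 1);
    }
    return f;
}

// rho = 1 is only an illustrative value; it would have to be fitted to the data.
for (let k = 1; k <= 5; k++) console.log(k, yuleSimonPmf(k, 1));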

Or, according to this paper, names can be modeled as

P(n) = a·n^(−b)·e^(−(n/c)^d), where P(n) represents the proportion of surnames whose sizes are no less than n, b is the power exponent, c is the cutoff size of the power-law part, and d is the stretch parameter in the stretched exponential function.
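
To get a feel for that shape, here is a rough sketch of the formula typed out; a, b, c, and d are placeholder values, not parameters fitted to real surname data:

// P(n) = a * n^(-b) * exp(-(n/c)^d), with made-up parameters for illustration only.
const a = 1, b = 0.8, c = 500, d = 1.5;
const P = (n) => a * Math.pow(n, -b) * Math.exp(-Math.pow(n / c, d));

for (const n of [1, 10, 100, 1000]) console.log(n, P(n));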

Also, if you have access, this paper might provide more information, but I don't have access, so I can't actually verify its utility. I think it can be summarized as saying that surnames can be modeled according to a Pareto distribution, in which case you could plug your names into that distribution. I believe you would use the corresponding probability density function.


I'm no mathematician, so I've simply fitted the data to a y=A*x^B equation using these equations, although Wolfram has some others that might fit your data better. Perhaps some papers around the distribution of (sur)names might hint at a better equation.

Nonetheless, the current prediction doesn't seem too bad:

/** @type {[name: string, weight: number][]} */
const KNOWN_NAMES = [
    ['Nguyen', 38],
    ['Tran', 11],
    ['Le', 9.5],
    ['Pham', 7.1],
    ['Huynh', 5.1],
    ['Phan', 4.5],
    ['Vu', 3.9],
    ['Đang', 2.1],
    ['Bui', 2],
    ['Do', 1.4],
    ['Ho', 1.3],
    ['Ngo', 1.3],
    ['Duong', 1],
    ['Ly', 0.5],
]

/** @type {string[]} */
const UNKNOWN_NAMES = [];
for (let i = 0; i < 20; i++) UNKNOWN_NAMES[i] = `Unknown${i}`;

/**
 * Predicts the A and B for y=A*x^B (y=A*(x**B) in JS notation).
 * Based on https://mathworld.wolfram.com/LeastSquaresFittingPowerLaw.html
 * @param {number[]} data
 * @returns {[A: number, B: number]}
 */
function expCurveFit(data) {
    const n = data.length;
    let sum_ln_xy = 0;
    let sum_ln_x = 0;
    let sum_ln_y = 0;
    let sum_ln_x_pow = 0;
    for (let i = 1; i <= n; i++) {
        const x = i;
        const y = data[x - 1];
        sum_ln_xy += Math.log(x) * Math.log(y);
        sum_ln_x += Math.log(x);
        sum_ln_y += Math.log(y);
        sum_ln_x_pow += Math.log(x) ** 2;
    }
    const b_nom = (n * sum_ln_xy) - (sum_ln_x * sum_ln_y);
    const b_den = (n * sum_ln_x_pow) - (sum_ln_x ** 2);
    const b = b_nom / b_den;
    const a = (sum_ln_y - b * sum_ln_x) / n;
    return [Math.exp(a), b];
}

// Calculating the prediction function
const [A, B] = expCurveFit(KNOWN_NAMES.map(([, w]) => w));
console.log(`Fit: A=${A} B=${B}`);
/** @param {number} index */
const predict = (index) => A * ((index + 1) ** B);

// Show prediction results
console.log('== Known weights ==');
KNOWN_NAMES.forEach(([name, expected], index) => {
    const predicted = predict(index);
    console.log(`- #${index}: ${expected} VS ${predicted} => ${predicted - expected}`);
});
console.log('== Predicted tail ==');
for (let i = 0; i < 10; i++) {
    const index = KNOWN_NAMES.length + i;
    console.log(`- #${index}: ${predict(index)}`);
}
console.log('...');
console.log(`- #${2000}: ${predict(2000)}`);
console.log('...');
console.log(`- #${5000}: ${predict(5000)}`);

// Shuffle UNKNOWN_NAMES, otherwise if your array is alphabetically sorted, names that
// are alphabetically "higher" will get higher weights, which wouldn't look natural.
// Based on https://stackoverflow.com/a/2450976/14274597
for (let index = UNKNOWN_NAMES.length; index;) {
    const random = Math.floor(Math.random() * index--);
    [UNKNOWN_NAMES[index], UNKNOWN_NAMES[random]] = [UNKNOWN_NAMES[random], UNKNOWN_NAMES[index]];
}

/** @type {[name: string, weight: number][]} */
const NAME_WEIGHTS = [
    // If we want to keep the original weights
    ...KNOWN_NAMES,
    // Now our predicted weights (offset by the number of KNOWN_NAMES entries)
    ...UNKNOWN_NAMES.map((name, i) => [name, predict(i + KNOWN_NAMES.length)]),
];

console.log('== Weighted names ==');
for (const [name, weight] of NAME_WEIGHTS) {
    console.log(`- ${name}${' '.repeat(30 - name.length)} ${weight}`);
}

The snippet's console cuts off a lot of the logging, but it does display the full NAME_WEIGHTS part. I'll talk about the results below anyway.

For the provided data, these are the predictions vs the original weights:

- #0: 38 VS 43.69867297773659 => 5.698672977736592
- #1: 11 VS 15.989864279951354 => 4.9898642799513535
- #2: 9.5 VS 8.880479579268258 => -0.6195204207317424
- #3: 7.1 VS 5.850881554721921 => -1.2491184452780786
- #4: 5.1 VS 4.233113801814277 => -0.866886198185723
- #5: 4.5 VS 3.2494731198295947 => -1.2505268801704053
- #6: 3.9 VS 2.5984310535459065 => -1.3015689464540934
- #7: 2.1 VS 2.140907162689773 => 0.04090716268977301
- #8: 2 VS 1.8046982250005454 => -0.19530177499945456
- #9: 1.4 VS 1.548946697010319 => 0.14894669701031904
- #10: 1.3 VS 1.34896041963157 => 0.04896041963157005
- #11: 1.3 VS 1.1890208701279552 => -0.11097912987204483
- #12: 1 VS 1.0586914703306205 => 0.0586914703306205
- #13: 0.5 VS 0.9507968333083715 => 0.4507968333083715

Far from perfect, but it's a good start. Besides, for the known names, you can use the original weights. The predicted tail looks quite alright though:

- #14: 0.8602568021432264
- #15: 0.7833833989610162
- #16: 0.717440740163343
- #17: 0.6603605491345175
- #18: 0.6105530322360989
- #19: 0.5667780226345167
- #20: 0.5280552551540235
- #21: 0.4936006647141039
- #22: 0.462780366132426
- #23: 0.4350768809172298
- #24: 0.4100639959533769
- #25: 0.38738780313885823
- #26: 0.3667522293417049
- #27: 0.3479078719427935
- #28: 0.33064329762683886
- #29: 0.3147781974794336
- #30: 0.30015795563424275
- #31: 0.2866493047726054
- #32: 0.27413682483865165
- #33: 0.2625201014677103
...
- #2000: 0.000711597758736851
...
- #5000: 0.0001884684445732886

Here's the final result of reusing/predicting weights and assigning them to names:

- Nguyen                         38
- Tran                           11
- Le                             9.5
- Pham                           7.1
- Huynh                          5.1
- Phan                           4.5
- Vu                             3.9
- Đang                           2.1
- Bui                            2
- Do                             1.4
- Ho                             1.3
- Ngo                            1.3
- Duong                          1
- Ly                             0.5
- Unknown9                       0.8602568021432264
- Unknown17                      0.7833833989610162
- Unknown10                      0.717440740163343
- Unknown1                       0.6603605491345175
- Unknown7                       0.6105530322360989
- Unknown16                      0.5667780226345167
- Unknown18                      0.5280552551540235
- Unknown12                      0.4936006647141039
- Unknown14                      0.462780366132426
- Unknown15                      0.4350768809172298
- Unknown6                       0.4100639959533769
- Unknown13                      0.38738780313885823
- Unknown0                       0.3667522293417049
- Unknown2                       0.3479078719427935
- Unknown11                      0.33064329762683886
- Unknown3                       0.3147781974794336
- Unknown19                      0.30015795563424275
- Unknown8                       0.2866493047726054
- Unknown5                       0.27413682483865165
- Unknown4                       0.2625201014677103

Mind that I shuffled the unknown names first; otherwise you'd assign the highest predicted weight to Unknown0, the 2nd highest to Unknown1, ..., which would feel very unnatural. E.g. if you did that with an alphabetically sorted array, names starting with A would be very common while names starting with Z would be the rarest.

Again, the sudden jump between Ly (0.5) and Unknown9 (0.86) shows the inaccuracy of the fitted curve, but then again, realism doesn't require a perfect distribution of names.
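
If that jump bothers you, one cheap tweak (my own addition, not part of the Wolfram fit) would be to rescale the predicted tail so that it starts at or below the smallest known weight:

// Optional: scale the predicted tail so the first predicted weight doesn't
// exceed the smallest known weight (0.5 for Ly in this data set).
const lastKnownWeight = KNOWN_NAMES[KNOWN_NAMES.length - 1][1];
const firstPredicted = predict(KNOWN_NAMES.length);
const tailScale = Math.min(1, lastKnownWeight / firstPredicted);
/** @param {number} index */
const predictScaled = (index) => predict(index) * tailScale;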

Kelvin Schoofs