• 5 Posts
  • 19 Comments
Joined 2 years ago
cake
Cake day: June 11th, 2023

help-circle







  • Very interesting. A trove of experience and practical knowledge.

    They were able to anticipate most of the loss scenarios in advance because they saw that logical arguments were not prevailing; when that happens, ““there’s only one explanation and that’s an alternative motive””. His ““number one recommendation”” is to ensure, even before the project gets started, that it has the right champion and backing inside the agency or organization; that is the real determiner for whether a project will succeed or fail.

    Not very surprising, but still tragic and sad.


  • I love Nushell in Windows Terminal with Starship as an evolution and a leap of shell. Structured data, native format transformations, strong querying capabilities, expressive state information.

    I was surprised that the linked article went an entirely different direction. It seems mainly driven by mouse interactions, but I think it has interesting suggestions or ideas even if you disregard mouse control or make it optional.







  • I would separate concerns. For the scraping, I would dump data as json onto disk. I would consider the folder structure I put them into, whether as individual files, or a JSON document per line in bigger files for grouping. If the website has good URL structure, the path could be useful for speaking author and or id identifiers in folders or files.

    Storing json as text is simple. Depending on the amount, storing plain text is wasteful, and simple text compression could significantly reduce storage size. For text-only stories it’s unlikely to become significant though, and not compressing makes the scraping process, and potentially validating completeness of scraped data simpler.

    I would then keep this data separate from any modifications or prototyping I would do regarding modification or extension of data and presentation/interfacing.







  • and include expensive endpoints like git blame, every page of every git log, and every commit in your repository. They do so using random User-Agents from tens of thousands of IP addresses, each one making no more than one HTTP request, trying to blend in with user traffic.

    That’s insane. They also mention crawling happening every 6 hours instead of only once. And the vast majority of traffic coming from a few AI companies.

    It’s a shame. The US won’t regulate - and certainly not under the current administration. China is unlikely to.

    So what can be done? Is this how the internet splits into authorized and not? Or into largely blocked areas? Maybe responses could include errors that humans could identify and ignore but LLMS would not to poison them?

    When you think about the economic and environmental cost of this it’s insane. I knew AI is expensive to train and run. But now I have to consider where they leech from for training and live queries too.